
dbeavon3
Memorable Member

Queued time is elevated - can this trigger a notebook failure?

A series of notebooks ran from 3 AM to 7 AM.  Only one of them failed.

 

[screenshot: dbeavon3_2-1736987508804.png]

 

The most obvious issue in that failed notebook (aside from the errors that appear in stderr) is the elevated duration of time spent at the "queued" status.  I'm not totally sure whether this elevated duration was just an additional symptom of the failed notebook, or if the long time in the queue is what CAUSED the notebook to fail.

 

The failed notebook spent almost 10 minutes at the queued status:

 

 

[screenshot: dbeavon3_0-1736987155578.png]

 

 

... whereas all the other notebooks (40 of them) spent only about two minutes at the queued status.

 

The particular notebook tried to get started at 5:30 AM, about halfway through the whole parallelized batch of 40 notebooks.

 

The error appears to be related to a "livy" bug.  (No, this is not your daddy's Livy; it is a home-grown Microsoft flavor of Livy.)

See below.

 

[screenshot: dbeavon3_1-1736987430523.png]

 

 

 

Unfortunately the error messages in stderr are meaningless to me, the code that is failing is proprietary, and there is no Spark monitoring tool that would let me investigate my theory below.

 

The only theory I have is that the "pool" had some sort of lifecycle event at the time this notebook needed to launch.  The pool may have needed to "autoscale" (the nodes are medium, and autoscale is enabled up to 5 nodes).  The need to autoscale caused a series of unfortunate events, e.g. the autoscaling may have forced a delay in the launching of the notebook.  And after the notebook finally started running on the cluster, the Livy orchestration software decided that ten minutes was an unacceptable delay, and Livy then performed a self-destruction at the very start of the notebook session.

 

 

This theory is based on a lot of guesswork.  E.g. it is based on the fact that the delay was about 10 minutes (not just 1 or 2 minutes), that the 40 other notebooks were able to run without issues, that the Livy (orchestration) code is a very minor part of the overall work that is done, and that this Livy code is one of the few homegrown components managed entirely within the Fabric PG team (not by open-source Apache).  It also assumes that the ten-minute delay and the error messages are related by causation.
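If the theory holds and this is a transient race at session startup, one blunt way to test it would be to retry the failed notebook run and see whether the failure disappears on the second attempt.  Here is a minimal sketch, assuming the batch is driven from a parent Fabric notebook via notebookutils.notebook.run (the notebook name, timeout, and retry counts below are placeholders; adjust to however the 40 notebooks are actually launched):

import time

def run_with_retry(notebook_path, timeout_sec=3600, params=None, max_attempts=2, backoff_sec=120):
    """Run a child notebook, retrying once if the session dies at startup.

    Assumes notebookutils.notebook.run(), which is injected into Fabric
    notebooks (no import needed); the retry/backoff numbers are arbitrary.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return notebookutils.notebook.run(notebook_path, timeout_sec, params or {})
        except Exception as e:
            last_error = e
            # "Session is unable to register ReplId" would surface here as a
            # generic notebook-run failure; log it and retry after a pause.
            print(f"Attempt {attempt} of {notebook_path} failed: {e}")
            time.sleep(backoff_sec)
    raise last_error

# Hypothetical usage for one of the 40 notebooks in the 3 AM batch:
# run_with_retry("Load_Sales_Notebook", timeout_sec=3600, params={"run_date": "2025-01-15"})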

 

I'll add the full stack trace here as well, in case it helps with googling:

 

 

2025-01-15 10:42:13,827 WARN AbstractChannelHandlerContext [RPC-Handler-7]: An exception 'java.lang.IllegalArgumentException: not existed channel:[id: 0xbb421024, L:/10.1.96.7:10001 ! R:/10.1.96.7:45378]' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
java.lang.IllegalArgumentException: not existed channel:[id: 0xbb421024, L:/10.1.96.7:10001 ! R:/10.1.96.7:45378]
	at org.apache.livy.rsc.rpc.RpcDispatcher.getRpc(RpcDispatcher.java:67)
	at org.apache.livy.rsc.rpc.RpcDispatcher.channelInactive(RpcDispatcher.java:85)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at org.apache.livy.rsc.rpc.Rpc$5.channelInactive(Rpc.java:244)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)
	at io.netty.handler.codec.ByteToMessageCodec.channelInactive(ByteToMessageCodec.java:118)
	at org.apache.livy.rsc.rpc.KryoMessageCodec.channelInactive(KryoMessageCodec.java:100)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:301)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.base/java.lang.Thread.run(Thread.java:829)
2025-01-15 10:42:13,826 ERROR ApplicationMaster [Driver]: User class threw exception: 
java.lang.Exception: Session is unable to register ReplId: default
	at org.apache.livy.repl.ReplDriver.getOrCreateSession(ReplDriver.scala:419)
	at org.apache.livy.repl.ReplDriver.initializeSparkEntries(ReplDriver.scala:93)
	at org.apache.livy.rsc.driver.RSCDriver.run(RSCDriver.java:358)
	at org.apache.livy.rsc.driver.RSCDriverBootstrapper.main(RSCDriverBootstrapper.java:97)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 

One way to prove the theory would be for Microsoft to interpret the errors for me.  (I'm opening a case with Mindtree.  I hope to provide a follow-up in the next five weeks.)

 

Another possible way to prove the theory is to experiment with incremental changes to the pool's autoscale settings.  I don't look forward to investigating this Microsoft bug by trial and error.  Hopefully someone else has already reached out to Microsoft about this issue.  Please let me know.
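For what it's worth, a lower-effort check before touching the autoscale settings is to pull the queued duration for every run in the batch (from the monitoring hub, or however the timestamps get exported) and see whether failures line up with long queue times.  A rough sketch, using a hypothetical list of run records whose field names are made up for illustration (substitute whatever the export actually contains):

from datetime import datetime

# Hypothetical export of the 3 AM - 7 AM batch; field names are placeholders.
runs = [
    {"notebook": "nb_01", "submitted": "2025-01-15 10:30:00", "started": "2025-01-15 10:32:10", "status": "Succeeded"},
    {"notebook": "nb_22", "submitted": "2025-01-15 10:32:00", "started": "2025-01-15 10:41:55", "status": "Failed"},
    # ... remaining 38 notebooks ...
]

def queued_minutes(run):
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(run["started"], fmt) - datetime.strptime(run["submitted"], fmt)
    return delta.total_seconds() / 60

# Compare queue time for failed vs. successful runs; if the theory is right,
# the failed run(s) should cluster well above the ~2 minute norm.
for run in sorted(runs, key=queued_minutes, reverse=True):
    print(f'{run["notebook"]}: queued {queued_minutes(run):.1f} min, status {run["status"]}')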

 

 


7 REPLIES
v-tsaipranay
Community Support

Hi @dbeavon3 

Thank you for reaching out to the Microsoft Fabric community forum.

 

Yes, elevated queued time can potentially trigger a notebook failure. 

It appears that the issue is related to a delay in resource allocation due to autoscaling, which triggered Livy’s timeout mechanisms and caused a failure. To address this, I recommend adjusting the autoscaling configuration and reviewing Livy’s timeout settings.

 

I am including a similar thread that has already been resolved. Please review it to gain a better understanding.

https://community.fabric.microsoft.com/t5/Data-Engineering/notebook-queued-time/m-p/4103235

 

I hope my suggestions give you good ideas. If you need any further assistance, feel free to reach out.

If this post helps, then please give us Kudos and consider accepting it as a solution to help other members find it more quickly.

 

Thank you. 

 

 

Hi @v-tsaipranay 

Thank you for your time.  You seem to be familiar with the Fabric flavor of Spark (and that is not a very common thing).  I have had a "pro" support case underway for almost a week, and nobody knows what the platform is doing.  (The engineers have opened a "collab" with the ADF/pipeline team, and I'm guessing that the ADF team will cause another three-day delay for no good reason....)


It is disappointing to hear your conclusion that Livy timeout mechanisms might be causing failures. Do you have any reference material that describes the ten-minute timeout?  The docs you shared indicate that Livy jobs can remain queued for up to 24 hours.  It is not ten minutes, from what I can see.

 

The following link describes a queue restriction that is based on the number of jobs that can be queued (the maximum time is always 24 hours):
Concurrency limits and queueing in Apache Spark for Fabric - Microsoft Fabric | Microsoft Learn

 

[screenshot: dbeavon3_0-1737471227072.png]

 


... in any case I don't think this is relevant to my error, since any type of queue capacity restriction would give a pretty obvious error message to the user, whereas the error I'm seeing is an unfriendly message in the stderr of the driver (i.e. "Session is unable to register ReplId: default").  It has the feeling of a bug.

 

 

 

Ten minutes is extremely short, especially if there is no visibility for customers to see the backlog growing on the cluster.  If we could review these lifecycle operations in the cluster with our own eyes, it would be easy to correlate the notebook failures to whatever is happening in the cluster.  But in Fabric they won't give us any surface area to monitor the cluster, because they want to make things "easy".

 

In some of the other Livy implementations (Synapse and HDI), ten minutes on the Livy queue would not cause any failure.  In fact, we could send dozens of jobs that take 30 minutes each to complete (five running at a time), and we never had to worry about these Livy timeouts!
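To illustrate the pattern: on HDI that kind of workload is just the standard open-source Livy batch REST API, where jobs sit in "starting" for as long as they need to and nothing self-destructs.  A rough sketch of the five-at-a-time submission loop (the endpoint, credentials, and job file paths below are placeholders for an HDInsight-style cluster):

import time
import requests

LIVY = "https://<cluster>.azurehdinsight.net/livy"   # placeholder endpoint
AUTH = ("admin", "<password>")                       # placeholder credentials
MAX_RUNNING = 5

def submit(py_file):
    # Open-source Livy batch API: POST /batches returns {"id": ..., "state": ...}
    r = requests.post(f"{LIVY}/batches", json={"file": py_file}, auth=AUTH)
    r.raise_for_status()
    return r.json()["id"]

def state(batch_id):
    return requests.get(f"{LIVY}/batches/{batch_id}", auth=AUTH).json()["state"]

jobs = [f"wasbs:///jobs/job_{i:02d}.py" for i in range(40)]   # placeholder job files
active = []
while jobs or active:
    # Top up to five concurrently submitted batches.
    while jobs and len(active) < MAX_RUNNING:
        active.append(submit(jobs.pop(0)))
    time.sleep(30)
    # Drop batches that reached a terminal Livy state.
    active = [b for b in active if state(b) not in ("success", "dead", "error", "killed")]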


Let me know if you can share docs that describe the error (Session is unable to register ReplId).

I would not be spending an entire week working with a half-dozen engineers at Mindtree if the error simply explained that there was a Livy timeout of some kind, e.g. "A Livy timeout (10 mins) is preventing any additional notebooks from being executed at this time.  Please try submitting your notebook to this pool on a different day."

 

 

 

Hi @dbeavon3 ,

 

Thank you for the detailed information and understanding you provided. I apologize for any inconvenience caused.

 

As you mentioned, you are facing an issue with "Session is unable to register ReplId". This appears to be a new issue for Microsoft, and we appreciate you bringing it to our attention.

 

To ensure this is addressed effectively, I recommend you submit a support ticket directly to Microsoft using this link: https://learn.microsoft.com/en-us/power-bi/support/create-support-ticket

 

Thank you.

Hi @v-tsaipranay 

Yes, there is a support ticket.  Those take some time and effort.  On rare occasions I can get more timely support here than with Mindtree & Microsoft "pro" support.

My only goal for the support ticket is to get this issue added to the "known issues" list.  If you never see it on that list, then you will know that I have failed.

Hi @dbeavon3 ,

 

We apologize for any inconvenience this may cause and appreciate your patience as we work to resolve the issue. Microsoft engineers are actively investigating to find a solution.

We understand that support tickets require time and effort, and we value your cooperation.

Our priority is to ensure this issue receives the necessary attention. While we cannot guarantee immediate inclusion on the "known issues" list, please be assured that we are working closely with the support teams to escalate its visibility.

 

If you have any further concerns, please feel free to reach out.

 

Thank you for being a valued member of the Microsoft Fabric Community!

nilendraFabric
Community Champion

The Livy-related error is likely a symptom rather than the root cause. Microsoft Fabric uses a custom implementation of Livy, which manages Spark sessions. The 10-minute delay might have exceeded an internal timeout, causing Livy to fail the session creation.

frithjof_v
Super User

I don't know the answer to the question. You can also consider posting on the Fabric subreddit: https://www.reddit.com/r/MicrosoftFabric/s/pg1qg3tNN3
