dbeavon3
Power Participant

Queued time is elevated - can this trigger a notebook failure?

A series of notebooks ran from 3 AM to 7 AM.  Only one of them failed.

 

[Screenshot dbeavon3_2-1736987508804.png: monitoring view of the overnight notebook runs]

 

The most obvious issue in that failed notebook (aside from the errors that appear in stderr) is the elevated duration of time spent in the "queued" status.  I'm not totally sure whether this elevated duration was just an additional symptom of the failed notebook, or whether the long time in the queue is what CAUSED the notebook to fail.

 

The failed notebook spent almost 10 minutes in the queued status:

 

 

[Screenshot dbeavon3_0-1736987155578.png: the failed notebook showing roughly 10 minutes in the queued status]

 

 

... whereas all the other notebooks (40 of them) spent only about two minutes in the queued status.

 

The particular notebook tried to start at 5:30 AM, about halfway through the whole parallelized batch of 40 notebooks.
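
(For reference, our batch isn't necessarily driven this way, but a minimal parent-notebook sketch of this kind of fan-out looks roughly like the code below.  The notebook names, concurrency limit, and timeout are placeholders, and it assumes mssparkutils.notebook.run is available the way it is in Synapse-style notebooks.  Timing each child run is what makes a ten-minute outlier stand out.)

# Sketch only: fan out child notebooks with bounded parallelism and record
# how long each one takes.  Names, limits, and timeout are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from notebookutils import mssparkutils  # pre-imported in Fabric/Synapse notebooks

CHILD_NOTEBOOKS = [f"nb_load_table_{i:02d}" for i in range(40)]  # placeholder names
MAX_PARALLEL = 5         # e.g. five children running at a time
TIMEOUT_SECONDS = 3600   # per-child timeout passed to notebook.run()

def run_child(name):
    started = time.time()
    exit_value = mssparkutils.notebook.run(name, TIMEOUT_SECONDS)
    return name, exit_value, time.time() - started

results = []
with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    futures = {pool.submit(run_child, nb): nb for nb in CHILD_NOTEBOOKS}
    for fut in as_completed(futures):
        try:
            results.append(fut.result())
        except Exception as exc:  # a Livy/session failure on one child lands here
            results.append((futures[fut], f"FAILED: {exc}", None))

for name, outcome, elapsed in sorted(results):
    print(name, outcome, f"{elapsed:.0f}s" if elapsed else "")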

 

The error appears to be related to a "Livy" bug.  (No, this is not your daddy's Livy; it is a home-grown Microsoft flavor of Livy.)

See below.

 

[Screenshot dbeavon3_1-1736987430523.png: stderr excerpt showing the Livy error]

 

 

 

Unfortunately, the error messages in stderr are totally meaningless, the code that is failing is totally proprietary, and there is no Spark monitoring tool for me to investigate my theory below.

 

The only theory I have is that the "pool" had some sort of lifecycle event at the time this notebook needed to launch.  The pool may have needed to "autoscale" (the nodes are medium-sized, and autoscale is enabled up to 5 nodes).  The need to autoscale caused a series of unfortunate events, e.g. the autoscaling may have forced a delay in launching the notebook.  And after the notebook finally started running on the cluster, the Livy orchestration software decided that ten minutes was an unacceptable delay and performed a self-destruction at the very start of the notebook session.

 

 

This theory is based on a lot of guesswork.  E.g. it is based on the fact that the delay was about 10 minutes (not just 1 or 2 minutes), and that 40 other notebooks were able to run without issues.  It also relies on the fact that the Livy (orchestration) code is a very minor part of the overall work that is done, and that this Livy code is one of the few homegrown components managed entirely within the Fabric PG team (not by open-source Apache).  Finally, it assumes that the ten-minute delay and the error messages are related by causation.
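
In the absence of a cluster monitoring surface, the only evidence I can gather is from inside the notebooks themselves.  Something like the sketch below at the top of each notebook (plain Spark introspection, not any documented Fabric troubleshooting API) would at least timestamp when the driver actually started working, and show whatever Livy/timeout settings happen to be exposed in the session conf:

# Sketch only: capture session-startup evidence at the top of a child notebook,
# to help correlate a long "queued" phase with pool autoscaling.  Assumes the
# usual `spark` session object provided by the notebook runtime.
import datetime

sc = spark.sparkContext
print("notebook body started at:", datetime.datetime.utcnow().isoformat())
print("application id:", sc.applicationId)
print("default parallelism:", sc.defaultParallelism)

# Dump any session conf keys that mention livy or timeout; there is no
# guarantee that Fabric exposes the setting that actually governs the 10 minutes.
for key, value in sorted(sc.getConf().getAll()):
    if "livy" in key.lower() or "timeout" in key.lower():
        print(key, "=", value)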

 

I'll add the full stack trace here as well, in case it helps with googling:

 

 

2025-01-15 10:42:13,827 WARN AbstractChannelHandlerContext [RPC-Handler-7]: An exception 'java.lang.IllegalArgumentException: not existed channel:[id: 0xbb421024, L:/10.1.96.7:10001 ! R:/10.1.96.7:45378]' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
java.lang.IllegalArgumentException: not existed channel:[id: 0xbb421024, L:/10.1.96.7:10001 ! R:/10.1.96.7:45378]
	at org.apache.livy.rsc.rpc.RpcDispatcher.getRpc(RpcDispatcher.java:67)
	at org.apache.livy.rsc.rpc.RpcDispatcher.channelInactive(RpcDispatcher.java:85)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at org.apache.livy.rsc.rpc.Rpc$5.channelInactive(Rpc.java:244)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)
	at io.netty.handler.codec.ByteToMessageCodec.channelInactive(ByteToMessageCodec.java:118)
	at org.apache.livy.rsc.rpc.KryoMessageCodec.channelInactive(KryoMessageCodec.java:100)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:301)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.base/java.lang.Thread.run(Thread.java:829)
2025-01-15 10:42:13,826 ERROR ApplicationMaster [Driver]: User class threw exception: 
java.lang.Exception: Session is unable to register ReplId: default
	at org.apache.livy.repl.ReplDriver.getOrCreateSession(ReplDriver.scala:419)
	at org.apache.livy.repl.ReplDriver.initializeSparkEntries(ReplDriver.scala:93)
	at org.apache.livy.rsc.driver.RSCDriver.run(RSCDriver.java:358)
	at org.apache.livy.rsc.driver.RSCDriverBootstrapper.main(RSCDriverBootstrapper.java:97)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 

One way to prove the theory would be for Microsoft to interpret the errors for me.  (I'm opening a case with Mindtree.  I hope to provide a follow-up in the next five weeks.)

 

Another possible way to prove the theory is to experiment with incremental changes to the pool's autoscale settings.  I don't look forward to investigating this Microsoft bug by trial-and-error.  Hopefully someone else has already reached out to Microsoft about this issue.  Please let me know.
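
If I do end up going the trial-and-error route, the cheapest mitigation to test is probably a retry with backoff around each child run, giving the pool time to finish autoscaling before the next attempt.  A hedged sketch, reusing the hypothetical run_child helper from the fan-out example above:

import time

# Sketch only: treat a session-startup failure (e.g. "Session is unable to
# register ReplId") as transient and retry with a growing backoff.  run_child
# is the hypothetical helper from the earlier parent-notebook sketch.
def run_with_retry(name, attempts=3, backoff_seconds=120):
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run_child(name)
        except Exception as exc:
            last_error = exc
            print(f"{name}: attempt {attempt} failed ({exc}); backing off...")
            time.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"{name} failed after {attempts} attempts") from last_error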

 

 

4 REPLIES
v-tsaipranay
Community Support

Hi @dbeavon3 

Thank you for reaching out to the Microsoft Fabric Community forum.

 

Yes, elevated queued time can potentially trigger a notebook failure. 

It appears that the issue is related to a delay in resource allocation due to autoscaling, which triggered Livy’s timeout mechanisms and caused a failure. To address this, I recommend adjusting the autoscaling configuration and reviewing Livy’s timeout settings.

 

I am including a similar thread that has already been resolved.  Please review it to gain a better understanding.

https://community.fabric.microsoft.com/t5/Data-Engineering/notebook-queued-time/m-p/4103235

 

I hope my suggestions give you good ideas.  If you need any further assistance, feel free to reach out.

If this post helps, then please give us Kudos and consider accepting it as a solution to help the other members find it more quickly.

 

Thank you. 

 

 

Hi @v-tsaipranay 

Thank you for your time.  You seem to be familiar with the Fabric flavor of Spark (and that is not a very common thing).  I have had a "pro" support case underway for almost a week, and nobody knows what the platform is doing.  (The engineers have opened a "collab" with the ADF/pipeline team, and I'm guessing that the ADF team will cause another three-day delay for no good reason...)


It is disappointing to hear your conclusion that Livy timeout mechanisms might be causing failures.  Do you have any reference material that describes the ten-minute timeout?  The docs you shared indicate that Livy jobs can remain queued for up to 24 hours.  It is not ten minutes, from what I can see.

 

The following link describes a queue restriction that is based on the number of jobs that can be queued (the max time is always 24 hours):
Concurrency limits and queueing in Apache Spark for Fabric - Microsoft Fabric | Microsoft Learn

 

[Screenshot dbeavon3_0-1737471227072.png: queueing limits table from the Microsoft Learn doc]

 


... in any case, I don't think this is relevant to my error, since any type of queue-capacity restriction would give a pretty obvious error message to the user, whereas the error I'm seeing is an unfriendly message in the stderr of the driver (i.e. "Session is unable to register ReplId: default").  It has the feeling of a bug.

 

 

 

Ten minutes is extremely short, especially when there is no visibility for customers to see the backlog growing on the cluster.  If we could review these lifecycle operations in the cluster with our own eyes, then it would be easy to correlate the notebook failures to whatever is happening in the cluster.  But in Fabric they won't give us any surface area to monitor the cluster, because they want to make things "easy".

 

In some of the other Livy implementations (Synapse and HDI), ten minutes on the Livy queue would not cause any failure.  In fact, we can send dozens of jobs that take 30 minutes each to complete (five running at a time), and we never have to worry about these Livy timeouts!


Let me know if you can share docs that describe the error ("Session is unable to register ReplId").

I would not be spending an entire week working with a half-dozen engineers at Mindtree if the error simply explained that there was a Livy timeout of some kind, e.g. "A Livy timeout (10 mins) is preventing any additional notebooks from being executed at this time.  Please try submitting your notebook to this pool on a different day."

 

 

 

nilendraFabric
Resolver II

The Livy-related error is likely a symptom rather than the root cause. Microsoft Fabric uses a custom implementation of Livy, which manages Spark sessions. The 10-minute delay might have exceeded an internal timeout, causing Livy to fail the session creation.

frithjof_v
Community Champion

I don't know the answer to the question. You can also consider posting on the Fabric subreddit: https://www.reddit.com/r/MicrosoftFabric/s/pg1qg3tNN3
