dbeavon3
Power Participant

Another meaningless spark error: Session is unable to register ReplId

There is a new error today.  This beauty is in the stderr of the driver:

 

Session is unable to register ReplId: default

 

 

It is frustrating to use this pyspark environment.  Every pyspark environment from Microsoft seems worse than the previous.  HDInsight is wonderful, Synapse was pretty bad, and Fabric is terrible.  ... I'm opening spark support tickets with Mindtree at a faster rate than they are going to Microsoft.  Since I already have several tickets without an ICM, I'm not eager to continue creating these.  But I want to put this here so it can be googled, at the very least.

 

The notebook session won't be created.  The most prominent error stack in stderr of the driver looks like so:

 

2025-01-15 10:42:14,038 INFO ApplicationMaster [shutdown-hook-0]: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.lang.Exception: Session is unable to register ReplId: default
at org.apache.livy.repl.ReplDriver.getOrCreateSession(ReplDriver.scala:419)

 

dbeavon3_0-1736979015287.png

 

... it is presented as a WARN in the logs, but the ultimate impact is that my notebook won't start.

 

 

Earlier in the stderr, there is a log entry with ERROR severity, but it is even less meaningful than the WARN:

 


2025-01-15 10:42:12,846 ERROR ServerRuntime$Responder [SparkUI-85]: An I/O error has occurred while writing a response message entity to the container output stream.

 

dbeavon3_1-1736979330792.png

 

 

Earlier than that are some other WARNs; I'm not sure whether they are relevant:

dbeavon3_2-1736979596628.png
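
For anyone triaging the same driver stderr, here is a minimal sketch that pulls out the two messages quoted above. The regex only assumes the line layout visible in those two quoted lines (timestamp, level, logger, thread in brackets, message), and the names `LOG_LINE`, `SIGNATURES` and `find_signatures` are hypothetical helpers, not part of Fabric, Spark or Livy. It only locates the messages; it does not explain them.

```python
import re
from datetime import datetime

# Layout assumed from the two stderr lines quoted in this post:
# "<date> <time>,<millis> <LEVEL> <logger> [<thread>]: <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<logger>\S+) \[(?P<thread>[^\]]+)\]: (?P<msg>.*)$"
)

# Substrings copied from the messages quoted above (hypothetical watch list).
SIGNATURES = (
    "Session is unable to register ReplId",
    "An I/O error has occurred while writing a response message entity",
)


def find_signatures(stderr_text, signatures=SIGNATURES):
    """Return (timestamp, level, logger, signature) tuples, oldest first."""
    hits = []
    for line in stderr_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None:
            # Stack-trace continuation lines ("at org.apache...") land here.
            continue
        for signature in signatures:
            if signature in match["msg"]:
                ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f")
                hits.append((ts, match["level"], match["logger"], signature))
    return sorted(hits)
```

Calling `find_signatures(open("stderr.txt").read())` on a saved copy of the driver stderr (the file name is just an example) returns the matching entries in time order, which makes it easier to see which message came first.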

 

If anyone has seen these errors, or if you are already working on a support ticket please let me know.  I will be very grateful if you can spare me a week or two of effort.

 

 

v-vpabbu
Community Support

Hi @dbeavon3,

 

Thank you for reaching out to Microsoft Fabric Community Forum.

 

Please verify if there are any known issues or bugs with the specific version of Microsoft Fabric.

If there are no bugs, please try again later; it might be a temporary glitch with the backend services.

 

Did I answer your question? Mark my post as a solution, this will help others!
If my response(s) assisted you in any way, don't forget to give "Kudos"

 

Regards,
Vinay Pabbu

@v-vpabbu 

Thank you for the reply.


In this case, I seem to be the first one to ever post about this message to the public internet.

 

Regardless of whether it is transient, customers need to understand the meaning of any error message.  We need to know if errors can be fixed in our own custom code or must be fixed in Microsoft's code.  We need to know if the fix will be considered a "workaround" or a "permanent" fix.  We need to know if retries should be done 3 times or 30 times.  We need to know if the problem is affecting all customer tenants, or if it is isolated to just our own tenant.

 

It is extremely unlikely that this is a "temporary glitch" which I will never see again.  I should mention that I've only been using Fabric Spark for a few days, and I've encountered many, many errors which are meaningless to me.  I have experience with other Spark platforms and in most cases we are able to google for an explanation about any error message.  In most Spark platforms the error messages can be fixed by correcting our own code.  However the bugs I'm seeing in this Fabric flavor of Spark seem to originate from Microsoft's home-grown extensions to Spark.  (like Livy, for example)

 

If you are able to interpret this error message or the callstack please let me know.  You could save me a LOT of time.  The Mindtree support cases take many weeks, while waiting on a response from the back-end PG.  I have started that process, but nobody from Microsoft has engaged with me yet.

 

 

 

 

Hi @dbeavon3,

 

I apologize for the challenges you're facing. I understand the importance of having clear explanations and resolutions for error messages. I suggest raising a support ticket. If you have already done so, I kindly request you to wait for the response.

How to create a Fabric and Power BI Support ticket - Power BI | Microsoft Learn

 

Regards,

Vinay Pabbu

@v-vpabbu 

I have tried raising a support ticket, but the Microsoft PG has not started to engage with this bug yet. 

 

As of now I am only talking to the intermediary named Mindtree.  As a customer I really expect the SaaS software vendor to engage when there are bugs.  Until that happens, I must also reach out to this community and to other social media.  There is a fairly significant chance that others are encountering these bugs as well, but they have not been added to the "known issues" yet.

Hi @dbeavon3,

 

I understand your frustration and concerns. Could you please provide an update on the status of your ticket? Has the issue been resolved, or has the support team provided any solution?

 

Regards,

Vinay Pabbu

Hi @v-vpabbu 


I haven't updated the community since this past Monday.


As I mentioned, I was blocked by an SME at Microsoft for several days.  I think I'm unblocked now.  I am told there is an ICM now, but I can't independently verify whether that is true.

 

I doubt you have access to see the customer SR, but you are more than welcome to try.  The SR number is TrackingID#2501160040000052.

I am pretty satisfied with the effort on the Mindtree side.  However, I don't have high hopes for the back-end PG.  At the very minimum I would expect them to publish their bugs to their "known issues" list.  It doesn't seem they do that as a matter of course.  As of now all the Spark bugs we encounter are missing from the public list, and it requires a minimum investment of 10+ hours of time, along with an average wait time of two weeks, before any ticket is sent to the PG (via a new ICM).

 

 

 

Hi @dbeavon3,

 

Thank you. I completely understand your issue; however, I am unable to access ICMs with just the tracking ID. Could you please provide the ICM ID?
Since there is already an ICM in place, I hope it will be resolved as soon as possible based on the severity. I kindly request your patience until the ICM is resolved.

 

Regards,

Vinay Pabbu

 

 

Hi @v-vpabbu You should be able to find the ICM attached to the SR. 

Do you work for Mindtree?  It seems like you are unaware of their policies and procedures.  You are far more likely than me to get ahold of the ICM number.  You can reach out to the ops manager or to any Microsoft FTE. 

 

Fabric Bugs are so challenging to work on, starting with the fact that Microsoft/Mindtree won't even share the bug identifiers.  It is extremely frustrating.  In place of a real identifier, you can use this post or else the SR number.  Those can serve as the bug identifier, especially when opening a "unified" ticket about the bug.  In my experience you can get any identifiers you may need (like the ICM) by talking to an FTE.  A Microsoft employee has a lot fewer restrictions on what they can share about Microsoft bugs.

 

 

Hi @dbeavon3,

 

I understand your frustration, and I sincerely apologize for any inconvenience. I do not have direct access to track ICMs.

ICMs are managed by the support team, who prioritize them based on severity and issue type. Unfortunately, I do not have the scope to reach out to other members regarding ICMs. I recommend waiting for the support team’s updates, as they have full control over the resolution process.

Regards,

Vinay Pabbu

FYI, I closed the ticket, and got a very low level of support for the issue.

It turned out that VMs weren't being started or something like that.  I'm guessing that it is NOT a problem that can be addressed by the Fabric-Spark team themselves.  It is probably a cross-team problem.  These are the types of issues which Mindtree support is probably not equipped to work on.  I think they can hardly work on a bug that is isolated to the Fabric-Spark team, let alone one that involves VM provisioning in Azure.

My biggest complaint is that the error message surfaced to the customer is meaningless.  There are back-end logs (kusto?) where the engineers can dig into the failures in the cluster itself; but of course they don't allow customers to perform those investigations when our workloads are dying.  So we can't independently investigate these bugs, and if we ask Mindtree to do it, they can't either.  You have to wait about two weeks for the PG (Microsoft itself) to get on board. 

I suspect this bug is one that customers will see more frequently in certain Azure regions (e.g., we are hosting in North Central US).  I don't think there will ever be a permanent fix, but hopefully they will one day tweak the error messages that are surfaced to their customers.

Below is the minimal explanation of what can be seen in their kusto logs, when this bug is happening in our Spark jobs on Fabric.

 

dbeavon3_0-1738344203239.png

 

 

 

 

 

 

 

Hi @dbeavon3,

 

Thank you for providing detailed information from the support team. It will be helpful for other members of the community who have similar problems to solve them faster.

 

I suggest submitting your detailed feedback and ideas through Microsoft's official feedback channels, such as Microsoft Fabric Ideas. Feedback submitted through these channels is frequently reviewed by the product teams and can contribute to meaningful improvements.
https://ideas.fabric.microsoft.com/ideas/search-ideas/ 

 

Regards,

Vinay Pabbu

I wish the normal support channels were more effective.  I'm certainly not the only customer who encountered this bug in Spark, yet it feels like I'm fending for myself.  The Microsoft PG won't often fix bugs right away, nor publish the details to their "known issues" list.

 

This community is at least one place we can raise awareness about bugs. 

 

The only problem with this community is that I do NOT necessarily want to be the one providing future support to the other customers, after they encounter the same issue and fail to get meaningful support from Microsoft.

 
