Solved: Question about internal bugs in "notebookUtils" "mount operation" for default lakehouse

dbeavon3 · ‎01-22-2025

I'm about ready to open my tenth bug in Fabric-Spark. Each one takes me 10+ hours just to transmit to the PG and get them to engage. So I'm hoping someone has seen this error and I won't have to spend the next two weeks waiting on that Fabric-Spark team:

A pyspark notebook started executing cells today at 8:29:57 AM utc:

Not long after that it dies, when trying to open a file on LH:

... the error is pretty meaningless, as you can see above. A file that is in the LH cannot be read, and there is no good reason. This only affected one notebook execution out of about 30 that were performed this morning.

After reviewing the "stderr" of the driver, it appears that this notebook session was doomed to fail before it even started. A "mount operation" had failed but had not prevented the execution of the notebook. This took place at 8:29:25 AM utc.

...

2025-01-22 08:29:25,506 INFO notebookUtils [pool-52-thread-1]: [mount operation] mount point /default/Files info has been updated with operationId 10c731d8-bdf8-4a86-a977-926f9d9cf7ee

...
2025-01-22 08:29:55,625 INFO notebookUtils [pool-52-thread-1]: [checkDriverNodeMountStatus] operationId: 10c731d8-bdf8-4a86-a977-926f9d9cf7ee check driver node cost 30116ms, key:lakehousemounts_vm-ae003372_/default/Files/, result: timeout
2025-01-22 08:29:55,627 INFO notebookUtils [pool-52-thread-1]: Rollback the operation 10c731d8-bdf8-4a86-a977-926f9d9cf7ee due to mount /default/Files failed.

...




2025-01-22 08:29:56,632 ERROR UserConsole [pool-52-thread-1]: java.util.concurrent.RejectedExecutionException: Task scala.concurrent.impl.CallbackRunnable@4960f147 rejected from java.util.concurrent.ThreadPoolExecutor@4de8e398[Shutting down, pool size = 2, active threads = 2, queued tasks = 0, completed tasks = 0]
2025-01-22 08:29:56,633 ERROR UserConsole [pool-52-thread-1]:

You can see that as of 8:29:56 AM utc, this pyspark session was already falling over. Yet the notebook cells still started executing despite the problems. Eventually the notebook dies once we reach the part of the custom code where we expect the LH files to be available for use.

Has anyone already encountered this? Is it on a "known issues" list?

Does this product expect the notebook users to perform validation at the start of our notebooks to ensure that the session/environment isn't corrupted from the very beginning? It is not self-evident that customers in Fabric should do this duplicate work to verify the health of the session.

Is there a notebook util method we can execute, to ensure the session is usable?

Any help would be appreciated.
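
On the question of a pre-flight check: the sketch below is one way to verify, in the first cell, that the default lakehouse paths are actually readable before any custom code runs. It is not a notebookUtils method and not something Microsoft has recommended in this thread. `check_mount` and `require_mounts` are hypothetical helper names, and the path `/lakehouse/default/Files` in the usage note is an assumption about where the default lakehouse is exposed on the driver's local file system. The listing runs in a daemon thread so that a hung mount cannot hang the check itself.

```python
import os
import threading


def check_mount(path, timeout_s=10.0):
    """Return (ok, message) after trying to list `path` within `timeout_s` seconds.

    Hypothetical helper: it only proves the directory can be listed,
    not that every file under it can be read.
    """
    result = {}

    def probe():
        try:
            os.listdir(path)
            result["ok"] = True
        except OSError as exc:  # missing path, not a directory, I/O error, ...
            result["error"] = exc

    # Daemon thread: if the listing hangs on a broken mount, the thread is
    # abandoned instead of blocking the notebook or interpreter shutdown.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        return False, f"listing {path} did not finish within {timeout_s}s"
    if "error" in result:
        return False, f"{path} is not readable: {result['error']}"
    return True, f"{path} is readable"


def require_mounts(paths, timeout_s=10.0):
    """Raise RuntimeError naming every path that fails check_mount."""
    problems = []
    for path in paths:
        ok, message = check_mount(path, timeout_s)
        if not ok:
            problems.append(message)
    if problems:
        raise RuntimeError("session mounts look unhealthy: " + "; ".join(problems))
```

In a first cell this might be called as `require_mounts(["/lakehouse/default/Files"])`, so that the notebook fails immediately with a message pointing at the mount rather than a minute later inside custom code.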

V-yubandi-msft · ‎01-23-2025

Hi @dbeavon3 ,

Thank you for bringing this issue to our attention. We have investigated the matter and verified that it is not a known issue at this time.

We apologize for the inconvenience and understand your frustration. Unfortunately, we cannot escalate this to the support team directly.

If you still face the issue then please consider raising a support ticket for further assistance.

To raise a support ticket for Fabric and Power BI, kindly follow the steps outlined in the following guide:

How to create a Fabric and Power BI Support ticket - Power BI | Microsoft Learn

Thank you for your patience and understanding.


dbeavon3 · ‎01-22-2025

I see now that there are multiple threads failing in the stderr of the driver.

At first I was only focused on pool-52-thread-1

... it is really unfortunate that the Fabric team allows notebook cells to start executing despite the severity of the problems in the session itself. That only leads to massive confusion and finger-pointing, since the notebook ultimately dies in a custom code cell, yet this death is triggered by a problem that was put into motion a minute prior (back within the code that belongs to Microsoft).

Hopefully the PG will start working on this soon, and add it to their known issues list for the sake of the community.

dbeavon3 · ‎01-28-2025

Hi @V-yubandi-msft

Is there any further help you can provide on the Microsoft side? After almost two weeks, some anonymous FTE (who is working with Mindtree on my case) is saying that this is not a bug for some reason.

There have even been prior ICMs, in addition to mine.

How can it be possible that this is NOT a bug, given that the default mount for the default lakehouse is missing when the notebook starts? So far they want me to check the health of our mounts in a manual way, to make sure the session wasn't corrupted prior to running our solution. That seems extremely unreasonable.

Obviously the docs say that the mount should be present, and users should not reasonably expect it to be otherwise.

https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-notebook-load-data#load-data-wit...

dbeavon3 · ‎01-23-2025

Hi @V-yubandi-msft

Thanks for the reply.

Do you mean it is not known to customers, or is not known to Mindtree (ICMs/SRs), or is not known to this Microsoft PG, or has not been published to the "known issues" list?

I am not sure I know what you mean, but I will go ahead and give you the benefit of the doubt. The ticket is reported here:
TrackingID#2501220040016849.

... but I was under the impression that I'm (at least) the tenth customer to report this bug. It seems extremely unlikely that these file-system mounting bugs are not known to the PG. In fact, it appears they already implemented 3 retries to try to avoid the bug:

In my experience the various PGs of Fabric typically expect their customers to own responsibility for implementing all these "retries". But when a PG agrees to perform these retries in their own code, you can be 100% certain that they are very familiar with the underlying bugs.
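
For readers who end up owning the retries themselves, this is roughly what such a customer-side wrapper looks like. It is a generic sketch with exponential backoff, not the retry logic inside notebookUtils (which is not visible from this thread). The name `retry` and all of its parameters are hypothetical. It can only help with transient errors; nothing here suggests that retrying a read will bring back a mount that was already rolled back.

```python
import time


def retry(action, attempts=3, base_delay_s=1.0, retry_on=(OSError,), sleep=time.sleep):
    """Call `action()` up to `attempts` times, doubling the delay after each failure.

    Only exceptions listed in `retry_on` are retried; the last one is re-raised
    when every attempt fails. `sleep` is injectable so the delays can be tested.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as exc:
            last_exc = exc
            if attempt < attempts:
                # Delays of base, 2*base, 4*base, ... between attempts.
                sleep(base_delay_s * 2 ** (attempt - 1))
    raise last_exc
```

A call such as `retry(lambda: open(path, "rb").read())` would wrap a single file read, where `path` is whatever file the notebook needs.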

I also want to point out how unexpectedly this issue can arise. Out of thirty notebooks running in a loop, one of them might have a mounting issue like so:

Side note: I want to put these messages here once more to make sure they appear in search results:

2025-01-22 08:29:55,625 INFO notebookUtils [pool-52-thread-1]: [checkDriverNodeMountStatus] operationId: 10c731d8-bdf8-4a86-a977-926f9d9cf7ee check driver node cost 30116ms, key:lakehousemounts_vm-ae003372_/default/Files/, result: timeout
2025-01-22 08:29:55,627 INFO notebookUtils [pool-52-thread-1]: Rollback the operation 10c731d8-bdf8-4a86-a977-926f9d9cf7ee due to mount /default/Files failed.
2025-01-22 08:29:55,736 INFO notebookUtils [pool-52-thread-2]: [checkDriverNodeMountStatus] operationId: a42ced8c-b6df-4623-8ab5-80c6b579ed2a check driver node cost 30103ms, key:lakehousemounts_vm-ae003372_/default/Tables/, result: timeout
2025-01-22 08:29:55,737 INFO notebookUtils [pool-52-thread-2]: Rollback the operation a42ced8c-b6df-4623-8ab5-80c6b579ed2a due to mount /default/Tables failed.
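
The "Rollback the operation" lines above are regular enough to scan for automatically when triaging a batch of driver logs. The sketch below pulls the timestamp, thread, operationId and mount point out of each such line. The pattern is inferred only from the four log lines quoted in this post, so treat the format as an assumption that may not hold for other runtime versions. `find_failed_mounts` is a hypothetical name, and the function expects the stderr text to have been obtained separately.

```python
import re

# Matches lines such as:
# 2025-01-22 08:29:55,627 INFO notebookUtils [pool-52-thread-1]: Rollback the
# operation <operationId> due to mount /default/Files failed.
ROLLBACK = re.compile(
    r"^(?P<timestamp>\S+ \S+) \w+ notebookUtils \[(?P<thread>[^\]]+)\]: "
    r"Rollback the operation (?P<operation_id>\S+) "
    r"due to mount (?P<mount>\S+) failed\."
)


def find_failed_mounts(log_text):
    """Return one dict per rolled-back mount operation found in `log_text`."""
    failures = []
    for line in log_text.splitlines():
        match = ROLLBACK.match(line.strip())
        if match:
            failures.append(match.groupdict())
    return failures
```

Each returned dict has the keys `timestamp`, `thread`, `operation_id` and `mount`, which makes it easy to see which of the ~30 runs started with a broken session.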

V-yubandi-msft · ‎02-03-2025

Hi @dbeavon3 ,

Thank you for reaching out and sharing your concerns. We understand that this issue can be frustrating, and we sincerely apologize for any inconvenience it has caused.

Although we do not have direct access to track internal tickets, the support team is responsible for managing the raised ticket and prioritizing cases accordingly. We appreciate you bringing this to our attention.

Thank you for your patience and understanding.

Regards,

Yugandhar.

dbeavon3 · ‎02-03-2025

Hi @V-yubandi-msft

I've noticed that the community is filled with responses from "community support" in the format you shared.

"Thanks"... "We understand"... "We apologize", ... "We appreciate"... "Please contact support".

This is pointless noise to add to the community forums, which are already noisy enough as it is.

OF COURSE I'm contacting support, but I'm also trying to communicate about the Fabric bugs in the community, since Microsoft never shares any public information about their bugs. It is very likely that a dozen other customers are impacted by the same bugs, and are hoping to find meaningful support here. However, if they see your response (1 week or 1 month or 1 year from now) they will gain absolutely nothing from it, as you can probably understand. Please be patient while I wait for Mindtree to offer us some meaningful support for these bugs.

V-yubandi-msft · ‎02-04-2025

Hi @dbeavon3 ,

We are unable to track support tickets directly. The support team is in charge of managing the raised tickets and prioritizing cases as needed. We trust that you understand the situation clearly.

Regards,

Yugandhar.
