
smoqt
Advocate I

Fabric Notebook Stuck in "Running" Status

This weekend, a Fabric Notebook that is scheduled to run weekly ran for over 36 hours before being terminated manually by our team.

 

Specifically, this occurred while running an OPTIMIZE statement against a small lakehouse table.

 

When I checked the Spark session details for the stage it was stuck on, it simply said "No tasks have started yet."

 

I checked the stderr logs and found a number of errors generated on the driver node. One message I see repeatedly is "Unable to cache access token for X to Y java.lang.NoClassDefFoundError: org/apache/zookeeper/Watcher" (where X and Y differ each time).

 

While researching, I found a very similar issue posted to reddit last week: https://www.reddit.com/r/MicrosoftFabric/comments/1l796d8/stuck_spark_job/

Is this a known issue? Are there any workarounds to prevent it going forward, or plans to resolve it?

 

3 REPLIES
v-pnaroju-msft
Community Support

Hi smoqt,

Thank you for your follow-up and the update provided.

  1. At present, I have not found any documentation confirming that this is a known bug. The underlying cause appears to be improper token-caching logic during Spark session initialization, combined with missing runtime dependencies (such as the ZooKeeper classes) in the driver environment.

  2. Pipeline activity timeouts do not terminate the associated Spark job. Consequently, the notebook may continue to run in the background, consuming Capacity Units (CUs). To manage this, we recommend using the "Monitor Activities" feature in Microsoft Fabric, as detailed in Monitor activities in Microsoft Fabric - Training | Microsoft Learn, to track and manually terminate long-running jobs. Additionally, you may consider implementing a watchdog script that leverages the REST API to monitor and cancel jobs exceeding a predefined runtime.

  3. Please utilise the Capacity Metrics app to identify notebooks that have high CU consumption. It is advisable to break lengthy notebooks into smaller, modular units to minimise the risk of idle drivers. Furthermore, configuring the Spark job admission rules to limit concurrency may help prevent system overload.
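The watchdog idea in point 2 can be sketched in Python. Note that the base URL, endpoint paths, and response fields (`value`, `status`, `startTimeUtc`) are assumptions to verify against the official Fabric Job Scheduler REST API reference, and token acquisition is left out:

```python
import json
import urllib.request
from datetime import datetime, timezone

# Base URL and endpoint paths are assumptions -- verify them against the
# Microsoft Fabric Job Scheduler REST API documentation before relying on this.
FABRIC_API = "https://api.fabric.microsoft.com/v1"
MAX_RUNTIME_HOURS = 4  # cancel anything that has been running longer than this

def is_overrunning(start_utc: datetime, now_utc: datetime, max_hours: float) -> bool:
    """Pure check: has a job exceeded the allowed runtime?"""
    return (now_utc - start_utc).total_seconds() > max_hours * 3600

def cancel_long_running_jobs(workspace_id: str, item_id: str, token: str) -> None:
    """List job instances for a notebook item and cancel any overrunning ones."""
    headers = {"Authorization": f"Bearer {token}"}
    base = f"{FABRIC_API}/workspaces/{workspace_id}/items/{item_id}/jobs/instances"
    with urllib.request.urlopen(urllib.request.Request(base, headers=headers)) as resp:
        jobs = json.load(resp).get("value", [])
    now = datetime.now(timezone.utc)
    for job in jobs:
        if job.get("status") != "InProgress":
            continue
        started = datetime.fromisoformat(job["startTimeUtc"].replace("Z", "+00:00"))
        if is_overrunning(started, now, MAX_RUNTIME_HOURS):
            cancel = urllib.request.Request(
                f"{base}/{job['id']}/cancel", headers=headers, method="POST", data=b""
            )
            urllib.request.urlopen(cancel)  # request cancellation of the job instance
```

Such a script could itself run on a short schedule (outside the affected capacity) so a stalled driver cannot block its own watchdog.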

Given the systemic nature of this issue and the potential cost implications, we recommend raising a support ticket with Microsoft for a thorough investigation and resolution. You may raise a ticket via Microsoft Fabric Support and Status | Microsoft Fabric.

Additionally, please refer to the following links:
What is the Microsoft Fabric Capacity Metrics app? - Microsoft Fabric | Microsoft Learn
Understand the metrics app compute page - Microsoft Fabric | Microsoft Learn
Install the Microsoft Fabric capacity metrics app - Microsoft Fabric | Microsoft Learn

If you find our response helpful, kindly mark it as the accepted solution. This will assist other community members facing similar queries.

Should you have any further questions, please feel free to reach out to the Microsoft Fabric community.

Thank you.

v-pnaroju-msft
Community Support

Hi smoqt,

We sincerely appreciate your inquiry through the Microsoft Fabric Community Forum.

Based on my understanding, the issue pertains to a Spark driver-side failure in Microsoft Fabric, where the caching of authentication tokens intermittently fails. Specifically, the error "java.lang.NoClassDefFoundError: org/apache/zookeeper/Watcher" indicates a runtime dependency problem in the driver environment, which causes the Spark session to stall indefinitely before launching any tasks. This is also reflected in the message "No tasks have started yet." I suspect this may be related to driver misconfiguration during token handling, particularly in long-running or idle notebook sessions.

Kindly follow the workaround steps provided below, which may help resolve the issue:

  1. Incorporate error-handling logic in your notebook to gracefully retry the OPTIMIZE command if it fails.
  2. Since the OPTIMIZE command is metadata-intensive, it is advisable to execute it during periods of lower usage to reduce Spark queuing and driver load.
  3. Configure notebook activity timeouts or set up Fabric pipeline alerts to prevent prolonged hanging.
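Step 1 above could be sketched as a small retry wrapper. This is a minimal illustration, not Fabric-specific: the `OPTIMIZE` call is shown commented out as hypothetical usage, and in a real notebook you would narrow the caught exception to the relevant Spark error type:

```python
import time

def run_with_retry(fn, attempts=3, backoff_seconds=60):
    """Call fn, retrying on failure; re-raise the last error if all attempts fail."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # in practice, catch the specific Spark exception
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {backoff_seconds}s")
            time.sleep(backoff_seconds)

# In a Fabric notebook (hypothetical table name):
# run_with_retry(lambda: spark.sql("OPTIMIZE my_lakehouse_table"))
```

Note that a retry only helps when the failure surfaces as an exception in the notebook; it cannot recover a session that hangs silently before any task starts.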

Additionally, please refer to the following links for further guidance:
Apache Spark compute for Data Engineering and Data Science - Microsoft Fabric | Microsoft Learn
Concurrency limits and queueing in Apache Spark for Fabric - Microsoft Fabric | Microsoft Learn

If you find our response helpful, kindly mark it as the accepted solution and provide kudos. This will assist other community members facing similar issues.

Should you have any further queries, please feel free to contact the Microsoft Fabric community.

Thank you.

I appreciate the response, but I still have some concerns and would like clarification on a few points:

  1. Is this a confirmed and documented bug?
    If so, could you please share a link or internal reference? I found a nearly identical issue discussed on Reddit, which makes me think this may be systemic.

  2. Error-handling logic doesn't apply here.
    In my case, no error was thrown in the notebook itself. The only errors were in the driver node logs (e.g., token caching failure and NoClassDefFoundError), which were not surfaced to the notebook runtime. So retry logic wouldn’t help, since the notebook never technically failed.

  3. The notebook already runs during low-usage hours.
    It is scheduled to run early Sunday mornings, which should already avoid peak concurrency.

  4. Regarding pipeline timeouts:
    If I wrap the notebook in a pipeline with an activity timeout, will that actually terminate the Spark job? Or will the notebook continue to run indefinitely in the background, consuming Capacity Units (CUs)? This is critical for cost management.

My biggest concern is the potential for runaway CU consumption with no clear failure or timeout mechanism. Any guidance on how to monitor or guard against this scenario—especially when Spark stalls before launching any tasks—would be appreciated.
