abhradwip
New Member

Notebook execution failure due to bad node

Hello All,

We are currently using F128 capacity with the following Spark pool configuration: Runtime 1.2 (Spark 3.4, Delta 2.4), Medium compute, 1-95 nodes. For compute-intensive queries, notebook execution fails with the error below. Any help would be highly appreciated.

 

Job aborted due to stage failure: Task 167 in stage 2082.0 failed 4 times, most recent failure: Lost task 167.3 in stage 2082.0 (TID 594169) (vm-e5657433 executor 274): ExecutorLostFailure (executor 274 exited caused by one of the running tasks) Reason: Container from a bad node: container_1720586556487_0001_01_000296 on host: vm-e5657433. Exit status: 137. Diagnostics: [2024-07-10 05:41:58.081]Container killed on request. Exit code is 137
[2024-07-10 05:41:58.098]Container exited with a non-zero exit code 137.
[2024-07-10 05:41:58.101]Killed by external signal

6 REPLIES
FelixL
Advocate II

Hi! Were you able to find a solution to your issue? I am facing similar issues when migrating jobs from Azure Synapse to Fabric. The jobs run fine in Synapse, but in Fabric I get "bad node" / executor lost error messages left and right. No idea why... (memory and CPU are not saturated at the time the executors crash)

 

Thanks! 

Anonymous
Not applicable

Hi @abhradwip ,

The error message "Exit code is 137" usually means that the container exceeded its memory limit or was manually terminated. There are several steps you can take to find the problem:

  1. Monitor memory usage in the Spark UI to determine whether certain executors are running out of memory.
  2. Consider increasing the memory allocation for the executors (see the sketch after this list).
  3. Check the operational status of the node identified in the error message (vm-e5657433).
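
A minimal sketch of item 2, run as a Python cell in the notebook: it only reports what the session actually received, since executor memory cannot be raised mid-session; any increase has to go through the pool/environment settings or a session-configuration cell (for example the %%configure magic, if your runtime supports it) followed by a session restart. The config keys are standard Spark settings.

    # Inspect the memory/core settings the current session actually received.
    # If these are too small for the workload, raise them in the pool or
    # environment settings (or via a %%configure cell, if available) and
    # restart the session -- they cannot be changed while the session runs.
    for key in ["spark.driver.memory",
                "spark.executor.memory",
                "spark.executor.memoryOverhead",
                "spark.executor.cores",
                "spark.executor.instances"]:
        # "not set" means the runtime default is in effect for that key
        print(key, "=", spark.conf.get(key, "not set"))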

If that does not help, you can share your code with me so I can look into the problem. In the meantime, please provide some details: for example, did it run normally before and only start failing today? If so, did you change anything before running it?

 

Best Regards,

Ada Wang

If this post helps, then please consider accepting it as the solution to help the other members find it more quickly.

 

Hello @Anonymous, thanks a lot for your response. The issue is not resolved; please check the details below.

 

Use case:

  • Data size: 10 TB of retail data distributed across 24 Delta tables in the Lakehouse, with 4 tables larger than 1 TB
  • Spark SQL Notebook - Select queries
  • Configurations tested:
    1. Capacity: F128 (Number of nodes 96 – Max utilized)
      • Spark cluster: Runtime 1.2(Spark 3.4, Delta 2.4)
      • Node size – Medium, Large, X Large, XX Large (Same issue with All)
      • Autoscale enabled
    2. Capacity F256 (Number of nodes 192 – Max utilized)
      • Spark cluster: Runtime 1.2(Spark 3.4, Delta 2.4)
      • Node size – Medium, Large, X Large, XX Large (Same issue with All)
      • Autoscale enabled
    3. Execution: Spark SQL Notebook (Analytical queries) against lakehouse
  • Issues:
    1. Bad Node : Spark_System_Executor_ExitCode137BadNode - Job aborted due to stage failure: Task 167 in stage 2082.0 failed 4 times, most recent failure: Lost task 167.3 in stage 2082.0 (TID 594169) (vm-e5657433 executor 274): ExecutorLostFailure (executor 274 exited caused by one of the running tasks) Reason: Container from a bad node: container_1720586556487_0001_01_000296 on host: vm-e5657433.
    2. Executor lost: The job failed because the executor it was running on was lost. This may happen because the task crashed the JVM. Exit code 137 occurs when a container (Spark executor) runs out of memory and YARN automatically kills it. Possible causes: 1. driver memory issues 2. executor memory issues 3. executor lost. Check the logs and see whether increasing the number of Spark partitions or reducing the number of executor cores helps (a sketch of those runtime knobs follows this list).
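
A minimal sketch of those two knobs, assuming a PySpark cell in the same notebook; the partition count and table name are purely illustrative, and executor cores themselves can only be reduced in the pool or session settings, not from a running cell.

    # Raise the shuffle parallelism so each shuffle task handles a smaller
    # slice of data and is less likely to blow past executor memory.
    # 1600 is an illustrative value; tune it to your data volume.
    spark.conf.set("spark.sql.shuffle.partitions", "1600")

    # For one specific heavy query, repartitioning the input up front has a
    # similar effect. "big_retail_table" is a hypothetical table name.
    df = spark.read.table("big_retail_table").repartition(1600)
    df.createOrReplaceTempView("big_retail_table_repartitioned")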

We are also seeing lots of 137 errors. It is frustrating that on Fabric we have no visibility into the YARN diagnostics. I think it is a major blind spot and creates a lot of confusion.

 

We shouldn't be getting meaningless errors based on YARN configurations and YARN logs that aren't available to us. Spark is complex enough as it is, but when Fabric puts blinders on us, it becomes pretty frustrating to use. Some of us are trying to run mission-critical workloads here, and we can't even see the YARN logs.

Anyway, for those of you who aren't convinced that 137 means a YARN memory error ("early OOM"), you should google it. This behavior has been part of YARN/Spark for a very long time, on a lot of platforms.

 

You should also start monitoring the "Executors" tab in the Spark UI, and check the boxes at the top to see the heap usage of the executors. If you are exhausting 30, 40, or 50 GB of RAM in your executors, you will be able to see that in real time as it happens! For us the problem was the "optimized delta writes" feature. It consumes a massive amount of RAM (about 20x the amount of data in Parquet) and needs to be avoided whenever executors are crashing; a sketch of disabling it is below.
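
A minimal sketch of turning that feature off, assuming the session setting name documented for Synapse/Fabric Spark (spark.microsoft.delta.optimizeWrite.enabled) and the Delta table property delta.autoOptimize.optimizeWrite; verify both names against your runtime. The table name below is hypothetical.

    # Option 1: disable optimized write for the whole session.
    # Setting name assumed from the Synapse/Fabric docs; confirm on your runtime.
    spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "false")

    # Option 2: disable it for a single Delta table via a table property.
    # "my_lakehouse_table" is a hypothetical table name.
    spark.sql("""
        ALTER TABLE my_lakehouse_table
        SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'false')
    """)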

 

 

 

I finished the MT CSS support case (pro).  The engineer is Chirag on Deepak's team in the Eastern US timezone.

 

They have a way to use Kusto logs to retrieve YARN messages. Unfortunately they wouldn't share the Kusto query syntax, and they say the telemetry logs are internal in any case.

 

Below is the message that they say they retrieved. Obviously they are able to retrieve log data directly from YARN, unlike their customers. The following is verbatim from Chirag.

  • When the memory limit is reached, the container is terminated.

2025-10-21 23:05:16,763 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: root, capacity=1.0, absoluteCapacity=1.0, maxCapacity=1.0, absoluteMaxCapacity=1.0, state=RUNNING, acls=SUBMIT_APP:*ADMINISTER_QUEUE:*, labels=*,

 

This indicates the capacity reached 100%.

Hopefully this is helpful. I'm still not happy that customers are blindfolded when we encounter YARN-related failures.

Anonymous
Not applicable

Hi @abhradwip ,

Can I ask whether your problem is resolved? It looks like an intermittent issue; you might consider updating your drivers, clearing your cache, refreshing your browser, and re-running your notebook to see if the issue is resolved.

 

Best Regards,

Ada Wang

If this post helps, then please consider accepting it as the solution to help the other members find it more quickly.
