abhradwip
New Member

Notebook execution failure due to bad node

Hello All,

We are currently using an F128 capacity with this Spark pool configuration: Runtime 1.2 (Spark 3.4, Delta 2.4), Medium compute, 1-95 nodes. For compute-intensive queries, notebook execution fails with the following error. Any help will be highly appreciated.

 

Job aborted due to stage failure: Task 167 in stage 2082.0 failed 4 times, most recent failure: Lost task 167.3 in stage 2082.0 (TID 594169) (vm-e5657433 executor 274): ExecutorLostFailure (executor 274 exited caused by one of the running tasks) Reason: Container from a bad node: container_1720586556487_0001_01_000296 on host: vm-e5657433. Exit status: 137. Diagnostics: [2024-07-10 05:41:58.081]Container killed on request. Exit code is 137
[2024-07-10 05:41:58.098]Container exited with a non-zero exit code 137.
[2024-07-10 05:41:58.101]Killed by external signal
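
A note for readers decoding the message above: by shell convention, an exit status above 128 means the process was ended by the signal numbered status minus 128, so 137 corresponds to signal 9 (SIGKILL on Linux), which matches the "Killed by external signal" line. The following is a minimal Python sketch that pulls the executor, host, and exit status out of such a message. The `parse_executor_loss` helper and its regular expressions are illustrative assumptions, not part of Spark or Fabric.

```python
import re

def parse_executor_loss(message: str) -> dict:
    """Extract executor id, host, and exit status from an ExecutorLostFailure message.

    Hypothetical helper for illustration; the patterns are fitted to the
    message quoted in this thread and may need adjusting for other logs.
    """
    executor = re.search(r"executor (\d+) exited", message)
    # Assumes short host names without dots, as in this thread.
    host = re.search(r"on host: ([\w-]+)", message)
    status = re.search(r"Exit status: (\d+)", message)

    exit_status = int(status.group(1)) if status else None
    return {
        "executor": int(executor.group(1)) if executor else None,
        "host": host.group(1) if host else None,
        "exit_status": exit_status,
        # Status above 128 means "ended by signal (status - 128)".
        "signal_number": exit_status - 128 if exit_status and exit_status > 128 else None,
    }
```

The signal number alone does not say who sent the kill; an out-of-memory kill and a manual termination both show up as 137.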

4 REPLIES
FelixL
Advocate II

Hi! Were you able to find a solution to your issue? I am facing similar issues when migrating jobs from Azure Synapse to Fabric. The jobs run fine in Synapse, but I get "bad node" executor-lost error messages left and right in Fabric. No idea why... (memory and CPU are not saturated at the time the executor crashes)

 

Thanks! 

Anonymous
Not applicable

Hi @abhradwip ,

The error message "Exit code is 137" usually means that the container exceeded its memory limit or was manually terminated. There are several steps you can take to find the problem:

  1. Monitor memory usage in the Spark UI to determine if certain nodes are out of memory.
  2. Consider increasing the memory allocation for the executors.
  3. Check the operational status of the node named in the error message, "vm-e5657433".
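
To make step 2 concrete: on YARN, the container limit is the executor heap plus an off-heap overhead, and a container that exceeds the total gets killed. In open-source Spark, `spark.executor.memoryOverhead` defaults to 10% of `spark.executor.memory` with a 384 MiB floor. The sketch below only does that arithmetic. The `container_memory_mib` helper is hypothetical, the 8 GiB heap is an arbitrary example, and whether a Fabric pool uses these same defaults is an assumption to verify against the session's Spark configuration.

```python
def container_memory_mib(executor_memory_mib: int,
                         overhead_factor: float = 0.10,
                         min_overhead_mib: int = 384) -> int:
    """Approximate the YARN container size for one executor, in MiB.

    Hypothetical helper: heap plus the default overhead rule from
    open-source Spark (10% of the heap, never less than 384 MiB).
    Ignores off-heap and PySpark memory settings.
    """
    overhead = max(int(executor_memory_mib * overhead_factor), min_overhead_mib)
    return executor_memory_mib + overhead

# Arbitrary example: an 8 GiB executor heap.
print(container_memory_mib(8 * 1024))
```

If executors die while heap usage in the Spark UI looks moderate, the overhead portion is a plausible place to look, since off-heap usage counts against the same container limit.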

If that does not help, you can share your code so that I can look for problems. In the meantime, some details would be useful: for example, did the notebook run normally before and only start failing today? If so, did anything change before you ran it?

 

Best Regards,

Ada Wang

If this post helps, then please consider accepting it as the solution to help other members find it more quickly.

 

Hello @Anonymous, thanks a lot for your response. The issue is not resolved; please see the details below:

 

Use case :

  • Data size: 10 TB of retail data distributed across 24 Delta tables in a Lakehouse; 4 tables are larger than 1 TB
  • Spark SQL Notebook - Select queries
  • Configurations tested:
    1. Capacity: F128 (Number of nodes 96 – Max utilized)
      • Spark cluster: Runtime 1.2(Spark 3.4, Delta 2.4)
      • Node size – Medium, Large, X Large, XX Large (Same issue with All)
      • Autoscale enabled
    2. Capacity F256 (Number of nodes 192 – Max utilized)
      • Spark cluster: Runtime 1.2(Spark 3.4, Delta 2.4)
      • Node size – Medium, Large, X Large, XX Large (Same issue with All)
      • Autoscale enabled
    3. Execution: Spark SQL Notebook (Analytical queries) against lakehouse
  • Issues:
    1. Bad Node : Spark_System_Executor_ExitCode137BadNode - Job aborted due to stage failure: Task 167 in stage 2082.0 failed 4 times, most recent failure: Lost task 167.3 in stage 2082.0 (TID 594169) (vm-e5657433 executor 274): ExecutorLostFailure (executor 274 exited caused by one of the running tasks) Reason: Container from a bad node: container_1720586556487_0001_01_000296 on host: vm-e5657433.
    2. Executor lost: The job failed because the executor that it was running on was lost. This may happen because the task crashed the JVM. Exit code 137 means a container (Spark executor) ran out of memory and YARN automatically killed it. Possible causes: 1. Driver memory issues 2. Executor memory issues 3. Executor lost. Please check the logs and see whether increasing the number of Spark partitions or reducing the number of executor cores helps.
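
The suggestion in that message to increase Spark partitions can be made concrete with rough arithmetic: divide the volume of data a stage shuffles by a target partition size, so each task holds less in memory at once. A sketch in Python follows. The `suggest_shuffle_partitions` helper and the 200 MiB target are illustrative assumptions rather than Fabric guidance; the result would typically be applied through the standard `spark.sql.shuffle.partitions` setting.

```python
import math

def suggest_shuffle_partitions(shuffle_bytes: int,
                               target_partition_bytes: int = 200 * 1024**2) -> int:
    """Rough partition count so each shuffle partition is near the target size.

    Hypothetical helper; the 200 MiB default is an assumed rule of thumb.
    """
    if shuffle_bytes <= 0:
        return 1
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))

# Example: a stage that shuffles about 1 TiB.
partitions = suggest_shuffle_partitions(1024**4)
print(partitions)

# In a notebook session (assumes the usual `spark` SparkSession object):
# spark.conf.set("spark.sql.shuffle.partitions", str(partitions))
```

With adaptive query execution enabled, Spark may coalesce small shuffle partitions at runtime, so the configured value acts as a starting point rather than a fixed count.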
Anonymous
Not applicable

Hi @abhradwip ,

Can I ask whether your problem is resolved? It looks like an intermittent issue. You might consider updating your drivers, clearing your cache, refreshing your browser, and then re-running your notebook to see whether the issue is resolved.

 

Best Regards,

Ada Wang

If this post helps, then please consider accepting it as the solution to help other members find it more quickly.
