Hello All,
We are currently using an F128 capacity with the following Spark pool configuration: Runtime 1.2 (Spark 3.4, Delta 2.4), Medium compute, 1-95 nodes. For compute-intensive queries, notebook execution fails with the error below. Any help will be highly appreciated.
Job aborted due to stage failure: Task 167 in stage 2082.0 failed 4 times, most recent failure: Lost task 167.3 in stage 2082.0 (TID 594169) (vm-e5657433 executor 274): ExecutorLostFailure (executor 274 exited caused by one of the running tasks) Reason: Container from a bad node: container_1720586556487_0001_01_000296 on host: vm-e5657433. Exit status: 137. Diagnostics: [2024-07-10 05:41:58.081]Container killed on request. Exit code is 137
[2024-07-10 05:41:58.098]Container exited with a non-zero exit code 137.
[2024-07-10 05:41:58.101]Killed by external signal
Hi! Were you able to find a solution to your issue? I am facing similar problems when migrating jobs from Azure Synapse to Fabric. The jobs run fine in Synapse, but in Fabric I get "bad node" executor-lost errors left and right, and I have no idea why (memory and CPU are not saturated at the time the executors crash).
Thanks!
Hi @abhradwip ,
The error message "Exit code is 137" usually means that the container exceeded its memory limit and was killed, or was terminated externally. There are a few steps you can take to narrow down the problem:
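For example (purely as an illustration, with placeholder table and column names), reducing the amount of data each task holds at once is usually the first thing to try:

```python
# Illustrative sketch only: the table/column names are placeholders and the values
# need tuning for your workload. "spark" is the session the Fabric notebook provides.
from pyspark.sql import functions as F

# Keep each task's slice of data small enough to fit in executor memory.
spark.conf.set("spark.sql.shuffle.partitions", "800")                        # default is usually 200
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))  # 128 MB input splits

df = spark.read.table("my_lakehouse.big_fact")   # placeholder table

# Prune columns and filter early, and repartition on the aggregation key instead of
# letting a few huge, skewed partitions build up inside one container.
result = (
    df.select("key", "amount", "event_date")
      .filter(F.col("event_date") >= "2024-01-01")
      .repartition(800, "key")
      .groupBy("key")
      .agg(F.sum("amount").alias("total_amount"))
)

# Avoid collect()/toPandas() on large results; write them out instead.
result.write.mode("overwrite").saveAsTable("my_lakehouse.big_fact_totals")
```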
If that does not help, you can share your code with me so I can look into what is going wrong. In the meantime, please provide some details: for example, did this run normally before and only start failing today? If so, did anything else change before you ran it?
Best Regards,
Ada Wang
If this post helps, then please consider accepting it as the solution to help the other members find it more quickly.
Hello @Anonymous, thanks a lot for your response. The issue is not resolved; please see the details below.
Use case:
We are also seeing lots of 137 errors. It is frustrating that on Fabric we have no visibility into the YARN diagnostics. I think it is a major blind spot and creates a lot of confusion.
We shouldn't be getting meaningless errors based on YARN configuration and YARN logs that aren't available to us. Spark is complex enough as it is, but when Fabric puts blinders on us it becomes pretty frustrating to use. Some of us are trying to run mission-critical workloads here and can't even see the YARN logs.
Anyway, for those of you who aren't convinced that 137 means a YARN memory kill ("early OOM"), you should google it. This behavior has been part of YARN/Spark for a very long time, on a lot of platforms.
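If you want to confirm that memory pressure is the problem before blaming anything else, one cheap check is to look for skewed partitions feeding the failing stage. A rough sketch, with a placeholder table name:

```python
# Illustrative only: replace the table with whatever feeds the stage that keeps dying.
from pyspark.sql import functions as F

df = spark.read.table("my_lakehouse.big_fact")   # placeholder table

# Row count per Spark partition; a handful of partitions with far more rows than the
# rest are the usual suspects when a single container blows past its memory limit.
(
    df.groupBy(F.spark_partition_id().alias("partition_id"))
      .count()
      .orderBy(F.desc("count"))
      .show(10)
)
```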
You should also start monitoring the "Executors" tab in the Spark UI and check the boxes at the top to see the heap usage of the executors. If you are exhausting 30, 40, or 50 GB of RAM in your executors, you will surely see it in real time as it happens! For us the problem was this stupid feature called "optimized delta writes": it consumes a massive amount of RAM (around 20x the size of the data in Parquet) and needs to be avoided whenever executors are crashing.
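For reference, this is roughly how we turned it off. Take it as a sketch: the config key below is the one documented for optimized write on Fabric/Synapse Spark as far as I can tell, so verify it against the docs for your runtime before relying on it.

```python
# Sketch only: confirm the exact optimized-write config key for your Fabric runtime.
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "false")

# ... run the heavy write that was crashing executors ...
df = spark.read.table("my_lakehouse.staging_events")          # placeholder names
df.write.mode("append").saveAsTable("my_lakehouse.events")

# Re-enable it afterwards if you still want fewer, larger files on lighter writes.
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
```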
I finished the MT CSS support case (pro). The engineer is Chirag on Deepak's team in the Eastern US timezone.
They have a way to use Kusto logs to retrieve YARN messages. Unfortunately, they wouldn't share the Kusto query syntax, and they say the telemetry logs are internal in any case.
Below is the message they say they retrieved. Obviously they are able to retrieve log data directly from YARN, unlike their customers. The following is verbatim from Chirag:
2025-10-21 23:05:16,763 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: root, capacity=1.0, absoluteCapacity=1.0, maxCapacity=1.0, absoluteMaxCapacity=1.0, state=RUNNING, acls=SUBMIT_APP:*ADMINISTER_QUEUE:*, labels=*,
This indicates the capacity reached 100%.
Hopefully this is helpful. I'm still not satisfied that customers are left blindfolded when we hit YARN-related failures.
Hi @abhradwip ,
Can I ask if your problem is resolved? It looks as if it may be an intermittent issue; you might consider updating your drivers, clearing your cache, refreshing your browser, and then re-running your notebook to see if the issue is resolved.
Best Regards,
Ada Wang
If this post helps, then please consider accepting it as the solution to help the other members find it more quickly.