dbeavon3
Memorable Member

Spark Cluster's scheduler is killing container for some reason

Part-way through writing a Delta table, a YARN container is being intentionally killed.  Obviously this is causing problems for the job.

 

Can someone tell me why this is happening in my cluster?

 

The first indication that things are about to break is when this appears in the logs:

2025-10-10 14:57:02,686 INFO YarnSchedulerBackend$YarnDriverEndpoint [dispatcher-CoarseGrainedScheduler]: Disabling executor 1.

 

... it is immediately followed by "executor lost" messages, and other problems:

[screenshot: "executor lost" messages in the driver log]

 

 

Here are the subsequent ERROR messages from the next couple of seconds of the driver log:

 

[screenshot: ERROR messages from the driver log]

 

 

 

The errors above are from YarnClusterScheduler and TaskSetManager.

 

From the perspective of my custom code, the impact is that the Delta write operation fails:

[screenshot: failed Delta write operation]

 

ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container from a bad node: container_1760106952523_0001_01_000002 on host: vm-01e87934. Exit status: 137


Everything above is from the driver.  


Here is what I see in the notebook:

[screenshot: notebook error output]

 

 

            df_all_years.write.mode("overwrite").format("delta").save(lake_table_path)

 

 

Here is the Spark UI showing the death of the executor during the Delta write:

[screenshot: Spark UI executor timeline showing the lost executor]

 

 

The tasks for that job die part-way through (11 out of 20).  They are using minimal RAM:

 

[screenshot: task metrics showing 11 of 20 tasks completed, with minimal memory usage]

 

The problem happens inconsistently from one day to another.  Sometimes writing my Delta tables will work and other times it will fail.  This is a pretty crappy experience.  The driver and executors are configured to have 28 GB of RAM each:

[screenshot: Spark pool configuration showing 28 GB per driver/executor]
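For reference, here is a minimal sketch for confirming those settings from inside the notebook. It assumes the standard spark session object that the notebook provides, and simply reads back the launch-time configuration:

# Minimal sketch: read back the memory-related settings the session was launched with,
# so the 28 GB figure can be confirmed from inside the notebook.
conf = spark.sparkContext.getConf()
for key in ("spark.driver.memory",
            "spark.executor.memory",
            "spark.executor.memoryOverhead",
            "spark.executor.cores",
            "spark.executor.instances"):
    print(key, "=", conf.get(key, "<not set>"))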

 

Any help would be appreciated. I need to be able to see the reason why YARN is killing the container, and I cannot. Nor can I see the CPU or memory usage on the executor or the related VM before it bites the dust.  This code runs fine on OSS Spark, but the Fabric environment is causing mysterious failures and giving me no surface area for investigation.


 

 

 

 

1 ACCEPTED SOLUTION

I finished the MT CSS support case (pro).  The engineer is Chirag on Deepak's team in the Eastern US timezone.

 

They have a way to use Kusto logs to retrieve YARN messages.  Unfortunately, they wouldn't share the Kusto query syntax.  And they say the telemetry logs are internal in any case.

 

Below is the message that they say they retrieved.  Obviously they are able to retrieve log data directly from YARN, unlike their customers.  The following is verbatim from Chirag.

 

 

  • When the memory limit is reached, the container is terminated.

2025-10-21 23:05:16,763 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: root, capacity=1.0, absoluteCapacity=1.0, maxCapacity=1.0, absoluteMaxCapacity=1.0, state=RUNNING, acls=SUBMIT_APP:*ADMINISTER_QUEUE:*, labels=*,

 

This indicates the capacity reached 100%.

 

 

Hopefully this is helpful.  I'm still not happy that customers are blindfolded when we encounter YARN-related failures.


8 REPLIES
v-pnaroju-msft
Community Support

Hi dbeavon3,

We are following up to inquire whether you have raised the support ticket. If you have already done so, we kindly request you to share your feedback regarding the issue raised.
If you need any more assistance, please feel free to connect with the Microsoft Fabric community.

Thank you.

v-pnaroju-msft
Community Support

Hi dbeavon3,

We sincerely apologize for the inconvenience caused. To enable further investigation, we kindly request you to raise a Microsoft support ticket using the link: Microsoft Fabric Support and Status | Microsoft Fabric

If you have any further queries, please feel free to contact the Microsoft Fabric community.

Thank you.

v-pnaroju-msft
Community Support

Hi dbeavon3,

We would like to follow up and see whether the details we shared have resolved your problem.
If you need any more assistance, please feel free to connect with the Microsoft Fabric community.

Thank you.

No, it doesn't help.  The reason I posted was to get a specific diagnosis for the killing of an executor, and I'm not any closer to that.  Since it appears to be deliberately killed by this Spark environment, the REASON for that should be discoverable.  But I haven't found the messages in the logs saying WHY Spark killed my executor, and you haven't shown me where to look for those messages.

v-pnaroju-msft
Community Support

Hi dbeavon3,

Thank you for contacting the Microsoft Fabric Community Forum.

Based on my understanding, this behaviour can occur when YARN terminates the Spark executor container (exit code 137, which corresponds to the process being killed with SIGKILL, i.e. 128 + 9) due to excessive memory usage or node health problems. In Fabric, this may happen if the executor exceeds its allocated memory (including Python or native memory), or if a node becomes unstable, resulting in container removal and subsequent Delta write failures.

Please follow the steps below, which may help resolve the issue:

  1. Use a larger compute size, for example: Medium or Memory Optimised.
  2. Repartition the data or reduce shuffle pressure to lower per-task memory usage (see the sketch after this list).
  3. Enable Spark monitoring and diagnostics to capture executor metrics prior to failure.
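As an illustration of point 2, here is a minimal sketch. The names df_all_years and lake_table_path come from the original post; the partition count of 64 is an arbitrary example that should be tuned to the data volume:

# Minimal sketch of point 2: spread the Delta write across more, smaller tasks
# so that each task buffers less data at once. 64 partitions is an arbitrary example.
(df_all_years
    .repartition(64)
    .write
    .mode("overwrite")
    .format("delta")
    .save(lake_table_path))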

Additionally, please refer to the following link:
Notebook contextual monitoring and debugging - Microsoft Fabric | Microsoft Learn

We hope the information provided helps to resolve the issue. Should you have any further queries, please feel free to contact the Microsoft Fabric Community.

Thank you.

Hi @v-pnaroju-msft 

 

>>Based on my understanding, this behaviour can occur when YARN terminates the Spark executor container (exit code 137) due to excessive memory usage or node health problems


Can you share a source for this?  There is no evidence of excessive memory usage or health problems; otherwise I would not have posted all the logs above.  The executor is killed arbitrarily, according to the details I shared in this example.

 

I understand that there are various theoretical reasons why an executor might need to be terminated.  This is not helpful.  I'm looking for evidence to show why this particular executor was deliberately terminated.

I'd rather have evidence than rely on guesswork.

 

This is the reason I shared so many logs and observations.  If you believe the executor ran out of memory, then show me the OOM error or tell me where to find it.  I'm not going to double the size of the nodes based on guesswork... that simply puts money into Microsoft's pocket and doesn't even guarantee that the problem will be permanently resolved.

 

FYI, the link you shared does not give a way to see the actual memory or CPU consumed.  If you believe memory is the problem, then please share an approach for monitoring the memory consumption of these drivers and executors.
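(For anyone following along: one way to approximate this from inside the application is Spark's standard monitoring REST API, which exposes per-executor memory figures. The sketch below assumes the driver's UI endpoint is reachable from the notebook, which should be verified in Fabric; the endpoint path itself is standard Spark.)

import requests

def executor_memory_snapshot(spark):
    # Print current and peak executor memory from Spark's monitoring REST API.
    sc = spark.sparkContext
    url = f"{sc.uiWebUrl}/api/v1/applications/{sc.applicationId}/executors"
    for e in requests.get(url).json():
        peak = (e.get("peakMemoryMetrics") or {}).get("JVMHeapMemory")
        print(f"executor {e['id']}: memoryUsed={e.get('memoryUsed')} "
              f"maxMemory={e.get('maxMemory')} peakJVMHeap={peak}")

# Call from another cell (or a background thread) while the Delta write runs.
executor_memory_snapshot(spark)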

 

 


I also wanted to mention that the main reason we keep running into memory problems is a feature that takes effect when saving a Delta table.  It was a feature called "optimized" Delta table storage.

 

...It was super obnoxious, and we found the setting to disable it, thereby saving massive amounts of RAM in the Spark executors.

 

I'm told that my workspace was affected because it was created at the beginning of the year.  Workspaces created after mid-2025 will no longer use this functionality by default and won't have as many memory issues.
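For anyone searching later, here is a minimal sketch of what disabling it looks like per session. The config key below is the commonly documented one for Delta optimized write, but key names can vary by runtime version, so verify it against the Fabric documentation for your runtime before relying on it:

# Sketch only: turn off Delta "optimized write" for the session before saving.
# Verify the exact config key against the Fabric docs for your Spark runtime.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "false")

df_all_years.write.mode("overwrite").format("delta").save(lake_table_path)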
