dbeavon3 · Memorable Member

Spark Cluster's scheduler is killing container for some reason

Part-way through the writing of a Delta table, a YARN container is being intentionally killed. Obviously this is causing problems for the job.

 

Can someone tell me why this is happening in my cluster?

 

The first indication that things are about to break is when this appears in the logs:

2025-10-10 14:57:02,686 INFO YarnSchedulerBackend$YarnDriverEndpoint [dispatcher-CoarseGrainedScheduler]: Disabling executor 1.

 

... it is immediately followed by "executor lost" messages, and other problems:

[Screenshot: "executor lost" messages in the driver log]

Here are the subsequent ERROR messages from the next couple of seconds of the driver log:

 

[Screenshot: ERROR messages from the driver log]

The errors above are from YarnClusterScheduler and TaskSetManager.

 

From the perspective of my custom code, the impact is that the Delta write operation fails:

[Screenshot: the failed Delta write, with the error below]

ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container from a bad node: container_1760106952523_0001_01_000002 on host: vm-01e87934. Exit status: 137


Everything above is from the driver.  
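For what it's worth, the exit status looks like the usual "128 + signal number" convention, which would make this a SIGKILL, i.e. something (the OS or YARN itself) force-killed the container rather than the JVM dying on its own. A quick sanity check of that arithmetic:

    import signal

    exit_status = 137
    sig = exit_status - 128              # exit codes above 128 usually mean "terminated by a signal"
    print(signal.Signals(sig).name)      # prints "SIGKILL" (signal 9)

I'm assuming that convention holds for YARN container exit statuses as well.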


Here is what I see in the notebook:

[Screenshot: error surfaced in the notebook]

    df_all_years.write.mode("overwrite").format("delta").save(lake_table_path)
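For reference, the write itself is nothing exotic. Here is a self-contained sketch of the pattern (the DataFrame contents and the table path below are placeholders, not my real data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder stand-in for df_all_years; the real DataFrame is assembled earlier in the notebook.
    df_all_years = spark.range(0, 1_000_000).toDF("value")

    # Placeholder path; the real one points at a lakehouse table location.
    lake_table_path = "Tables/sample_delta_table"

    # The call that intermittently fails with ExecutorLostFailure / exit status 137.
    df_all_years.write.mode("overwrite").format("delta").save(lake_table_path)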

 

 

Here is the Spark UI showing the death of the executor during the Delta write:

[Screenshot: Spark UI showing the executor lost during the Delta write]

The tasks for that job die part-way through (11 out of 20). They are using minimal RAM:

 

[Screenshot: task-level metrics showing minimal memory usage]

The problem happens inconsistently from one day to the next. Sometimes writing my Delta tables works and other times it fails. This is a pretty crappy experience. The driver and executors are configured with 28 GB of RAM each:

[Screenshot: pool configuration with 28 GB of RAM for the driver and executors]
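For comparison, on OSS Spark that sizing is just the standard properties shown in the sketch below (the values are illustrative; I'm assuming Fabric maps its pool settings onto these same knobs, and that the off-heap overhead allowance is also in play, since exceeding the total container limit is one common way to get a YARN kill):

    from pyspark.sql import SparkSession

    # Sketch of the equivalent OSS Spark sizing; values are illustrative, not my exact Fabric settings.
    spark = (
        SparkSession.builder
        .config("spark.driver.memory", "28g")
        .config("spark.executor.memory", "28g")
        # Off-heap headroom on top of the executor heap. If heap + overhead exceeds
        # the container's limit, YARN will kill the container.
        .config("spark.executor.memoryOverhead", "4g")
        .getOrCreate()
    )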

 

Any help would be appreciated. I need to be able to see the reason why YARN is killing the container, and I cannot. Nor can I see the CPU or memory usage on the executor or the related VM before it bites the dust. This code runs fine on OSS Spark, but the Fabric environment is causing mysterious failures and giving me no surface area for doing any investigation.

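In case it clarifies what I'm trying to see: on OSS Spark I would normally poll the driver's monitoring REST API for per-executor memory while the job runs, roughly like the sketch below. It assumes the driver UI is reachable from the notebook, which may not be the case in Fabric; that is exactly the visibility I'm missing.

    import requests
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    base_url = sc.uiWebUrl                # e.g. http://<driver-host>:4040
    app_id = sc.applicationId

    # Standard ExecutorSummary fields from the Spark monitoring REST API.
    resp = requests.get(f"{base_url}/api/v1/applications/{app_id}/executors", timeout=10)
    for ex in resp.json():
        print(ex["id"], ex["hostPort"],
              "memoryUsed:", ex["memoryUsed"], "of", ex["maxMemory"],
              "activeTasks:", ex["activeTasks"])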

 

 

 

 
