dbeavon3 · Memorable Member

Spark Cluster's scheduler is killing container for some reason

Part-way through the writing of a Delta table, a YARN container is being intentionally killed. Obviously this is causing problems for the job.

 

Can someone tell me why this is happening in my cluster?

 

The first indication that things are about to break is when this appears in the logs:

2025-10-10 14:57:02,686 INFO YarnSchedulerBackend$YarnDriverEndpoint [dispatcher-CoarseGrainedScheduler]: Disabling executor 1.

 

... it is immediately followed by "executor lost" messages, and other problems:

[Screenshot: "executor lost" messages in the driver log]

Here are the subsequent ERROR messages from the next couple of seconds of the driver log:

 

[Screenshot: ERROR messages from the driver log]

The errors above are from YarnClusterScheduler and TaskSetManager.

 

From the perspective of my custom code, the impact is that the Delta write operation fails:

[Screenshot: the failed Delta write, with the error below]

ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container from a bad node: container_1760106952523_0001_01_000002 on host: vm-01e87934. Exit status: 137


Everything above is from the driver.  
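For what it's worth, the exit status looks like the usual "128 + signal number" convention, which would make this a SIGKILL, i.e. something (the OS or YARN itself) force-killed the container rather than the JVM dying on its own. A quick sanity check of that arithmetic:

    import signal

    exit_status = 137
    sig = exit_status - 128              # exit codes above 128 usually mean "terminated by a signal"
    print(signal.Signals(sig).name)      # prints "SIGKILL" (signal 9)

I'm assuming that convention holds for YARN container exit statuses as well.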


Here is what I see in the notebook:

[Screenshot: error surfaced in the notebook]

    df_all_years.write.mode("overwrite").format("delta").save(lake_table_path)
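For reference, the write itself is nothing exotic. Here is a self-contained sketch of the pattern (the DataFrame contents and the table path below are placeholders, not my real data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder stand-in for df_all_years; the real DataFrame is assembled earlier in the notebook.
    df_all_years = spark.range(0, 1_000_000).toDF("value")

    # Placeholder path; the real one points at a lakehouse table location.
    lake_table_path = "Tables/sample_delta_table"

    # The call that intermittently fails with ExecutorLostFailure / exit status 137.
    df_all_years.write.mode("overwrite").format("delta").save(lake_table_path)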

 

 

Here is the Spark UI showing the death of the executor during the Delta write:

[Screenshot: Spark UI showing the executor lost during the Delta write]

The tasks for that job die part-way through (11 out of 20). They are using minimal RAM:

 

[Screenshot: task-level metrics showing minimal memory usage]

The problem happens inconsistently from one day to the next. Sometimes writing my Delta tables works and other times it fails. This is a pretty crappy experience. The driver and executors are configured with 28 GB of RAM each:

[Screenshot: pool configuration with 28 GB of RAM for the driver and executors]
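For comparison, on OSS Spark that sizing is just the standard properties shown in the sketch below (the values are illustrative; I'm assuming Fabric maps its pool settings onto these same knobs, and that the off-heap overhead allowance is also in play, since exceeding the total container limit is one common way to get a YARN kill):

    from pyspark.sql import SparkSession

    # Sketch of the equivalent OSS Spark sizing; values are illustrative, not my exact Fabric settings.
    spark = (
        SparkSession.builder
        .config("spark.driver.memory", "28g")
        .config("spark.executor.memory", "28g")
        # Off-heap headroom on top of the executor heap. If heap + overhead exceeds
        # the container's limit, YARN will kill the container.
        .config("spark.executor.memoryOverhead", "4g")
        .getOrCreate()
    )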

 

Any help would be appreciated. I need to be able to see the reason why YARN is killing the container, and I cannot. Nor can I see the CPU or memory usage on the executor or the related VM before it bites the dust. This code runs fine on OSS Spark, but the Fabric environment is causing mysterious failures and giving me no surface area for doing any investigation.

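In case it clarifies what I'm trying to see: on OSS Spark I would normally poll the driver's monitoring REST API for per-executor memory while the job runs, roughly like the sketch below. It assumes the driver UI is reachable from the notebook, which may not be the case in Fabric; that is exactly the visibility I'm missing.

    import requests
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    base_url = sc.uiWebUrl                # e.g. http://<driver-host>:4040
    app_id = sc.applicationId

    # Standard ExecutorSummary fields from the Spark monitoring REST API.
    resp = requests.get(f"{base_url}/api/v1/applications/{app_id}/executors", timeout=10)
    for ex in resp.json():
        print(ex["id"], ex["hostPort"],
              "memoryUsed:", ex["memoryUsed"], "of", ex["maxMemory"],
              "activeTasks:", ex["activeTasks"])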

 

 

 

 
