This is related to another question I posted. In this case there are no major errors exposed to my notebook cells, but I noticed the Spark UI is going bonkers. See below.
Notice that executors are going crazy and are getting constantly killed and recreated.
The overall notebook seems to run to completion but not without a ton of drama going on in the executors. Some tasks actually fail and have to be resubmitted as you can see in the image below.
I'm hoping to understand why Fabric's spark is behaving so crazy in a successfully executed notebook. If I can understand the reason why this is the "proper" behavior, then maybe I will be able to better distinguish the jobs that are failing.
How do I drill down and find the reason why these executors are being replaced so quickly? The most meaningful information I could find was in the driver's "stderr". For the first (and ONLY) executor it kills in the image above, there is no explanation whatsoever:
Please help me understand why this executor was decommissioned, despite being the ONLY executor in the spark cluster.
Hi @dbeavon3 , Thank you for reaching out to the Microsoft Fabric Community Forum.
Spark is designed to free up and re-request executors as needed, so frequent executor removal, even of the only active executor in some cases, is normal with dynamic allocation, especially in cloud or managed environments like Fabric. This doesn’t create errors or job failures, because Spark simply reschedules interrupted tasks, aiming to optimize resource use for cost and throughput. The absence of error messages in logs reflects that this is expected, not a malfunction. If the executor churn is disruptive in your Spark UI or job timelines, disabling dynamic allocation or increasing idle timeouts are the recommended adjustments, leading to steadier executor lifetimes and visual stability.
Dynamic Allocation in Apache Spark - Cloudera Community - 368095
Configuration - Spark 4.0.1 Documentation
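For reference, here is a minimal sketch of the first adjustment (disabling dynamic allocation so the session keeps a fixed set of executors). This assumes the %%configure magic is run as the first cell of the notebook, before the Spark session starts, and the executor count shown is only a placeholder for your workload:

%%configure -f
{
    "conf": {
        "spark.dynamicAllocation.enabled": "false",
        "spark.executor.instances": "2"
    }
}

The same properties can usually also be set once in a workspace Environment under its Spark properties, so every session attached to that Environment picks them up.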
>> Spark is designed to free up and re-request executors as needed, so frequent executor removal, even of the only active executor in some cases, is normal
No, the behavior I'm seeing is definitely not normal for OSS Spark. The executors are killed for arbitrary/unknowable reasons while they are actively being used. We have a low number of max retries and don't want executors dying this frequently, or it will either extend the duration of our jobs or cause them to fail altogether! From a cost perspective we don't want long-running work being repeated and running up our Fabric CUs for no good reason.
I think the Microsoft flavor of Spark is doing something very unusual when it comes to killing executors. I found an undocumented configuration in the Microsoft flavor of Spark called "spark.yarn.executor.decommission.enabled". It is not part of OSS Spark, so I'm assuming this is what activates the unusual behavior in the Microsoft Spark environment. I think I remember this from Synapse Analytics workspaces as well, and had hoped they had left it behind. IMO, this unusual home-grown behavior shouldn't be enabled by default.
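For anyone who wants to see what their own session is running with, this is roughly how I check the relevant keys from a notebook cell (a sketch; keys that aren't set in a given runtime just fall back to the default string below):

# Inspect the allocation / decommission settings of the current session (PySpark).
# Keys that are not set return the supplied default instead of raising an error.
for key in [
    "spark.dynamicAllocation.enabled",
    "spark.dynamicAllocation.executorIdleTimeout",
    "spark.yarn.executor.decommission.enabled",  # undocumented, not part of OSS Spark
    "spark.task.maxFailures",
]:
    print(key, "=", spark.conf.get(key, "<not set>"))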
Hi @dbeavon3 , Thank you for reaching out to the Microsoft Community Forum.
We understand what you are saying and are really sorry to hear of the inconvenience. Can you please refer to the Microsoft documents below to check whether anything is unusual or odd in your executor and driver logs? These articles also describe ways to diagnose almost all kinds of odd behaviour, which may help you with this issue:
Spark memory issues - Azure Databricks | Microsoft Learn
Identifying an expensive read in Spark's DAG - Azure Databricks | Microsoft Learn
Diagnose cost and performance issues using the Spark UI - Azure Databricks | Microsoft Learn
Also, check out the documents linked within: Spark memory issues - Azure Databricks | Microsoft Learn
If these still don't help with the issue, then the next best course is to raise a Microsoft Support ticket, as this would need to be resolved on their end.
Below is the link to help create Microsoft Support ticket:
How to create a Fabric and Power BI Support ticket - Power BI | Microsoft Learn
I will probably need to open a support ticket, although it will take a 3+ week investment with the MT team and I might get nowhere. I have worked with this PG in the past (years ago on Synapse Analytics) and I don't look forward to doing it again.
In the existing logs (driver and executor) there are plenty of spots that show WHEN executors are being intentionally killed. There are plenty of opportunities for the PG developers to say WHY they are being killed, but they do not explain the WHY. If they are intentionally withholding that information from their logs, then the support case will be an agonizingly painful one. Some of the FTE engineers on the PG side (S.Y. and K.K.) do not seem to want to engage on problems related to failed executors or networking disconnections or whatever. Nor do they want to see things from the customer perspective, or care about costs. So it is very difficult to engage with this team about Spark. Even so, I will try.
Do you happen to know if there is any way in Fabric to get YARN logs? It is obvious that YARN is used as the scheduler, but it is not clear how customers are supposed to see the logs if something is misbehaving.
Hi @dbeavon3 , Thank you for reaching out to the Microsoft Community Forum.
We sincerely apologize for the experience you’ve had with the PG team. Please know that everyone at Microsoft is deeply committed to delivering exceptional customer service and ensuring your satisfaction. While I’m not part of the PG team, I’m confident they will be working diligently to resolve your issue as quickly as possible. We truly appreciate your patience and understanding as we work together toward a solution.
FYI, you shouldn't post a bunch of HDInsight links to a Fabric community. HDInsight is a dead product, and 99% of those docs won't apply to Fabric.
>> While I'm not part of the PG team, I'm confident they will be working diligently to resolve your issue as quickly as possible.
I'm not confident at all. Most of the time we don't even encounter Microsoft FTE's - only the Mindtree (MT) engineers who have about the same understanding of Spark as the rest of us.
I already tried to reach out to FTEs on the Fabric subreddit, but the Spark folks don't seem to engage. In my experience, if they don't engage on Reddit, then they DEFINITELY won't engage on support tickets. It sounds backwards, but it is 100% true.
I'm playing with an undocumented configuration, spark.yarn.executor.decommission.enabled = false. If that doesn't go well then I'll invest the three weeks on a support case with MT. Even if they don't know what is going on, I'm sure they will have encountered lots of prior customers with the same issue (sudden death of spark executors).
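For reference, this is roughly how I'm trying to set it, a sketch that assumes Fabric honours the property when it is supplied at session submission time via the %%configure magic (it may instead need to go into the Environment's Spark properties):

%%configure -f
{
    "conf": {
        "spark.yarn.executor.decommission.enabled": "false"
    }
}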
Hi,
1. Disable dynamic allocation.
Refer: pyspark - Why would Spark executors be removed (with "ExecutorAllocationManager: Request to remove e...
2. If you need to use dynamic allocation, then increase the timeout (see the config sketch after this list) - Configuration - Spark 4.0.1 Documentation
3. This article has a good explanation of the YARN integration mechanism. You might find it helpful. https://aws.amazon.com/blogs/big-data/spark-enhancements-for-elasticity-and-resiliency-on-amazon-emr...
If nothing works, create a support ticket.
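Here is a minimal sketch of option 2, assuming the session-level %%configure magic is available and is run before the session starts; the timeout values are placeholders, not recommendations:

%%configure -f
{
    "conf": {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.executorIdleTimeout": "600s",
        "spark.dynamicAllocation.cachedExecutorIdleTimeout": "1200s"
    }
}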
Proud to be a Super User!
I run lots of Spark jobs on other hosts (e.g. Databricks or OSS Spark on my machine) and I don't see this behavior there. Thanks for your links. They describe normal Spark functionality; we'd ideally see that happening in Fabric as well.
I mentioned this to @v-hashadapu, but I think what is happening in Fabric is not normal OSS behavior and is driven by a configuration that is not part of OSS (spark.yarn.executor.decommission.enabled).
What I really want to know is the REASON why these executors are getting deallocated mid-way thru a series of tasks. There is nothing in the logs that explains why Microsoft decides to kill the executor (although I can see it is being deliberately killed for reasons that have nothing to do with me).
Oddly enough I have observed Fabric killing the one and only executor performing work on tasks, and then blocking/waiting to create another to take its place. What I'm looking for is a way to investigate the reason for the rapid killing of executors, just minutes or seconds after they are created! I don't see any explanation in these logs.
I suspect the behavior may have to do with high memory usage (leaks), but I haven't found a place to look at the memory or CPU consumption in the cluster either. What a pain to have no visibility whatsoever into the nodes of my cluster! Someone made really poor decisions when it comes to being able to monitor and manage a cluster in Fabric!
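In case it helps anyone digging into the same thing: the Spark REST API served by the driver records a removeReason (and peak memory metrics) for dead executors, which is more than the stderr shows. A rough sketch, assuming the driver UI is reachable from the notebook process itself; in Fabric the UI is proxied, so sc.uiWebUrl may or may not resolve, and the localhost fallback is a guess:

import requests  # assumed to be available in the Fabric runtime

sc = spark.sparkContext
base = sc.uiWebUrl or "http://localhost:4040"  # fall back to the default driver UI port
app_id = sc.applicationId

# 'allexecutors' includes dead executors; each entry carries removeTime/removeReason.
resp = requests.get(f"{base}/api/v1/applications/{app_id}/allexecutors", timeout=30)
for e in resp.json():
    if not e.get("isActive", True):
        print(e["id"], e.get("removeTime"), e.get("removeReason"))
        print("  peak memory:", e.get("peakMemoryMetrics"))

There is no guarantee the Fabric runtime populates removeReason for its decommission path, but it is a cheap thing to check before opening the ticket.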