<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Fabric Spark fails on jobs that Azure Synapse Spark completes. (executor lost, bad node 137) in Data Engineering</title>
    <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4405429#M7138</link>
    <description>&lt;P&gt;I am in the process of migrating my entire warehousing solution from Azure Synapse Analytics into Fabric. All my jobs are developed in Spark notebooks in Synapse, so I figured this would be an easy move (Spark to Spark). However, once I had migrated and started running the jobs in Fabric, I noticed that the majority of my jobs (all of which run fine daily in Synapse) cause executors to fail in Fabric, and in many cases even bring down the entire Spark session.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am running my jobs on a small cluster in Synapse, with one driver and two executors, so I have simulated similar performance in Fabric by configuring an environment with a small pool using 1-3 nodes. On paper this should mean an identical number of CPU cores and an identical RAM assignment, and it does (I checked the Spark config while sessions were running).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The problem is that Fabric repeatedly fails to finish the jobs that Synapse runs with ease. The recurring error message I get is:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;Lost executor 1 on vm-bbb21618: Container from a bad node: container_1739271142627_0001_01_000002 on host: vm-bbb21618. Exit status: 137. Diagnostics: [2025-02-11 11:08:05.211]Container killed on request. Exit code is 137&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;[2025-02-11 11:08:05.278]Container exited with a non-zero exit code 137.&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;[2025-02-11 11:08:05.291]Killed by external signal&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I assume this has to do with a memory shortage on the executor - but how come the job runs fine in Synapse? Are there any fundamental differences in how Synapse and Fabric operate when it comes to Spark?&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The difference I can see when comparing the Spark configuration between Synapse and Fabric is that Fabric assigns all of its memory as off-heap (all 28 GB in the case of a small executor/node), whereas Synapse does not seem to do this. Exactly what effect this has, if any, I unfortunately do not know; hence my asking here.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;A job running for 5 minutes can easily go through the initial executors plus 3-4 additional ones, as they die one by one.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I tried brute-forcing the issue by doubling the Spark pool/memory (from 3 small nodes to 3 medium nodes), and this worked "better". But even then I eventually lose some executors to exit code 137, and in some cases lose the entire session to a Livy failure.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I am running the latest version of Spark available in Synapse, and using the latest Fabric Runtime (1.3). The native execution engine is turned off in Fabric.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Here is an example job execution, where an executor fails halfway through. The job is as simple as it gets: select 30 million records from a Hive view (containing selects from some Delta tables and a few joins - nothing special) and merge the result into a Delta table.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="FelixL_0-1739352851761.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1237881i36FA67A63FD45A47/image-size/medium?v=v2&amp;amp;px=400" role="button" title="FelixL_0-1739352851761.png" alt="FelixL_0-1739352851761.png" /&gt;&lt;/span&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The error messages shown above recur regardless of which jobs I run. I get these error messages, as well as "Unable to update table xxxxx", after almost every successful table load.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="FelixL_1-1739352993873.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1237882iD4818880138BD38A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="FelixL_1-1739352993873.png" alt="FelixL_1-1739352993873.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The stderr log for the lost executor doesn't show me any error messages, but it does show that there was a lot of free memory available at the time it went down...&lt;/P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="FelixL_0-1739355007835.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1237907iF771EE25FD9EB6C8/image-size/medium?v=v2&amp;amp;px=400" role="button" title="FelixL_0-1739355007835.png" alt="FelixL_0-1739355007835.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;DIV&gt;Has anyone been successful in migrating Spark jobs to Fabric? Has anyone else experienced "random" crashes on the executors?&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Wed, 12 Feb 2025 11:26:20 GMT</pubDate>
    <dc:creator>FelixL</dc:creator>
    <dc:date>2025-02-12T11:26:20Z</dc:date>
    <item>
      <title>Fabric Spark fails on jobs that Azure Synapse Spark completes. (executor lost, bad node 137)</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4405429#M7138</link>
      <description>&lt;P&gt;I am in the process of migrating my entire warehousing solution from Azure Synapse Analytics into Fabric. All my jobs are developed in Spark notebooks in Synapse, so I figured this would be an easy move (Spark to Spark). However, once I had migrated and started running the jobs in Fabric, I noticed that the majority of my jobs (all of which run fine daily in Synapse) cause executors to fail in Fabric, and in many cases even bring down the entire Spark session.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am running my jobs on a small cluster in Synapse, with one driver and two executors, so I have simulated similar performance in Fabric by configuring an environment with a small pool using 1-3 nodes. On paper this should mean an identical number of CPU cores and an identical RAM assignment, and it does (I checked the Spark config while sessions were running).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The problem is that Fabric repeatedly fails to finish the jobs that Synapse runs with ease. The recurring error message I get is:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;Lost executor 1 on vm-bbb21618: Container from a bad node: container_1739271142627_0001_01_000002 on host: vm-bbb21618. Exit status: 137. Diagnostics: [2025-02-11 11:08:05.211]Container killed on request. Exit code is 137&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;[2025-02-11 11:08:05.278]Container exited with a non-zero exit code 137.&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;[2025-02-11 11:08:05.291]Killed by external signal&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I assume this has to do with a memory shortage on the executor - but how come the job runs fine in Synapse? Are there any fundamental differences in how Synapse and Fabric operate when it comes to Spark?&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The difference I can see when comparing the Spark configuration between Synapse and Fabric is that Fabric assigns all of its memory as off-heap (all 28 GB in the case of a small executor/node), whereas Synapse does not seem to do this. Exactly what effect this has, if any, I unfortunately do not know; hence my asking here (see the session-config sketch at the end of this post).&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;A job running for 5 minutes can easily go through the initial executors plus 3-4 additional ones, as they die one by one.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I tried brute-forcing the issue by doubling the Spark pool/memory (from 3 small nodes to 3 medium nodes), and this worked "better". But even then I eventually lose some executors to exit code 137, and in some cases lose the entire session to a Livy failure.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I am running the latest version of Spark available in Synapse, and using the latest Fabric Runtime (1.3). The native execution engine is turned off in Fabric.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Here is an example job execution, where an executor fails halfway through. The job is as simple as it gets: select 30 million records from a Hive view (containing selects from some Delta tables and a few joins - nothing special) and merge the result into a Delta table (a PySpark sketch of the job is at the end of this post).&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="FelixL_0-1739352851761.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1237881i36FA67A63FD45A47/image-size/medium?v=v2&amp;amp;px=400" role="button" title="FelixL_0-1739352851761.png" alt="FelixL_0-1739352851761.png" /&gt;&lt;/span&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The error messages shown above recur regardless of which jobs I run. I get these error messages, as well as "Unable to update table xxxxx", after almost every successful table load.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="FelixL_1-1739352993873.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1237882iD4818880138BD38A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="FelixL_1-1739352993873.png" alt="FelixL_1-1739352993873.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The stderr log for the lost executor doesn't show me any error messages, but it does show that there was a lot of free memory available at the time it went down...&lt;/P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="FelixL_0-1739355007835.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1237907iF771EE25FD9EB6C8/image-size/medium?v=v2&amp;amp;px=400" role="button" title="FelixL_0-1739355007835.png" alt="FelixL_0-1739355007835.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;DIV&gt;Has anyone been successful in migrating Spark jobs to Fabric? Has anyone else experienced "random" crashes on the executors?&lt;/DIV&gt;
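&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;For reference, the failing job is essentially the following PySpark (database, table, and key names are placeholders, and the view's join logic is simplified away):&lt;/DIV&gt;&lt;PRE&gt;from delta.tables import DeltaTable

# Read the ~30 million source rows from the Hive view
# (my_db.vw_source is a placeholder name).
src = spark.sql("SELECT * FROM my_db.vw_source")

# Merge into the target Delta table on the business key.
target = DeltaTable.forName(spark, "my_db.target_table")
(target.alias("t")
    .merge(src.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())&lt;/PRE&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;And this is the session-config experiment I mentioned above: first inspecting the memory settings the running session actually got, then forcing allocation back on-heap at session start. I have not verified that Fabric honours this override, so treat it as a sketch rather than a fix:&lt;/DIV&gt;&lt;PRE&gt;# In any cell: print the memory-related settings of the running session.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if "memory" in key.lower():
        print(key, "=", value)&lt;/PRE&gt;&lt;PRE&gt;%%configure -f
{
    "conf": {
        "spark.memory.offHeap.enabled": "false"
    }
}&lt;/PRE&gt;&lt;DIV&gt;(The %%configure cell has to be the first cell run in the notebook, since it restarts the session.)&lt;/DIV&gt;&lt;/DIV&gt;</description>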
      <pubDate>Wed, 12 Feb 2025 11:26:20 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4405429#M7138</guid>
      <dc:creator>FelixL</dc:creator>
      <dc:date>2025-02-12T11:26:20Z</dc:date>
    </item>
    <item>
      <title>Re: Fabric Spark fails on jobs that Azure Synapse Spark completes. (executor lost, bad node 137)</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4406975#M7159</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/608280"&gt;@FelixL&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Thanks for reaching out to the Microsoft fabric community forum.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We sincerely apologise for the inconvenience caused. Please reach out to Microsoft Support by raising a ticket.&lt;/P&gt;
&lt;P&gt;Please refer to the link below on how to raise a support ticket.&lt;BR /&gt;&lt;A href="https://learn.microsoft.com/en-us/power-bi/support/create-support-ticket" target="_blank"&gt;How to create a Fabric and Power BI Support ticket - Power BI | Microsoft Learn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, if you have any insights or suggestions for the Fabric platform, please refer to the link below.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://ideas.fabric.microsoft.com/" target="_self"&gt;Microsoft Fabric Ideas&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If I have misunderstood your needs or you still have problems, please feel free to let us know.&lt;/P&gt;
&lt;P&gt;Best Regards,&lt;BR /&gt;Hammad.&lt;BR /&gt;Community Support Team&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If this post helps, please mark it as a solution so that other members can find it more quickly.&lt;/P&gt;
&lt;P&gt;Thank you.&lt;/P&gt;</description>
      <pubDate>Wed, 12 Feb 2025 12:11:52 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4406975#M7159</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2025-02-12T12:11:52Z</dc:date>
    </item>
    <item>
      <title>Re: Fabric Spark fails on jobs that Azure Synapse Spark completes. (executor lost, bad node 137)</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4413814#M7273</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/608280"&gt;@FelixL&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;As we haven't heard back from you, we are following up on our previous message. I'd like to confirm whether you've successfully resolved this issue or whether you need further help.&lt;/P&gt;
&lt;P&gt;If yes, you are welcome to share your workaround and mark it as a solution so that other users can benefit as well. If you found a reply particularly helpful, you can also mark it as a solution.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;If you still have any questions or need more support, please feel free to let us know. We are more than happy to continue helping you.&lt;BR /&gt;Thank you for your patience; we look forward to hearing from you.&lt;/P&gt;</description>
      <pubDate>Mon, 17 Feb 2025 12:25:33 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4413814#M7273</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2025-02-17T12:25:33Z</dc:date>
    </item>
    <item>
      <title>Re: Fabric Spark fails on jobs that Azure Synapse Spark completes. (executor lost, bad node 137)</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4415852#M7316</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The problem persists. I am investigating this together with MS support right now. We have verified multiple cases where Synapse successfully runs jobs, but Fabric seems to slowly accumulate garbage in memory that is not released (at least not to the same extent as in Synapse). Unfortunately there is no solution as of yet. The workaround is to scale the jobs to run with 3-4x the pool size compared to Synapse; then they &lt;STRONG&gt;usually&lt;/STRONG&gt; do not crash in Fabric.&lt;/P&gt;
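&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;One thing worth trying between table loads, on the theory that cached state is part of what accumulates (an assumption on my part, not something support has confirmed):&lt;/P&gt;&lt;PRE&gt;# Between table loads: drop everything cached in the session, in case
# cached tables/DataFrames are what slowly builds up on the executors.
spark.catalog.clearCache()

# If a DataFrame was persisted explicitly, release it as well
# (some_cached_df is a placeholder):
# some_cached_df.unpersist(blocking=True)&lt;/PRE&gt;</description>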
      <pubDate>Tue, 18 Feb 2025 12:19:43 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4415852#M7316</guid>
      <dc:creator>FelixL</dc:creator>
      <dc:date>2025-02-18T12:19:43Z</dc:date>
    </item>
    <item>
      <title>Re: Fabric Spark fails on jobs that Azure Synapse Spark completes. (executor lost, bad node 137)</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4724719#M10000</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/608280"&gt;@FelixL&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;We are following up once again regarding your query. Could you please confirm if the issue has been resolved through the support ticket with Microsoft?&lt;/P&gt;
&lt;P&gt;If the issue has been resolved, we kindly request you to share the resolution or key insights here to help others in the community. If we don’t hear back, we’ll go ahead and close this thread.&lt;/P&gt;
&lt;P&gt;Should you need further assistance in the future, we encourage you to reach out via the Microsoft Fabric Community Forum and create a new thread. We’ll be happy to help.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thank you for your understanding and participation.&lt;/P&gt;</description>
      <pubDate>Mon, 09 Jun 2025 01:40:27 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4724719#M10000</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2025-06-09T01:40:27Z</dc:date>
    </item>
  </channel>
</rss>

