<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Fabric Spark fails on jobs that Azure Synapse Spark completes. (executor lost, bad node 137) in Data Engineering</title>
    <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4405429#M7138</link>
    <description>&lt;P&gt;I am in the process of migrating my entire warehousing solution from Azure Synapse Analytics into Fabric. All my jobs are developed in Spark notebooks in Synapse, so I figured this would be an easy move (Spark to Spark). However, once I had migrated and started running the jobs in Fabric, I noticed that the majority of my jobs (all of which run fine daily in Synapse) cause executors to fail in Fabric, and in many cases even bring down the entire Spark session.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am running my jobs on a small cluster in Synapse, with one driver and two executors, so I have simulated similar performance in Fabric by configuring an environment with a small pool using 1-3 nodes. On paper this should mean an identical number of CPU cores and an identical RAM assignment, and it does (I checked the Spark config while sessions were running).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The problem is that Fabric repeatedly fails to finish the jobs that Synapse runs with ease. The recurring error message I get is:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;Lost executor 1 on vm-bbb21618: Container from a bad node: container_1739271142627_0001_01_000002 on host: vm-bbb21618. Exit status: 137. Diagnostics: [2025-02-11 11:08:05.211]Container killed on request. Exit code is 137&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;[2025-02-11 11:08:05.278]Container exited with a non-zero exit code 137.&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;[2025-02-11 11:08:05.291]Killed by external signal&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I assume this has to do with a memory shortage on the executor - but how come the job runs fine in Synapse? Are there any fundamental differences in how Synapse and Fabric operate when it comes to Spark?&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The difference I can see when comparing the Spark configuration between Synapse and Fabric is that Fabric assigns all of its memory as off-heap (all 28 GB in the case of a small executor/node), whereas Synapse does not seem to do this. Exactly what effect this has, if any, I unfortunately do not know; hence my asking here.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;A job running for 5 minutes can easily go through the initial executors plus 3-4 additional ones, as they die one by one.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I tried brute-forcing the issue by doubling the Spark pool/memory (from 3 small nodes to 3 medium nodes), and this worked "better". But even then I eventually lose some executors to exit code 137, and in some cases lose the entire session to a Livy failure.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I am running the latest version of Spark available in Synapse, and using the latest Fabric Runtime (1.3). The native execution engine is turned off in Fabric.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Here is an example job execution, where an executor fails halfway through. The job is as simple as it gets: select 30 million records from a Hive view (containing selects from some Delta tables and a few joins - nothing special) and merge the result into a Delta table.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="FelixL_0-1739352851761.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1237881i36FA67A63FD45A47/image-size/medium?v=v2&amp;amp;px=400" role="button" title="FelixL_0-1739352851761.png" alt="FelixL_0-1739352851761.png" /&gt;&lt;/span&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The error messages shown above recur regardless of which jobs I run. I get these error messages, as well as "Unable to update table xxxxx", after almost every successful table load.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="FelixL_1-1739352993873.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1237882iD4818880138BD38A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="FelixL_1-1739352993873.png" alt="FelixL_1-1739352993873.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The stderr log for the lost executor doesn't show me any error messages, but it does show that there was a lot of free memory available at the time it went down...&lt;/P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="FelixL_0-1739355007835.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1237907iF771EE25FD9EB6C8/image-size/medium?v=v2&amp;amp;px=400" role="button" title="FelixL_0-1739355007835.png" alt="FelixL_0-1739355007835.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;DIV&gt;Has anyone been successful in migrating Spark jobs to Fabric? Has anyone else experienced "random" crashes on the executors?&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Wed, 12 Feb 2025 11:26:20 GMT</pubDate>
    <dc:creator>FelixL</dc:creator>
    <dc:date>2025-02-12T11:26:20Z</dc:date>
    <item>
      <title>Fabric Spark fails on jobs that Azure Synapse Spark completes. (executor lost, bad node 137)</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4405429#M7138</link>
      <description>&lt;P&gt;I am in the process of migrating my entire warehousing solution from Azure Synapse Analytics into Fabric. All my jobs are developed in Spark notebooks in Synapse, so I figured this would be an easy move (Spark to Spark). However, once I had migrated and started running the jobs in Fabric, I noticed that the majority of my jobs (all of which run fine daily in Synapse) cause executors to fail in Fabric, and in many cases even bring down the entire Spark session.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am running my jobs on a small cluster in Synapse, with one driver and two executors, so I have simulated similar performance in Fabric by configuring an environment with a small pool using 1-3 nodes. On paper this should mean an identical number of CPU cores and an identical RAM assignment, and it does (I checked the Spark config while sessions were running).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The problem is that Fabric repeatedly fails to finish the jobs that Synapse runs with ease. The recurring error message I get is:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;Lost executor 1 on vm-bbb21618: Container from a bad node: container_1739271142627_0001_01_000002 on host: vm-bbb21618. Exit status: 137. Diagnostics: [2025-02-11 11:08:05.211]Container killed on request. Exit code is 137&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;[2025-02-11 11:08:05.278]Container exited with a non-zero exit code 137.&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&lt;EM&gt;[2025-02-11 11:08:05.291]Killed by external signal&lt;/EM&gt;&lt;/DIV&gt;&lt;DIV class="lia-indent-padding-left-30px"&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I assume this has to do with a memory shortage on the executor - but how come the job runs fine in Synapse? Are there any fundamental differences in how Synapse and Fabric operate when it comes to Spark?&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The difference I can see when comparing the Spark configuration between Synapse and Fabric is that Fabric assigns all of its memory as off-heap (all 28 GB in the case of a small executor/node), whereas Synapse does not seem to do this. Exactly what effect this has, if any, I unfortunately do not know; hence my asking here (see the session-config sketch at the end of this post).&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;A job running for 5 minutes can easily go through the initial executors plus 3-4 additional ones, as they die one by one.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I tried brute-forcing the issue by doubling the Spark pool/memory (from 3 small nodes to 3 medium nodes), and this worked "better". But even then I eventually lose some executors to exit code 137, and in some cases lose the entire session to a Livy failure.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I am running the latest version of Spark available in Synapse, and using the latest Fabric Runtime (1.3). The native execution engine is turned off in Fabric.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Here is an example job execution, where an executor fails halfway through. The job is as simple as it gets: select 30 million records from a Hive view (containing selects from some Delta tables and a few joins - nothing special) and merge the result into a Delta table (a PySpark sketch of the job is at the end of this post).&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="FelixL_0-1739352851761.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1237881i36FA67A63FD45A47/image-size/medium?v=v2&amp;amp;px=400" role="button" title="FelixL_0-1739352851761.png" alt="FelixL_0-1739352851761.png" /&gt;&lt;/span&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;The error messages shown above recur regardless of which jobs I run. I get these error messages, as well as "Unable to update table xxxxx", after almost every successful table load.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="FelixL_1-1739352993873.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1237882iD4818880138BD38A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="FelixL_1-1739352993873.png" alt="FelixL_1-1739352993873.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The stderr log for the lost executor doesn't show me any error messages, but it does show that there was a lot of free memory available at the time it went down...&lt;/P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="FelixL_0-1739355007835.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1237907iF771EE25FD9EB6C8/image-size/medium?v=v2&amp;amp;px=400" role="button" title="FelixL_0-1739355007835.png" alt="FelixL_0-1739355007835.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;DIV&gt;Has anyone been successful in migrating Spark jobs to Fabric? Has anyone else experienced "random" crashes on the executors?&lt;/DIV&gt;
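&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;For reference, the failing job is essentially the following PySpark (database, table, and key names are placeholders, and the view's join logic is simplified away):&lt;/DIV&gt;&lt;PRE&gt;from delta.tables import DeltaTable

# Read the ~30 million source rows from the Hive view
# (my_db.vw_source is a placeholder name).
src = spark.sql("SELECT * FROM my_db.vw_source")

# Merge into the target Delta table on the business key.
target = DeltaTable.forName(spark, "my_db.target_table")
(target.alias("t")
    .merge(src.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())&lt;/PRE&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;And this is the session-config experiment I mentioned above: first inspecting the memory settings the running session actually got, then forcing allocation back on-heap at session start. I have not verified that Fabric honours this override, so treat it as a sketch rather than a fix:&lt;/DIV&gt;&lt;PRE&gt;# In any cell: print the memory-related settings of the running session.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if "memory" in key.lower():
        print(key, "=", value)&lt;/PRE&gt;&lt;PRE&gt;%%configure -f
{
    "conf": {
        "spark.memory.offHeap.enabled": "false"
    }
}&lt;/PRE&gt;&lt;DIV&gt;(The %%configure cell has to be the first cell run in the notebook, since it restarts the session.)&lt;/DIV&gt;&lt;/DIV&gt;</description>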
      <pubDate>Wed, 12 Feb 2025 11:26:20 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4405429#M7138</guid>
      <dc:creator>FelixL</dc:creator>
      <dc:date>2025-02-12T11:26:20Z</dc:date>
    </item>
    <item>
      <title>Re: Fabric Spark fails on jobs that Azure Synapse Spark completes. (executor lost, bad node 137)</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4406975#M7159</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/608280"&gt;@FelixL&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Thanks for reaching out to the Microsoft fabric community forum.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We sincerely apologise for the inconvenience caused. Please reach out to Microsoft Support by raising a ticket.&lt;/P&gt;
&lt;P&gt;Please refer to the link below on how to raise a support ticket.&lt;BR /&gt;&lt;A href="https://learn.microsoft.com/en-us/power-bi/support/create-support-ticket" target="_blank"&gt;How to create a Fabric and Power BI Support ticket - Power BI | Microsoft Learn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, if you have any insights or suggestions for the Fabric platform, please refer to the link below.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://ideas.fabric.microsoft.com/" target="_self"&gt;Microsoft Fabric Ideas&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If I have misunderstood your needs or you still have problems, please feel free to let us know.&lt;/P&gt;
&lt;P&gt;Best Regards,&lt;BR /&gt;Hammad.&lt;BR /&gt;Community Support Team&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If this post helps, please mark it as a solution so that other members can find it more quickly.&lt;/P&gt;
&lt;P&gt;Thank you.&lt;/P&gt;</description>
      <pubDate>Wed, 12 Feb 2025 12:11:52 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4406975#M7159</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2025-02-12T12:11:52Z</dc:date>
    </item>
    <item>
      <title>Re: Fabric Spark fails on jobs that Azure Synapse Spark completes. (executor lost, bad node 137)</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4413814#M7273</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/608280"&gt;@FelixL&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;As we haven't heard back from you, we are following up on our previous message. I'd like to confirm whether you've successfully resolved this issue or whether you need further help.&lt;/P&gt;
&lt;P&gt;If yes, you are welcome to share your workaround and mark it as a solution so that other users can benefit as well. If you found a reply particularly helpful, you can also mark it as a solution.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;If you still have any questions or need more support, please feel free to let us know. We are more than happy to continue helping you.&lt;BR /&gt;Thank you for your patience; we look forward to hearing from you.&lt;/P&gt;</description>
      <pubDate>Mon, 17 Feb 2025 12:25:33 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4413814#M7273</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2025-02-17T12:25:33Z</dc:date>
    </item>
    <item>
      <title>Re: Fabric Spark fails on jobs that Azure Synapse Spark completes. (executor lost, bad node 137)</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4415852#M7316</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The problem persists. I am investigating this together with MS support right now. We have verified multiple cases where Synapse successfully runs jobs, but Fabric seems to slowly accumulate garbage in memory that is not released (at least not to the same extent as in Synapse). Unfortunately there is no solution as of yet. The workaround is to scale the jobs to run with 3-4x the pool size compared to Synapse; then they &lt;STRONG&gt;usually&lt;/STRONG&gt; do not crash in Fabric.&lt;/P&gt;
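&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;One thing worth trying between table loads, on the theory that cached state is part of what accumulates (an assumption on my part, not something support has confirmed):&lt;/P&gt;&lt;PRE&gt;# Between table loads: drop everything cached in the session, in case
# cached tables/DataFrames are what slowly builds up on the executors.
spark.catalog.clearCache()

# If a DataFrame was persisted explicitly, release it as well
# (some_cached_df is a placeholder):
# some_cached_df.unpersist(blocking=True)&lt;/PRE&gt;</description>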
      <pubDate>Tue, 18 Feb 2025 12:19:43 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4415852#M7316</guid>
      <dc:creator>FelixL</dc:creator>
      <dc:date>2025-02-18T12:19:43Z</dc:date>
    </item>
    <item>
      <title>Re: Fabric Spark fails on jobs that Azure Synapse Spark completes. (executor lost, bad node 137)</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4724719#M10000</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/608280"&gt;@FelixL&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;We are following up once again regarding your query. Could you please confirm if the issue has been resolved through the support ticket with Microsoft?&lt;/P&gt;
&lt;P&gt;If the issue has been resolved, we kindly request you to share the resolution or key insights here to help others in the community. If we don’t hear back, we’ll go ahead and close this thread.&lt;/P&gt;
&lt;P&gt;Should you need further assistance in the future, we encourage you to reach out via the Microsoft Fabric Community Forum and create a new thread. We’ll be happy to help.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thank you for your understanding and participation.&lt;/P&gt;</description>
      <pubDate>Mon, 09 Jun 2025 01:40:27 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Fabric-Spark-fails-on-jobs-that-Azure-Synapse-Spark-completes/m-p/4724719#M10000</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2025-06-09T01:40:27Z</dc:date>
    </item>
  </channel>
</rss>

