<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Container exited with a non-zero exit code 137 in Data Engineering</title>
    <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Container-exited-with-a-non-zero-exit-code-137/m-p/4403643#M7113</link>
    <description>&lt;P&gt;Were you able to get this issue fixed? I am experiencing the exact same issue: a lot of executors failing with error code 137. I am migrating jobs that currently run fine (daily, never once crashing) from Azure Synapse into Fabric. I am using identical pool sizes, but even so the Fabric jobs are crashing left and right.&lt;/P&gt;&lt;P&gt;Even when doubling the Spark pool size (going from 3x small nodes to 3x medium nodes) I am seeing similar executor failures. Sometimes the jobs manage to finish; sometimes they pull the Livy session down with them and the entire application fails.&lt;/P&gt;&lt;P&gt;Monitoring the Spark application's memory usage while executing, the executors are only saturated to around 50% memory usage when they die. They do, however, almost always die when fully utilized on CPU.&lt;/P&gt;&lt;P&gt;I have tried everything: disabled persisting of dataframes, increased overhead memory on executors, ... But no change; Fabric just can't keep my simple jobs alive. And they are simple: reading from Delta, saving to Delta, working with 100MB-4GB Delta tables. This could be run on a potato, but apparently not in Fabric.&lt;/P&gt;&lt;P&gt;Note: I am not using the Native Execution Engine, because that **bleep** brings its own set of issues to the party, with Gluten exploding in my face at every turn. So this should be as close to 1:1 with Azure Synapse as it gets, I would think.&lt;/P&gt;</description>
    <pubDate>Mon, 10 Feb 2025 20:08:29 GMT</pubDate>
    <dc:creator>FelixL</dc:creator>
    <dc:date>2025-02-10T20:08:29Z</dc:date>
    <item>
      <title>Container exited with a non-zero exit code 137</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Container-exited-with-a-non-zero-exit-code-137/m-p/4345721#M5806</link>
      <description>&lt;P&gt;Hi, I have a Delta table with 252,322,508 rows. The data contains some duplicates. I have a merge statement that deletes these duplicates (I can easily identify them with a query, and there are around 500k duplicate rows). I have tried liquid clustering and partitioning on year and month columns, but each time I run a merge command along the lines of:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;delete_duplicates_sql = f"""
MERGE INTO delta.`{target_table_path}` AS target
USING (
SELECT * FROM RankedRowsToDelete
) AS source
ON source.{target_id} = target.{target_id} AND {watermark_join_on_expression} AND COALESCE(CAST(source.{layer}_pipeline_insert_date AS TIMESTAMP), '1970-01-01 00:00:00') = COALESCE(CAST(target.{layer}_pipeline_insert_date AS TIMESTAMP), '1970-01-01 00:00:00') AND (target.year = year AND target.month = month)
WHEN MATCHED THEN DELETE
"""&lt;/LI-CODE&gt;&lt;P&gt;I get a "Container exited with a non-zero exit code 137" error after about 20 minutes. This exit code seems to imply a memory issue.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Py4JJavaError: An error occurred while calling o358.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 44.0 failed 4 times, most recent failure: Lost task 5.3 in stage 44.0 (TID 5891) (vm executor  ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Container from a bad node: container on host: vm-. Exit status: 137. Diagnostics: [2024-12-29 23:55:12.171]Container killed on request. Exit code is 137
[2024-12-29 23:55:12.203]Container exited with a non-zero exit code 137. 
[2024-12-29 23:55:12.212]Killed by external signal&lt;/LI-CODE&gt;&lt;P&gt;I've tried modifying the workspace environment, going from 4 executors on small nodes to 10 executors on medium nodes, and this does not solve the issue either. Does anyone have any recommendations?&lt;/P&gt;</description>
      <pubDate>Mon, 30 Dec 2024 02:46:17 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Container-exited-with-a-non-zero-exit-code-137/m-p/4345721#M5806</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-12-30T02:46:17Z</dc:date>
    </item>
    <item>
      <title>Re: Container exited with a non-zero exit code 137</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Container-exited-with-a-non-zero-exit-code-137/m-p/4403643#M7113</link>
      <description>&lt;P&gt;Were you able to get this issue fixed? I am experiencing the exact same issue: a lot of executors failing with error code 137. I am migrating jobs that currently run fine (daily, never once crashing) from Azure Synapse into Fabric. I am using identical pool sizes, but even so the Fabric jobs are crashing left and right.&lt;/P&gt;&lt;P&gt;Even when doubling the Spark pool size (going from 3x small nodes to 3x medium nodes) I am seeing similar executor failures. Sometimes the jobs manage to finish; sometimes they pull the Livy session down with them and the entire application fails.&lt;/P&gt;&lt;P&gt;Monitoring the Spark application's memory usage while executing, the executors are only saturated to around 50% memory usage when they die. They do, however, almost always die when fully utilized on CPU.&lt;/P&gt;&lt;P&gt;I have tried everything: disabled persisting of dataframes, increased overhead memory on executors, ... But no change; Fabric just can't keep my simple jobs alive. And they are simple: reading from Delta, saving to Delta, working with 100MB-4GB Delta tables. This could be run on a potato, but apparently not in Fabric.&lt;/P&gt;&lt;P&gt;Note: I am not using the Native Execution Engine, because that **bleep** brings its own set of issues to the party, with Gluten exploding in my face at every turn. So this should be as close to 1:1 with Azure Synapse as it gets, I would think.&lt;/P&gt;</description>
      <pubDate>Mon, 10 Feb 2025 20:08:29 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Container-exited-with-a-non-zero-exit-code-137/m-p/4403643#M7113</guid>
      <dc:creator>FelixL</dc:creator>
      <dc:date>2025-02-10T20:08:29Z</dc:date>
    </item>
  </channel>
</rss>

