Container exited with a non-zero exit code 137

Anonymous — Mon, 30 Dec 2024 02:46:17 GMT

Hi, I have a Delta table with 252,322,508 rows. The data contains some duplicates. I have a merge statement that deletes these duplicates (I can easily identify them with a query, and there are around 500k duplicate rows). I have tried liquid clustering and partitioning on year and month columns, but each time I run a merge command along the lines of:

delete_duplicates_sql = f"""
MERGE INTO delta.`{target_table_path}` AS target
USING (
    SELECT * FROM RankedRowsToDelete
) AS source
ON source.{target_id} = target.{target_id}
    AND {watermark_join_on_expression}
    AND COALESCE(CAST(source.{layer}_pipeline_insert_date AS TIMESTAMP), '1970-01-01 00:00:00')
        = COALESCE(CAST(target.{layer}_pipeline_insert_date AS TIMESTAMP), '1970-01-01 00:00:00')
    AND (target.year = year AND target.month = month)
WHEN MATCHED THEN DELETE
"""

I get a Container exited with a non-zero exit code 137 after about 20 or so minutes. This error code seems to imply some memory issue.
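For reference, exit codes above 128 conventionally mean the process was terminated by a signal, with the code being 128 plus the signal number. So 137 corresponds to signal 9 (SIGKILL), which is consistent with the "Killed by external signal" line in the error output. A minimal sketch of that arithmetic in Python, assuming a POSIX system; the helper name `describe_exit_code` is made up for illustration:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit code into a readable description."""
    if code > 128:
        # Convention: 128 + N means "terminated by signal N".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"unknown signal {signum}"
        return f"killed by signal {signum} ({name})"
    return f"exited normally with status {code}"


print(describe_exit_code(137))
```

The exit code alone does not say who sent the SIGKILL. On YARN-style clusters it is often the node manager or the operating system's out-of-memory killer reacting to a container exceeding its memory limit, but that is an inference, not something the code proves.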

Py4JJavaError: An error occurred while calling o358.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 44.0 failed 4 times, most recent failure: Lost task 5.3 in stage 44.0 (TID 5891) (vm executor ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Container from a bad node: container on host: vm-. Exit status: 137. Diagnostics:
[2024-12-29 23:55:12.171]Container killed on request. Exit code is 137
[2024-12-29 23:55:12.203]Container exited with a non-zero exit code 137.
[2024-12-29 23:55:12.212]Killed by external signal

I've tried modifying the workspace environment, going from 4 small executor nodes to 10 medium executor nodes, and this does not solve the issue either. Does anyone have any recommendations?

Re: Container exited with a non-zero exit code 137

FelixL — Mon, 10 Feb 2025 20:08:29 GMT

Were you able to get this issue fixed? I am experiencing the exact same issue: a lot of executors failing with exit code 137. I am migrating jobs that currently run fine (daily, never once crashing) from Azure Synapse into Fabric. I am using identical pool sizes, but even so, the Fabric jobs are crashing left and right.

Even when doubling the Spark pool size (going from 3x small nodes to 3x medium nodes) I am seeing similar executor failures. Sometimes the jobs manage to finish; sometimes they pull the Livy session down with them and the entire application fails.

Monitoring the Spark application's memory usage during execution, the executors are only saturated to around 50% memory usage when they die. They do, however, almost always die when fully utilized on CPU...

I have tried everything: disabled persisting of dataframes, increased overhead memory on executors, ... But no change; Fabric just can't keep my simple jobs alive. And they are simple: reading from Delta, saving to Delta, working with 100MB-4GB Delta tables. This can be run on a potato, but apparently not in Fabric.
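On the overhead memory point: in open-source Spark on YARN, the container's memory limit is roughly the executor heap (`spark.executor.memory`) plus `spark.executor.memoryOverhead`, and the overhead defaults to 10% of the executor memory with a floor of 384 MiB. The sketch below only does that arithmetic. The 28 GiB heap is a made-up example and not a Fabric node size, the function name is hypothetical, and Fabric may apply different defaults or add other components (off-heap or PySpark memory) that are ignored here.

```python
MIN_OVERHEAD_MIB = 384          # Spark's documented minimum overhead
DEFAULT_OVERHEAD_FACTOR = 0.10  # Spark's documented default factor for JVM jobs


def container_limit_mib(executor_memory_mib, overhead_mib=None,
                        overhead_factor=DEFAULT_OVERHEAD_FACTOR):
    """Approximate the container memory request for one executor, in MiB.

    Simplification: ignores off-heap and PySpark memory settings.
    """
    if overhead_mib is None:
        # Default: a fraction of the heap, but never below the minimum.
        overhead_mib = max(MIN_OVERHEAD_MIB,
                           int(executor_memory_mib * overhead_factor))
    return executor_memory_mib + overhead_mib


# Hypothetical 28 GiB heap with default overhead, then with 4 GiB set explicitly.
print(container_limit_mib(28 * 1024))
print(container_limit_mib(28 * 1024, overhead_mib=4096))
```

If the memory graph being monitored tracks only JVM heap, a container can be near its limit through off-heap or native allocations while the heap still looks half empty. Whether that applies to the jobs described here is not something this thread establishes.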

Note: I am not using the Native Execution Engine, because that **bleep** brings its own set of issues to the party, with Gluten exploding in my face at every turn. So this should be as close to 1:1 with Azure Synapse as it gets, I would think.
