akakkar
New Member

MERGE - Notebook cell intermittently errors when merging into a Delta table with Optimize Write and Low Shuffle Merge enabled

I am trying to run a MERGE in a notebook with both optimize write and low shuffle merge enabled (set to true) in the cell. The cell sometimes executes successfully and sometimes throws an error. When I toggle the two settings (true/false, false/false, or false/true), the cell executes again. Can you please guide me on what might be causing this issue?

 

The code in the cell of the notebook is as below: 

 

from delta.tables import DeltaTable

# Enable optimize write and low shuffle merge for this session.
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.microsoft.delta.merge.lowShuffle.enabled", "true")

if account_table_exists:
    target_table = DeltaTable.forPath(spark, TARGET_ACCOUNT_PATH)
    try:
        # Update every column except the merge key.
        columns_to_update = [col for col in df_fullaccount.columns if col != "AccountId"]
        set_dict = {col: f"source.{col}" for col in columns_to_update}
        target_table.alias("target").merge(
            df_fullaccount.alias("source"), "target.AccountId = source.AccountId"
        ).whenNotMatchedInsertAll()\
        .whenMatchedUpdate(set=set_dict)\
        .execute()
        print("Merge operation completed successfully.")
    except Exception as e:
        raise RuntimeError("An error occurred during the merge operation") from e

if not account_table_exists:
    try:
        df_fullaccount.write.format("delta").mode("overwrite").save(TARGET_ACCOUNT_PATH)
        print("Account table created and loaded.")
    except Exception as e:
        print(f"An error occurred during the table create/table load operation: {e}")
        raise RuntimeError("An error occurred during the table create/table load operation") from e

# Note: "DEBUG == True & account_table_exists" parses as
# "DEBUG == (True & account_table_exists)" because & binds tighter
# than ==; use the boolean operator "and" instead.
if DEBUG and account_table_exists:
    print(columns_to_update)
    print(set_dict)
 
 
ERROR
 
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In[56], line 13
      8 set_dict = {f"{col}": f"source.{col}" for col in columns_to_update}
      9 target_table.alias("target").merge(
     10     df_fullaccount.alias("source"),"target.AccountId = source.AccountId"
     11 ).whenNotMatchedInsertAll()\
     12 .whenMatchedUpdate(set=set_dict)\
---> 13 .execute()
     14 print("Merge operation completed successfully.")

File /usr/hdp/current/spark3-client/jars/delta-spark_2.12-3.2.0.19.jar/delta/tables.py:1065, in DeltaMergeBuilder.execute(self)
   1060 """
   1061 Execute the merge operation based on the built matched and not matched actions.
   1062
   1063 See :py:class:`~delta.tables.DeltaMergeBuilder` for complete usage details.
   1064 """
-> 1065 self._jbuilder.execute()

File ~/cluster-env/trident_env/lib/python3.11/site-packages/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:

File /opt/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py:179, in capture_sql_exception.<locals>.deco(*a, **kw)
   178 try:
--> 179     return f(*a, **kw)
   180 except Py4JJavaError as e:

File ~/cluster-env/trident_env/lib/python3.11/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
   325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
   327         "An error occurred while calling {0}{1}{2}.\n".
   328         format(target_id, ".", name), value)
   329 else:

Py4JJavaError: An error occurred while calling o19020.execute.
1 ACCEPTED SOLUTION
v-lgarikapat
Community Support

Hi @akakkar ,

Thanks for reaching out to the Microsoft Fabric community forum.

Thanks for sharing the details and the error trace. Based on your description and the stack trace (a Py4JJavaError raised during .merge().execute()), the issue is likely caused by one or more of the following factors, especially in combination with the Delta Lake optimization settings you're using.
Data Skew or Partitioning Issues
Using spark.microsoft.delta.merge.lowShuffle.enabled=true can lead to shuffle or skew-related failures if the data is not evenly distributed (e.g., if AccountId has many repeated values).
Incompatible Schema or Column Mapping
The set_dict used for updating may include columns with mismatched data types or missing in either the source or target table, which can cause merge execution to fail.
Delta Table State or Concurrency Conflicts
If another process is writing to the Delta table or if there's a stale transaction log, the merge might fail intermittently.
Resource Constraints or Executor Failures
Optimizations like optimizeWrite and lowShuffle can increase memory or shuffle pressure, causing sporadic failures under certain cluster loads.
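The skew point above can be sanity-checked without a cluster. This is a minimal sketch (the helper name `skew_report` is hypothetical, not a Spark or Fabric API) that tallies a collected sample of merge-key values; in PySpark the equivalent check would typically be a `groupBy("AccountId").count()` ordered by count descending.

```python
from collections import Counter

def skew_report(keys, top_n=3):
    """Tally a sample of merge-key values and return the heaviest keys.

    Returns (key, count, share_of_sample) tuples. A few keys holding
    a large share of the sample suggests the kind of skew that a
    low-shuffle merge is sensitive to.
    """
    keys = list(keys)
    counts = Counter(keys)
    total = len(keys)
    return [(k, c, c / total) for k, c in counts.most_common(top_n)]
```

For example, `skew_report(["a", "a", "a", "b"])` reports that key `"a"` holds 75% of the sample, a red flag for a merge keyed on that column.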
Disable Optimizations for Isolation Testing
Try disabling the optimizations to confirm stability:
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "false")
spark.conf.set("spark.microsoft.delta.merge.lowShuffle.enabled", "false")
Validate Column Mappings
Ensure all columns in set_dict exist in both DataFrames and their data types match.
print(df_fullaccount.dtypes)
print(target_table.toDF().dtypes)
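Beyond eyeballing the two dtypes listings, the differences can be computed directly. A small sketch (the helper name `schema_diff` is hypothetical) that diffs two `df.dtypes`-style lists of (column, type) pairs:

```python
def schema_diff(source_dtypes, target_dtypes):
    """Compare two df.dtypes-style lists of (column, type) pairs.

    Returns columns missing on either side and columns whose types
    differ -- the mismatches that can make a Delta MERGE fail.
    """
    src, tgt = dict(source_dtypes), dict(target_dtypes)
    missing_in_target = sorted(set(src) - set(tgt))
    missing_in_source = sorted(set(tgt) - set(src))
    type_mismatches = {c: (src[c], tgt[c])
                       for c in src.keys() & tgt.keys()
                       if src[c] != tgt[c]}
    return missing_in_target, missing_in_source, type_mismatches
```

Feeding it `df_fullaccount.dtypes` and `target_table.toDF().dtypes` immediately surfaces any column referenced by set_dict that is absent or typed differently on the target side.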
Repartition Input Data
Improve distribution to prevent skew:
df_fullaccount = df_fullaccount.repartition("AccountId")
Add Full Error Logging
To see the actual Java exception behind the Py4JJavaError, print the full traceback inside the except block before re-raising:
import traceback
traceback.print_exc()
Check Spark UI Logs
If the above doesn’t isolate the issue, please review the executor logs in the Spark UI for detailed exception traces.
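The isolation-testing step above can also be automated: attempt the merge with low shuffle enabled, and retry once with it disabled if the first attempt fails. This is only a sketch; `merge_with_fallback` is a hypothetical helper, and the merge and configuration calls are injected as callables so the logic can be exercised without a cluster (in a notebook they would wrap the actual merge and `spark.conf.set`).

```python
def merge_with_fallback(run_merge, set_conf):
    """Run the merge with low shuffle enabled; if it fails, retry without it.

    run_merge: zero-argument callable that performs the Delta merge.
    set_conf: callable wrapping spark.conf.set(key, value).
    """
    set_conf("spark.microsoft.delta.merge.lowShuffle.enabled", "true")
    try:
        run_merge()
        return "merged with low shuffle"
    except Exception:
        # One retry on the standard merge path; if this also fails,
        # the problem is not the low-shuffle optimization.
        set_conf("spark.microsoft.delta.merge.lowShuffle.enabled", "false")
        run_merge()
        return "merged without low shuffle"
```

If the fallback path consistently succeeds where the optimized path fails, that narrows the cause to the low shuffle merge setting rather than schema or concurrency issues.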

Low Shuffle Merge optimization on Delta tables - Azure Synapse Analytics | Microsoft Learn

Delta Lake table optimization and V-Order - Microsoft Fabric | Microsoft Learn

Tutorial: Delta Lake - Azure Databricks | Microsoft Learn

 

If this post helped resolve your issue, please consider giving it Kudos and marking it as the Accepted Solution. This not only acknowledges the support provided but also helps other community members find relevant solutions more easily.

We appreciate your engagement and thank you for being an active part of the community.

Best regards,
LakshmiNarayana

View solution in original post

4 REPLIES
v-lgarikapat
Community Support

Hi @akakkar ,

If your issue has been resolved, please consider marking the most helpful reply as the accepted solution. This helps other community members who may encounter the same issue to find answers more efficiently.

If you're still facing challenges, feel free to let us know—we’ll be glad to assist you further.

Looking forward to your response.

Best regards,
LakshmiNarayana.

Hi @akakkar ,

If your question has been answered, kindly mark the appropriate response as the Accepted Solution. This small step goes a long way in helping others with similar issues.

We appreciate your collaboration and support!

Best regards,
LakshmiNarayana

Hi @akakkar ,

 

As we haven't heard back from you, we are closing this thread. If you are still experiencing the same issue, we kindly request you to create a new thread, and we'll be happy to assist you further.

Thank you for your patience and support.

If our response was helpful, please mark it as Accepted as Solution and consider giving a Kudos. Feel free to reach out if you need any further assistance.
Best Regards,

Lakshmi Narayana

