lavginqo3
Regular Visitor

How to perform a MERGE using Spark SQL without listing every column or using "blind updates"

Hello,

 

I’m trying to sync a Bronze table to a Silver table in a Lakehouse. My source is a full overwrite, so I need to handle inserts, updates, and deletes (specifically records that "vanish" from the source).

 

The Setup: I tried the following standard MERGE logic:

 

SQL

MERGE INTO silver AS t
USING bronze AS s
ON t.id = s.id
WHEN MATCHED AND <condition_to_prevent_blind_update> THEN
UPDATE SET *
WHEN NOT MATCHED BY TARGET THEN
INSERT *
WHEN NOT MATCHED BY SOURCE THEN
DELETE


The Goal: I want to avoid blind updates (updating rows where data hasn't actually changed) to save on compute and time. This is the whole purpose of my optimization, as the table is quite wide.

 

The Problem:

I do not want to manually list 50+ columns in the WHEN MATCHED condition for maintenance reasons.

I tried hash(t.*) != hash(s.*), but Spark SQL does not allow the * wildcard expansion for the target table (t) inside the MERGE condition. It throws a [DELTA_MERGE_UNRESOLVED_EXPRESSION] error.

Question: Is there a straightforward way in Spark SQL to compare the entire row of the target (t) and source (s) without explicitly naming every column? Or is the only way to save on compute/time to use a PySpark wrapper that dynamically generates the hash/column string?

 

Thanks!

1 ACCEPTED SOLUTION
Lodha_Jaydeep
Solution Supplier

Hi @lavginqo3 ,
Thanks for reaching out to the Fabric community. Happy to help with the issue you're facing.

 

This is a Spark SQL parse-time limitation. When Spark compiles a MERGE statement, it processes the WHEN MATCHED condition before it resolves the schemas of the tables involved. At that point, t.* is just a raw token. Spark doesn't yet know what columns t contains, so it cannot expand the wildcard. The engine throws the unresolved expression error and aborts.
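
A minimal repro sketch, assuming small bronze and silver Delta tables with an id column already exist:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The wildcard in the MATCHED condition is never expanded, so analysis fails.
try:
    spark.sql("""
        MERGE INTO silver AS t
        USING bronze AS s
        ON t.id = s.id
        WHEN MATCHED AND hash(t.*) != hash(s.*) THEN UPDATE SET *
    """)
except Exception as e:
    print(e)  # [DELTA_MERGE_UNRESOLVED_EXPRESSION] ...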

 

Note: hash(s.*) works fine inside a subquery or a regular SELECT; the restriction is specific to the MERGE condition clause.
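
For example, this runs without complaint (a quick sketch against the same bronze table):

spark.sql("SELECT hash(s.*) AS row_hash FROM bronze AS s").show()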

 

The fix: a PySpark dynamic hash (no column listing required)

This generates the condition string dynamically, so you never manually list columns, and it stays maintenance-free even as the schema evolves.

The idea is to let Python expand the columns before the SQL string is ever sent to Spark. By the time Spark sees the query, every column is named explicitly: no wildcards, no error, and no hand-written column list to maintain.

 

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: fetch column list from schema — auto-updates if columns change
cols = [c for c in spark.table("bronze").columns if c != "id"]

# Step 2: build hash expressions for both sides (backticks guard unusual column names)
source_hash = f"hash({', '.join(f's.`{c}`' for c in cols)})"
target_hash = f"hash({', '.join(f't.`{c}`' for c in cols)})"

# Step 3: generate and run the MERGE
merge_sql = f"""
MERGE INTO silver AS t
USING bronze AS s
ON t.id = s.id
WHEN MATCHED AND {source_hash} != {target_hash}
  THEN UPDATE SET *
WHEN NOT MATCHED BY TARGET THEN INSERT *
WHEN NOT MATCHED BY SOURCE THEN DELETE
"""

spark.sql(merge_sql)
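
To sanity-check what will actually run, you can print the generated statement first:

print(merge_sql)  # shows the fully expanded hash(...) != hash(...) condition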

What Spark actually receives is already fully explicit, e.g. hash(s.name, s.age, s.city, ...) != hash(t.name, t.age, t.city, ...), so there are no wildcards to resolve and no error is thrown.

 

This is the cleanest approach: you get full SQL MERGE semantics with zero column maintenance. The hash covers all non-key columns automatically.

 

Why this is maintenance-free

spark.table("bronze").columns always reflects the live schema. If you add, remove, or rename columns in your Bronze table, the hash condition updates automatically on the next run. you never touch this script.

 

If you find this response helpful, kindly consider marking it as the accepted solution and giving it kudos. This helps others facing similar issues and is greatly appreciated.


3 REPLIES
ati_puri
Resolver II

Hi @lavginqo3 ,

Below is the Fabric best-practice approach:

Compute a deterministic row hash in Bronze ingestion:

CREATE OR REPLACE TABLE bronze_hashed
AS
SELECT
    *,
    xxhash64(to_json(named_struct(
        'col1', col1,
        'col2', col2,
        'col3', col3
        -- include all business columns EXCEPT metadata
    ))) AS row_hash
FROM bronze;
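
If hand-listing the business columns in that named_struct is the pain point (it was, in the original question), the same dynamic-generation trick from the accepted solution applies here too. A sketch, where ingest_ts is a hypothetical metadata column to exclude:

# Sketch: generate the named_struct(...) arguments from the live schema,
# so the persisted row_hash always covers every current business column.
exclude = {"id", "ingest_ts"}  # key column + hypothetical metadata columns
cols = [c for c in spark.table("bronze").columns if c not in exclude]
struct_args = ", ".join(f"'{c}', `{c}`" for c in cols)

spark.sql(f"""
CREATE OR REPLACE TABLE bronze_hashed AS
SELECT *,
       xxhash64(to_json(named_struct({struct_args}))) AS row_hash
FROM bronze
""")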

 

Add a persisted hash column to Silver:
ALTER TABLE silver ADD COLUMNS (row_hash BIGINT);
 
Use a hash‑based MERGE:

MERGE INTO silver AS t
USING bronze_hashed AS s
ON t.id = s.id
-- NOT (a <=> b) is null-safe, so pre-existing Silver rows whose freshly
-- added row_hash column is still NULL are picked up on the first run
WHEN MATCHED AND NOT (t.row_hash <=> s.row_hash) THEN
  UPDATE SET *
WHEN NOT MATCHED BY TARGET THEN
  INSERT *
WHEN NOT MATCHED BY SOURCE THEN
  DELETE;

 

Thanks

Ati Puri

@lavginqo3

lavginqo3
Regular Visitor

Hi @Lodha_Jaydeep ,

 

Thanks for the detailed info!

 

I have used the PySpark method on a different project and it worked.

 

I was specifically checking whether there is anything native to Spark SQL, since most of our notebooks are in Spark SQL. There is also field renaming and business logic applied during the MERGE that I prefer to keep in Spark SQL.

 

Thank you again!
