lavginqo3
Regular Visitor

How to perform a MERGE using Spark SQL without listing every column or using "blind updates"

Hello,

 

I’m trying to sync a Bronze table to a Silver table in a Lakehouse. My source is a full overwrite, so I need to handle inserts, updates, and deletes (specifically records that "vanish" from the source).

 

The Setup: I tried the following standard MERGE logic:

 

SQL

MERGE INTO silver AS t
USING bronze AS s
ON t.id = s.id
WHEN MATCHED AND <condition_to_prevent_blind_update> THEN
UPDATE SET *
WHEN NOT MATCHED BY TARGET THEN
INSERT *
WHEN NOT MATCHED BY SOURCE THEN
DELETE


The Goal: I want to avoid blind updates (updating rows where data hasn't actually changed) to save on compute and time. This is the whole purpose of my optimization, as the table is quite wide.

 

The Problem:

I do not want to manually list 50+ columns in the WHEN MATCHED condition for maintenance reasons.

I tried hash(t.*) != hash(s.*), but Spark SQL does not allow the * wildcard expansion for the target table (t) inside the MERGE condition. It throws a [DELTA_MERGE_UNRESOLVED_EXPRESSION] error.

Question: Is there a straightforward way in Spark SQL to compare the entire row of the target (t) and source (s) without explicitly naming every column? Or is the only way to save on compute/time to use a PySpark wrapper that dynamically generates the hash/column string?

 

Thanks!

1 ACCEPTED SOLUTION
Lodha_Jaydeep
Solution Supplier

Hi @lavginqo3 ,
Thanks for reaching out to the Fabric community. Happy to help with the issue you're facing.

 

This is a Spark SQL parse-time limitation. When Spark compiles a MERGE statement, it processes the WHEN MATCHED condition before it resolves the schemas of the tables involved. At that point, t.* is just a raw token. Spark doesn't yet know what columns t contains, so it cannot expand the wildcard. The engine throws the unresolved expression error and aborts.
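
A minimal repro sketch, assuming small bronze and silver Delta tables with an id column already exist:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The wildcard in the MATCHED condition is never expanded, so analysis fails.
try:
    spark.sql("""
        MERGE INTO silver AS t
        USING bronze AS s
        ON t.id = s.id
        WHEN MATCHED AND hash(t.*) != hash(s.*) THEN UPDATE SET *
    """)
except Exception as e:
    print(e)  # [DELTA_MERGE_UNRESOLVED_EXPRESSION] ...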

 

Note: hash(s.*) works fine inside a subquery or a regular SELECT; the restriction is specific to the MERGE condition clause.
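
For example, this runs without complaint (a quick sketch against the same bronze table):

spark.sql("SELECT hash(s.*) AS row_hash FROM bronze AS s").show()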

 

The fix: a PySpark dynamic hash (no column listing required)

This generates the condition string dynamically, so you never manually list columns, and it stays maintenance-free even as the schema evolves.

The idea is to let Python expand the columns before the SQL string is ever sent to Spark. By the time Spark sees the query, every column is named explicitly: no wildcards, no error, and no hand-written column list to maintain.

 

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: fetch column list from schema — auto-updates if columns change
cols = [c for c in spark.table("bronze").columns if c != "id"]

# Step 2: build hash expressions for both sides (backticks guard unusual column names)
source_hash = f"hash({', '.join(f's.`{c}`' for c in cols)})"
target_hash = f"hash({', '.join(f't.`{c}`' for c in cols)})"

# Step 3: generate and run the MERGE
merge_sql = f"""
MERGE INTO silver AS t
USING bronze AS s
ON t.id = s.id
WHEN MATCHED AND {source_hash} != {target_hash}
  THEN UPDATE SET *
WHEN NOT MATCHED BY TARGET THEN INSERT *
WHEN NOT MATCHED BY SOURCE THEN DELETE
"""

spark.sql(merge_sql)
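
To sanity-check what will actually run, you can print the generated statement first:

print(merge_sql)  # shows the fully expanded hash(...) != hash(...) condition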

What Spark actually receives is already fully explicit, e.g. hash(s.name, s.age, s.city, ...) != hash(t.name, t.age, t.city, ...), so there are no wildcards to resolve and no error is thrown.

 

This is the cleanest approach: you get full SQL MERGE semantics with zero column maintenance. The hash covers all non-key columns automatically.

 

Why this is maintenance-free

spark.table("bronze").columns always reflects the live schema. If you add, remove, or rename columns in your Bronze table, the hash condition updates automatically on the next run. you never touch this script.

 

If you find this response helpful, kindly consider marking it as the accepted solution and giving it kudos. This helps others facing similar issues and is greatly appreciated.


3 REPLIES
ati_puri
Resolver II

Hi @lavginqo3 ,

Below is the Fabric best-practice approach:

Compute a deterministic row hash in Bronze ingestion:

CREATE OR REPLACE TABLE bronze_hashed
AS
SELECT
    *,
    xxhash64(to_json(named_struct(
        'col1', col1,
        'col2', col2,
        'col3', col3
        -- include all business columns EXCEPT metadata
    ))) AS row_hash
FROM bronze;
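
If hand-listing the business columns in that named_struct is the pain point (it was, in the original question), the same dynamic-generation trick from the accepted solution applies here too. A sketch, where ingest_ts is a hypothetical metadata column to exclude:

# Sketch: generate the named_struct(...) arguments from the live schema,
# so the persisted row_hash always covers every current business column.
exclude = {"id", "ingest_ts"}  # key column + hypothetical metadata columns
cols = [c for c in spark.table("bronze").columns if c not in exclude]
struct_args = ", ".join(f"'{c}', `{c}`" for c in cols)

spark.sql(f"""
CREATE OR REPLACE TABLE bronze_hashed AS
SELECT *,
       xxhash64(to_json(named_struct({struct_args}))) AS row_hash
FROM bronze
""")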

 

Add a persisted hash column to Silver:
ALTER TABLE silver ADD COLUMNS (row_hash BIGINT);
 
Use a hash‑based MERGE:

MERGE INTO silver AS t
USING bronze_hashed AS s
ON t.id = s.id
-- NOT (a <=> b) is null-safe, so pre-existing Silver rows whose freshly
-- added row_hash column is still NULL are picked up on the first run
WHEN MATCHED AND NOT (t.row_hash <=> s.row_hash) THEN
  UPDATE SET *
WHEN NOT MATCHED BY TARGET THEN
  INSERT *
WHEN NOT MATCHED BY SOURCE THEN
  DELETE;

 

Thanks

Ati Puri

@lavginqo3

lavginqo3
Regular Visitor

Hi @Lodha_Jaydeep ,

 

Thanks for the detailed info!

 

I have used the PySpark method on a different project and it worked.

 

I was specifically checking whether there is anything native to Spark SQL, since most of our notebooks are in Spark SQL. There is also field renaming and business logic applied during the MERGE that I prefer to keep in Spark SQL.

 

Thank you again!
