I am getting additional rows in tables when overwrite or merge is applied to Delta tables that were previously populated using an upsert Copy activity. These additional rows correspond to earlier versions of existing records rather than new or duplicate records. Although the source data does not contain extra rows, historical versions of records persist in the target table after the overwrite operation, resulting in more rows than expected. How can I solve this issue?
Hi,
I’ve seen similar behavior when combining Fabric Copy Upsert patterns with Delta tables, so I wanted to share a production-grade best practice pattern that has worked reliably in enterprise environments.
From my experience, the issue is usually not caused by Delta itself, but by the interaction between:
• Copy Activity Upsert behavior
• Delta logical vs physical storage model
• Overwrite semantics (table vs partition vs path)
• Delete handling strategy for incremental pipelines
Delta Lake stores data in two layers:
• Logical layer (snapshot): what queries return — the latest valid table state
• Physical layer (files + transaction log): multiple Parquet file versions, with historical files kept for time travel and rollback
So operations like MERGE or OVERWRITE:
✔ Update logical table state
❌ Do NOT immediately delete physical old files
Physical cleanup happens only with VACUUM.
Reference:
https://learn.microsoft.com/en-us/azure/databricks/delta/delta-vacuum
https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-and-delta-tables
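You can confirm this behavior directly with Delta's history and time travel. A quick check from a Fabric notebook (Spark SQL; `final_table` is a placeholder name, not from the original post):

```sql
-- Every MERGE / OVERWRITE creates a new table version;
-- the older data files stay on storage until VACUUM removes them
DESCRIBE HISTORY final_table;

-- Compare the current snapshot with an earlier version of the same table
SELECT COUNT(*) FROM final_table;
SELECT COUNT(*) FROM final_table VERSION AS OF 0;
```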
Pattern:
1️⃣ Copy Activity Upsert → Delta Table
2️⃣ Later MERGE for delete handling
3️⃣ Later Copy OVERWRITE
4️⃣ Row count mismatch appears
Root causes typically are:
• Copy Upsert does not enforce strict primary key uniqueness
• Partition-level overwrite leaves orphaned files
• Merge key mismatch or NULL handling
• Snapshot vs raw path reads
• Incremental + late arriving records
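Before changing the pipeline, it helps to verify which root cause applies. A minimal duplicate check against the current snapshot (`final_table` and `pk` are placeholder names):

```sql
-- Keys appearing more than once in the current snapshot are logical duplicates
SELECT pk, COUNT(*) AS cnt
FROM final_table
GROUP BY pk
HAVING COUNT(*) > 1;
```

If this returns rows, the merge key is not unique in the snapshot itself; if it returns nothing, the extra rows are more likely coming from how the table or path is being read.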
Instead of:
Source → Copy Upsert → Final Table
Use:
Source → Copy Append → Staging Table → MERGE → Final Table
Why:
• Full control of merge logic
• Deterministic delete handling
• No hidden upsert behavior
Reference:
https://learn.microsoft.com/en-us/fabric/data-factory/copy-data-activity
https://learn.microsoft.com/en-us/azure/databricks/delta/merge
MERGE INTO target t
USING staging s
  ON t.pk = s.pk
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
WHEN NOT MATCHED BY SOURCE THEN DELETE;
Important:
✔ Ensure PK uniqueness in staging
✔ Deduplicate staging before merge
✔ Handle NULL keys explicitly
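The deduplication step above can be sketched as follows (assuming `pk` is the merge key and `load_ts` an ingestion timestamp; both names are placeholders):

```sql
-- Keep only the newest row per key, and drop NULL keys, before the MERGE
CREATE OR REPLACE TEMPORARY VIEW staging_dedup AS
SELECT * FROM (
  SELECT s.*,
         ROW_NUMBER() OVER (PARTITION BY pk ORDER BY load_ts DESC) AS rn
  FROM staging s
  WHERE pk IS NOT NULL
)
WHERE rn = 1;
```

Note that the helper column `rn` should be excluded (or the columns listed explicitly) if the MERGE uses `INSERT *`.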
If overwrite is required:
• Prefer table overwrite
• Avoid raw path overwrite
• Avoid mixed partition + merge strategy
Reference:
https://learn.microsoft.com/en-us/fabric/data-engineering/author-execute-notebook
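If a full refresh really is needed, a table-level overwrite keeps the operation transactional instead of touching the raw files. A sketch (table names are placeholders):

```sql
-- Replaces the entire logical table state in one Delta transaction;
-- the old data files remain on storage until VACUUM removes them
INSERT OVERWRITE final_table
SELECT * FROM staging;
```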
VACUUM is still required — but only for storage cleanup:
Example:
VACUUM my_table RETAIN 168 HOURS;
This:
✔ Keeps 7 days history
✔ Removes obsolete physical files
Reference:
https://learn.microsoft.com/en-us/azure/databricks/delta/vacuum
✔ Verify merge key uniqueness
✔ Verify snapshot vs path query differences
✔ Verify Copy overwrite scope (table vs partition)
✔ Verify incremental delete strategy
✔ Verify staging deduplication
✔ Schedule OPTIMIZE + VACUUM
Source → Landing → Staging Delta
→ Deduplicate + Data Quality Checks
→ MERGE into Final Delta
→ OPTIMIZE
→ VACUUM (Retention Policy Based)
If you rely on Copy Upsert directly into production Delta tables, you may eventually see:
• Logical duplicates
• Row count drift
• Unexpected historical row persistence
The safest long-term pattern is:
👉 Append to staging
👉 Controlled MERGE into final
👉 Scheduled VACUUM for storage hygiene
Hope this helps.
If helpful, I can also share a reference architecture for incremental + delete handling + retention (7–30 days) using Fabric pipelines + Delta Lake.
Best regards
Hi @Aiswarya11
Thank you for reaching out to the Microsoft Fabric Forum Community.
@bariscihan Thanks for the inputs.
I hope the information provided was helpful. If you still have questions, please don't hesitate to reach out to the community.
Hi @Aiswarya11
Hope everything's going smoothly on your end. I wanted to check whether the issue got sorted. If you have any other questions, please reach out to the community.
Hi @Aiswarya11
Thank you for reaching out to the Microsoft Fabric Forum Community.
Your Delta Lake table is keeping old row versions from previous upserts. Overwrite or merge updates the latest data logically, but doesn't physically remove the older files, so the table ends up with current records plus historical versions, which makes the row count higher than the source. The fix is to ensure your merge/overwrite logic matches records correctly and then run VACUUM (optionally OPTIMIZE) to delete obsolete data files. Without VACUUM, Delta keeps accumulating old versions.
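A minimal sketch of that maintenance step, assuming a table named `final_table` (a placeholder) and a 7-day retention requirement (168 hours):

```sql
-- Compact small files, then delete data files that are no longer
-- referenced by any table version within the last 7 days
OPTIMIZE final_table;
VACUUM final_table RETAIN 168 HOURS;
```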
If there are any deviations from your expectation please let us know we are happy to address.
Thanks.
Hi @v-priyankata
This is exactly the issue I am facing.
I am performing an upsert on OneLake Delta tables using incremental data. Since we only process incremental loads, there are scenarios where records get deleted from the source system. To handle this, I attempted to run a merge on the Delta table after obtaining a full copy of all primary keys:
MERGE INTO target_tbl
USING src_tbl_keys
ON target_tbl.keys = src_tbl_keys.keys
WHEN NOT MATCHED BY SOURCE THEN DELETE;

However, this approach is creating extra records, as you mentioned. Additionally, when I try to perform a copy overwrite on this table using the pipeline Copy activity, the extra records still persist in the table. This seems unexpected, as an overwrite should technically replace the entire dataset.
Could you please advise on the best approach to implement merge/overwrite logic so that records are matched and deleted correctly? We also need to retain history for at least 7 days.
Thanks