Aiswarya11
Regular Visitor

Overwrite operations on tables populated via upsert copy activities result in duplicate records

I am getting additional rows when an overwrite or merge is applied to Delta tables that were previously populated using an upsert Copy activity. These additional rows correspond to earlier versions of existing records rather than new or duplicate records. Although the source data does not contain extra rows, historical versions of records persist in the target table after the overwrite operation, resulting in more rows than expected. How can I solve this issue?

1 ACCEPTED SOLUTION
bariscihan
Resolver II

Hi,

I’ve seen similar behavior when combining Fabric Copy Upsert patterns with Delta tables, so I wanted to share a production-grade best practice pattern that has worked reliably in enterprise environments.

From my experience, the issue is usually not caused by Delta itself, but by the interaction between:

• Copy Activity Upsert behavior
• Delta logical vs physical storage model
• Overwrite semantics (table vs partition vs path)
• Delete handling strategy for incremental pipelines


🔍 How Delta Actually Behaves (Important Context)

Delta Lake stores data in two layers:

Logical Layer (Snapshot)

  • What queries return

  • Latest valid table state

Physical Layer (Files + Transaction Log)

  • Multiple parquet versions

  • Historical files kept for time travel and rollback

So operations like MERGE or OVERWRITE:
• Update the logical table state
• Do NOT immediately delete the old physical files

Physical cleanup happens only with VACUUM.
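
A quick way to confirm this is to inspect the table's history and an older snapshot directly. A minimal sketch in Spark SQL (table name `my_table` and the version number are illustrative):

```sql
-- Each MERGE/OVERWRITE adds a new version to the transaction log,
-- but the files behind older versions remain until VACUUM runs.
DESCRIBE HISTORY my_table;

-- Time travel: an older snapshot can still be queried in full,
-- which shows the old physical files are still present.
SELECT COUNT(*) FROM my_table VERSION AS OF 3;
```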

Reference:
https://learn.microsoft.com/en-us/azure/databricks/delta/vacuum
https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-and-delta-tables


⚠️ Common Anti-Pattern (What Usually Causes “Extra Rows”)

Pattern:
1️⃣ Copy Activity Upsert → Delta Table
2️⃣ Later MERGE for delete handling
3️⃣ Later Copy OVERWRITE
4️⃣ Row count mismatch appears

Root causes typically are:

• Copy Upsert does not enforce strict primary key uniqueness
• Partition-level overwrite leaves orphaned files
• Merge key mismatch or NULL handling
• Snapshot vs raw path reads
• Incremental + late arriving records
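
The first root cause is easy to verify directly. A minimal duplicate-key check in Spark SQL (table and key column names `final_table` / `pk` are illustrative):

```sql
-- Any key returned here appears more than once in the current snapshot,
-- meaning the upsert did not enforce uniqueness on the merge key.
SELECT pk, COUNT(*) AS cnt
FROM final_table
GROUP BY pk
HAVING COUNT(*) > 1;
```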


Recommended Fabric Production Pattern


🥇 Pattern 1 — Always Stage Before Merge (Golden Pattern)

Instead of:
Source → Copy Upsert → Final Table

Use:
Source → Copy Append → Staging Table → MERGE → Final Table

Why:
• Full control of merge logic
• Deterministic delete handling
• No hidden upsert behavior

Reference:
https://learn.microsoft.com/en-us/fabric/data-factory/copy-data-activity
https://learn.microsoft.com/en-us/azure/databricks/delta/merge


🥈 Pattern 2 — Deterministic MERGE Template

MERGE INTO target t
USING staging s
ON t.pk = s.pk

WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
WHEN NOT MATCHED BY SOURCE THEN DELETE

Important:
✔ Ensure PK uniqueness in staging
✔ Deduplicate staging before merge
✔ Handle NULL keys explicitly
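
The staging deduplication above can be sketched in Spark SQL before the merge runs (column names `pk`, `col1`, `col2`, and `load_ts` are illustrative assumptions):

```sql
-- Keep only the latest row per key and drop NULL keys so the merge
-- condition is deterministic. Columns are listed explicitly so the
-- helper rn column does not leak into UPDATE SET * / INSERT *.
CREATE OR REPLACE TEMP VIEW staging_dedup AS
SELECT pk, col1, col2, load_ts
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY pk ORDER BY load_ts DESC) AS rn
  FROM staging
  WHERE pk IS NOT NULL
)
WHERE rn = 1;
```

The MERGE template can then read `USING staging_dedup s` instead of the raw staging table.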


🥉 Pattern 3 — Partition Safe Overwrite (If Needed)

If overwrite is required:

• Prefer table overwrite
• Avoid raw path overwrite
• Avoid mixed partition + merge strategy
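
A table-level overwrite goes through the Delta transaction log, so the new snapshot fully replaces the old one. A minimal sketch (table and view names are illustrative):

```sql
-- Replaces the table's logical contents in one atomic commit;
-- old files remain only for time travel until VACUUM removes them.
INSERT OVERWRITE final_table
SELECT * FROM source_view;
```

This avoids the raw-path overwrite case, where files written outside the transaction log leave the snapshot and the storage layer out of sync.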

Reference:
https://learn.microsoft.com/en-us/fabric/data-engineering/author-execute-notebook


🧹 Storage Maintenance Strategy

VACUUM is still required — but only for storage cleanup:

Example:

VACUUM target_table RETAIN 168 HOURS

This:
✔ Keeps 7 days history
✔ Removes obsolete physical files

Reference:
https://learn.microsoft.com/en-us/azure/databricks/delta/vacuum


🧪 Validation Checklist I Use in Production

✔ Verify merge key uniqueness
✔ Verify snapshot vs path query differences
✔ Verify Copy overwrite scope (table vs partition)
✔ Verify incremental delete strategy
✔ Verify staging deduplication
✔ Schedule OPTIMIZE + VACUUM
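
The last checklist item can be scheduled as a simple two-step maintenance job (table name illustrative):

```sql
-- Compact the small files left behind by frequent merges.
OPTIMIZE final_table;

-- Remove data files no longer referenced by any version inside
-- the 7-day retention window.
VACUUM final_table RETAIN 168 HOURS;
```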


Enterprise-Grade Pipeline Flow (Recommended)

Source → Landing → Staging Delta
→ Deduplicate + Data Quality Checks
→ MERGE into Final Delta
→ OPTIMIZE
→ VACUUM (Retention Policy Based)


🚀 Key Takeaway

If you rely on Copy Upsert directly into production Delta tables, you may eventually see:

• Logical duplicates
• Row count drift
• Unexpected historical row persistence

The safest long-term pattern is:

👉 Append to staging
👉 Controlled MERGE into final
👉 Scheduled VACUUM for storage hygiene


Hope this helps.
If helpful, I can also share a reference architecture for incremental + delete handling + retention (7–30 days) using Fabric pipelines + Delta Lake.

Best regards


v-priyankata
Community Support

Hi @Aiswarya11 

Thank you for reaching out to the Microsoft Fabric Forum Community.

@bariscihan Thanks for the inputs.

I hope the information provided was helpful. If you still have questions, please don't hesitate to reach out to the community.

 

Hi @Aiswarya11 

Hope everything’s going smoothly on your end. I wanted to check whether the issue got sorted. If you have any other questions, please reach out to the community.


v-priyankata
Community Support

Hi @Aiswarya11 

Thank you for reaching out to the Microsoft Fabric Forum Community.

 

Your Delta Lake table is keeping old row versions from previous upserts. Overwrite or merge updates the latest data logically, but does not physically remove those older files, so the table ends up with current records plus historical versions, which makes the row count higher than the source. The fix is to ensure your merge/overwrite logic matches records correctly and then run VACUUM (optionally OPTIMIZE) to delete obsolete data files. Without VACUUM, Delta keeps accumulating old versions.

 

If there are any deviations from your expectations, please let us know; we are happy to help.

Thanks.

Hi @v-priyankata 
This is exactly the issue I am facing.

 

I am performing an upsert on OneLake Delta tables using incremental data. Since we only process incremental loads, there are scenarios where records get deleted from the source system. To handle this, I attempted to run a merge on the Delta table after obtaining a full copy of all primary keys:

MERGE INTO target_tbl
USING src_tbl_keys
ON target_tbl.keys = src_tbl_keys.keys
WHEN NOT MATCHED BY SOURCE THEN DELETE;

However, this approach is creating extra records, as you mentioned. Additionally, when I try to perform a copy overwrite on this table using the pipeline copy activity, the extra records still persist in the table. This seems unexpected, as an overwrite should technically replace the entire dataset.
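
One way to narrow this down is to check whether the extra rows are part of the current snapshot or only of older versions (table name as in the merge above):

```sql
-- If this count already matches the source, the surplus rows live only
-- in historical versions, and VACUUM will remove their files after the
-- retention window passes.
SELECT COUNT(*) FROM target_tbl;

-- Every MERGE and overwrite shows up here as a separate version.
DESCRIBE HISTORY target_tbl;
```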

 

 Could you please advise on the best approach to implement merge/overwrite logic so that records are matched and deleted correctly? We also need to retain history for at least 7 days.

 

Thanks

 
