Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Data Days is here! Join us now for 60+ days of learning, challenges, and connection. Learn more

Reply
mustaphaben
Advocate I
Advocate I

Hard delete on a large table with incremental load in Fabric

Hello community,

 

I am looking for feedback on a data deletion handling scenario in Microsoft Fabric.

We have a source table with more than 140 million rows, loaded incrementally to capture newly inserted and updated records.
The issue is that some rows are deleted in the source system through hard delete.

 

Context constraints:

  • no soft delete column;
  • no CDC enabled at this stage;
  • no technical primary key;
  • rows can be identified using a combination of 4 columns;
  • deletions need to be propagated to the Lakehouse, across both Bronze and Silver layers;
  • the goal is to avoid an approach that would be too expensive in terms of capacity on very large volumes.

 

While waiting for feedback from the source team regarding a possible CDC activation (current source is SQL Server, future target is Azure SQL), what approach would you recommend in Fabric to identify deleted rows and keep the target aligned with the source?

 

I would appreciate any concrete feedback, best practices, or patterns already used in this type of scenario.

 

Thanks in advance.

3 ACCEPTED SOLUTIONS
Tamanchu
Impactful Individual
Impactful Individual

Hi @mustaphaben,

 

This is a tricky one, because with a classic incremental load, inserts and updates are easy… but hard deletes are invisible unless the source explicitly tells you about them (CDC, soft delete flag, tombstone table, etc.).

With 140M+ rows, I would personally avoid any "compare the full table" approach because the capacity cost can become painful very quickly in Fabric.

What I would do as an interim pattern :

  • Keep the normal incremental MERGE for inserts/updates
  • Build a deterministic business key from the 4 identifying columns (preferably a hash)
  • Extract only the source keys (not the full rows)
  • Run a periodic "delete sweep" using an anti-join

Something along these lines :


source_keys = SELECT col1, col2, col3, col4 FROM source

Then:


Silver_delta
ANTI JOIN
source_keys

The result gives you the rows that still exist in the Lakehouse but no longer exist in the source system.

Then you can propagate the delete with a Delta MERGE :

 

MERGE INTO silver t
USING deleted_keys d ON t.hash_key = d.hash_key
WHEN MATCHED THEN DELETE

For the key itself, I’d probably use something deterministic like :

 

sh(concat_ws('|', col1, col2, col3, col4), 256)

A few additional thoughts :

  • I would avoid scanning the entire table daily if possible
  • If the data can be partitioned (business date, ingestion window, etc.), only sweep relevant partitions*
  • Depending on governance/audit requirements, Bronze may stay append-only with delete tombstones (is_deleted) while physical deletes happen in Silver only
  • After large MERGEs/deletes, OPTIMIZE + V-Order can help keep Delta performance healthy in Fabric

Honestly, though, I would still push strongly for CDC on the source side (SQL Server CDC / Azure SQL CDC). That’s the real long-term solution here because deletes become actual events instead of inferred differences.

Interesting topic by the way this is one of those scenarios where the architecture decisions matter much more than the SQL itself 🙂.

 

Hope this helps a bit 🙂. Don’t hesitate to share your feedback or the solution you finally went with.

 

Docs :

View solution in original post

Olufemi7
Solution Sage
Solution Sage

Hello @mustaphaben,

Without CDC, Change Tracking, or a delete flag, there is no efficient way to detect hard deletes incrementally on a 140M+ table.

The best option is to enable:

  • SQL Server CDC
  • or SQL Change Tracking

Until then, the common Fabric approach is:

  • keep incremental upserts for inserts/updates;
  • run a separate periodic reconciliation process for deletes;
  • compare source business keys with the Lakehouse using a LEFT ANTI JOIN;
  • delete orphaned rows from Silver (and Bronze if required).

Since you already have a 4-column business key, consider creating a hash key to simplify joins and improve Spark performance.

Also avoid full-table delete checks on every incremental load. Run reconciliation separately (for example nightly or weekly) to reduce capacity usage.

Useful docs:

About Change Tracking (SQL Server)

 What is change data capture (CDC)? 

 

View solution in original post

tayloramy
Super User
Super User

Hi @mustaphaben
I would push your team hard for that CDC to be enabled. Without it, you're sort of out of luck. 

In the event that CDC can't be enabled, then I would follow @Tamanchu's approach probably weekly to clean up any deleted records, however any consumers of your data would need to understand that deleted records will take a week to actually fall out of the reporting.  





If you found this helpful, consider giving some Kudos.
If I answered your question or solved your problem, mark this post as the solution!

Join the Fabric Discord!

Proud to be a Super User!





View solution in original post

3 REPLIES 3
tayloramy
Super User
Super User

Hi @mustaphaben
I would push your team hard for that CDC to be enabled. Without it, you're sort of out of luck. 

In the event that CDC can't be enabled, then I would follow @Tamanchu's approach probably weekly to clean up any deleted records, however any consumers of your data would need to understand that deleted records will take a week to actually fall out of the reporting.  





If you found this helpful, consider giving some Kudos.
If I answered your question or solved your problem, mark this post as the solution!

Join the Fabric Discord!

Proud to be a Super User!





Olufemi7
Solution Sage
Solution Sage

Hello @mustaphaben,

Without CDC, Change Tracking, or a delete flag, there is no efficient way to detect hard deletes incrementally on a 140M+ table.

The best option is to enable:

  • SQL Server CDC
  • or SQL Change Tracking

Until then, the common Fabric approach is:

  • keep incremental upserts for inserts/updates;
  • run a separate periodic reconciliation process for deletes;
  • compare source business keys with the Lakehouse using a LEFT ANTI JOIN;
  • delete orphaned rows from Silver (and Bronze if required).

Since you already have a 4-column business key, consider creating a hash key to simplify joins and improve Spark performance.

Also avoid full-table delete checks on every incremental load. Run reconciliation separately (for example nightly or weekly) to reduce capacity usage.

Useful docs:

About Change Tracking (SQL Server)

 What is change data capture (CDC)? 

 

Tamanchu
Impactful Individual
Impactful Individual

Hi @mustaphaben,

 

This is a tricky one, because with a classic incremental load, inserts and updates are easy… but hard deletes are invisible unless the source explicitly tells you about them (CDC, soft delete flag, tombstone table, etc.).

With 140M+ rows, I would personally avoid any "compare the full table" approach because the capacity cost can become painful very quickly in Fabric.

What I would do as an interim pattern :

  • Keep the normal incremental MERGE for inserts/updates
  • Build a deterministic business key from the 4 identifying columns (preferably a hash)
  • Extract only the source keys (not the full rows)
  • Run a periodic "delete sweep" using an anti-join

Something along these lines :


source_keys = SELECT col1, col2, col3, col4 FROM source

Then:


Silver_delta
ANTI JOIN
source_keys

The result gives you the rows that still exist in the Lakehouse but no longer exist in the source system.

Then you can propagate the delete with a Delta MERGE :

 

MERGE INTO silver t
USING deleted_keys d ON t.hash_key = d.hash_key
WHEN MATCHED THEN DELETE

For the key itself, I’d probably use something deterministic like :

 

sh(concat_ws('|', col1, col2, col3, col4), 256)

A few additional thoughts :

  • I would avoid scanning the entire table daily if possible
  • If the data can be partitioned (business date, ingestion window, etc.), only sweep relevant partitions*
  • Depending on governance/audit requirements, Bronze may stay append-only with delete tombstones (is_deleted) while physical deletes happen in Silver only
  • After large MERGEs/deletes, OPTIMIZE + V-Order can help keep Delta performance healthy in Fabric

Honestly, though, I would still push strongly for CDC on the source side (SQL Server CDC / Azure SQL CDC). That’s the real long-term solution here because deletes become actual events instead of inferred differences.

Interesting topic by the way this is one of those scenarios where the architecture decisions matter much more than the SQL itself 🙂.

 

Hope this helps a bit 🙂. Don’t hesitate to share your feedback or the solution you finally went with.

 

Docs :

Helpful resources

Announcements
Fabric Data Days is here Carousel

Fabric Data Days 2026

Don't miss out on Data Days, June 15 through August 7. Learn Fabric, Power BI, SQL, AI and more.

June Fabric Update Carousel

Fabric Monthly Update - June 2026

Check out the June 2026 Fabric update to learn about new features.