Get certified for free when you join Fabric Data Days 2026 and dive into Fabric, Power BI, SQL, AI, and other essential data skills.
Join nowData Days is here! Join us now for 60+ days of learning, challenges, and connection. Learn more
Hello community,
I am looking for feedback on a data deletion handling scenario in Microsoft Fabric.
We have a source table with more than 140 million rows, loaded incrementally to capture newly inserted and updated records.
The issue is that some rows are deleted in the source system through hard delete.
Context constraints:
While waiting for feedback from the source team regarding a possible CDC activation (current source is SQL Server, future target is Azure SQL), what approach would you recommend in Fabric to identify deleted rows and keep the target aligned with the source?
I would appreciate any concrete feedback, best practices, or patterns already used in this type of scenario.
Thanks in advance.
Solved! Go to Solution.
Hi @mustaphaben,
This is a tricky one, because with a classic incremental load, inserts and updates are easy… but hard deletes are invisible unless the source explicitly tells you about them (CDC, soft delete flag, tombstone table, etc.).
With 140M+ rows, I would personally avoid any "compare the full table" approach because the capacity cost can become painful very quickly in Fabric.
What I would do as an interim pattern :
Something along these lines :
source_keys = SELECT col1, col2, col3, col4 FROM source
Then:
Silver_delta
ANTI JOIN
source_keys
The result gives you the rows that still exist in the Lakehouse but no longer exist in the source system.
Then you can propagate the delete with a Delta MERGE :
MERGE INTO silver t
USING deleted_keys d ON t.hash_key = d.hash_key
WHEN MATCHED THEN DELETE
For the key itself, I’d probably use something deterministic like :
sh(concat_ws('|', col1, col2, col3, col4), 256)
A few additional thoughts :
Honestly, though, I would still push strongly for CDC on the source side (SQL Server CDC / Azure SQL CDC). That’s the real long-term solution here because deletes become actual events instead of inferred differences.
Interesting topic by the way this is one of those scenarios where the architecture decisions matter much more than the SQL itself 🙂.
Hope this helps a bit 🙂. Don’t hesitate to share your feedback or the solution you finally went with.
Docs :
Hello @mustaphaben,
Without CDC, Change Tracking, or a delete flag, there is no efficient way to detect hard deletes incrementally on a 140M+ table.
The best option is to enable:
Until then, the common Fabric approach is:
Since you already have a 4-column business key, consider creating a hash key to simplify joins and improve Spark performance.
Also avoid full-table delete checks on every incremental load. Run reconciliation separately (for example nightly or weekly) to reduce capacity usage.
Useful docs:
About Change Tracking (SQL Server)
What is change data capture (CDC)?
Hi @mustaphaben,
I would push your team hard for that CDC to be enabled. Without it, you're sort of out of luck.
In the event that CDC can't be enabled, then I would follow @Tamanchu's approach probably weekly to clean up any deleted records, however any consumers of your data would need to understand that deleted records will take a week to actually fall out of the reporting.
Proud to be a Super User! | |
Hi @mustaphaben,
I would push your team hard for that CDC to be enabled. Without it, you're sort of out of luck.
In the event that CDC can't be enabled, then I would follow @Tamanchu's approach probably weekly to clean up any deleted records, however any consumers of your data would need to understand that deleted records will take a week to actually fall out of the reporting.
Proud to be a Super User! | |
Hello @mustaphaben,
Without CDC, Change Tracking, or a delete flag, there is no efficient way to detect hard deletes incrementally on a 140M+ table.
The best option is to enable:
Until then, the common Fabric approach is:
Since you already have a 4-column business key, consider creating a hash key to simplify joins and improve Spark performance.
Also avoid full-table delete checks on every incremental load. Run reconciliation separately (for example nightly or weekly) to reduce capacity usage.
Useful docs:
About Change Tracking (SQL Server)
What is change data capture (CDC)?
Hi @mustaphaben,
This is a tricky one, because with a classic incremental load, inserts and updates are easy… but hard deletes are invisible unless the source explicitly tells you about them (CDC, soft delete flag, tombstone table, etc.).
With 140M+ rows, I would personally avoid any "compare the full table" approach because the capacity cost can become painful very quickly in Fabric.
What I would do as an interim pattern :
Something along these lines :
source_keys = SELECT col1, col2, col3, col4 FROM source
Then:
Silver_delta
ANTI JOIN
source_keys
The result gives you the rows that still exist in the Lakehouse but no longer exist in the source system.
Then you can propagate the delete with a Delta MERGE :
MERGE INTO silver t
USING deleted_keys d ON t.hash_key = d.hash_key
WHEN MATCHED THEN DELETE
For the key itself, I’d probably use something deterministic like :
sh(concat_ws('|', col1, col2, col3, col4), 256)
A few additional thoughts :
Honestly, though, I would still push strongly for CDC on the source side (SQL Server CDC / Azure SQL CDC). That’s the real long-term solution here because deletes become actual events instead of inferred differences.
Interesting topic by the way this is one of those scenarios where the architecture decisions matter much more than the SQL itself 🙂.
Hope this helps a bit 🙂. Don’t hesitate to share your feedback or the solution you finally went with.
Docs :