topic Re: Hard delete on a large table with incremental load in Fabric in Data Engineering

Hard delete on a large table with incremental load in Fabric

mustaphaben — Mon, 11 May 2026 21:55:57 GMT

Hello community,

I am looking for feedback on a data deletion handling scenario in Microsoft Fabric.

We have a source table with more than 140 million rows, loaded incrementally to capture newly inserted and updated records.
The issue is that some rows are deleted in the source system through hard delete.

Context constraints:

no soft delete column;
no CDC enabled at this stage;
no technical primary key;
rows can be identified using a combination of 4 columns;
deletions need to be propagated to the Lakehouse, across both Bronze and Silver layers;
the goal is to avoid an approach that would be too expensive in terms of capacity on very large volumes.

While waiting for feedback from the source team regarding a possible CDC activation (current source is SQL Server, future target is Azure SQL), what approach would you recommend in Fabric to identify deleted rows and keep the target aligned with the source?

I would appreciate any concrete feedback, best practices, or patterns already used in this type of scenario.

Thanks in advance.

Re: Hard delete on a large table with incremental load in Fabric

Tamanchu — Tue, 12 May 2026 01:24:37 GMT

Hi @mustaphaben,

This is a tricky one, because with a classic incremental load, inserts and updates are easy… but hard deletes are invisible unless the source explicitly tells you about them (CDC, soft delete flag, tombstone table, etc.).

With 140M+ rows, I would personally avoid any "compare the full table" approach because the capacity cost can become painful very quickly in Fabric.

What I would do as an interim pattern :

Keep the normal incremental MERGE for inserts/updates
Build a deterministic business key from the 4 identifying columns (preferably a hash)
Extract only the source keys (not the full rows)
Run a periodic "delete sweep" using an anti-join

Something along these lines :

source_keys = SELECT col1, col2, col3, col4 FROM source

Then:

Silver_delta
ANTI JOIN
source_keys

The result gives you the rows that still exist in the Lakehouse but no longer exist in the source system.

Then you can propagate the delete with a Delta MERGE :

MERGE INTO silver t
USING deleted_keys d ON t.hash_key = d.hash_key
WHEN MATCHED THEN DELETE

For the key itself, I’d probably use something deterministic like :

sh(concat_ws('|', col1, col2, col3, col4), 256)

A few additional thoughts :

I would avoid scanning the entire table daily if possible
If the data can be partitioned (business date, ingestion window, etc.), only sweep relevant partitions*
Depending on governance/audit requirements, Bronze may stay append-only with delete tombstones (is_deleted) while physical deletes happen in Silver only
After large MERGEs/deletes, OPTIMIZE + V-Order can help keep Delta performance healthy in Fabric

Honestly, though, I would still push strongly for CDC on the source side (SQL Server CDC / Azure SQL CDC). That’s the real long-term solution here because deletes become actual events instead of inferred differences.

Interesting topic by the way this is one of those scenarios where the architecture decisions matter much more than the SQL itself 🙂.

Hope this helps a bit 🙂. Don’t hesitate to share your feedback or the solution you finally went with.

Docs :

Re: Hard delete on a large table with incremental load in Fabric

Olufemi7 — Tue, 12 May 2026 07:52:12 GMT

Hello @mustaphaben,

Without CDC, Change Tracking, or a delete flag, there is no efficient way to detect hard deletes incrementally on a 140M+ table.

The best option is to enable:

SQL Server CDC
or SQL Change Tracking

Until then, the common Fabric approach is:

keep incremental upserts for inserts/updates;
run a separate periodic reconciliation process for deletes;
compare source business keys with the Lakehouse using a LEFT ANTI JOIN;
delete orphaned rows from Silver (and Bronze if required).

Since you already have a 4-column business key, consider creating a hash key to simplify joins and improve Spark performance.

Also avoid full-table delete checks on every incremental load. Run reconciliation separately (for example nightly or weekly) to reduce capacity usage.

Useful docs:

About Change Tracking (SQL Server)

What is change data capture (CDC)?

Re: Hard delete on a large table with incremental load in Fabric

tayloramy — Tue, 12 May 2026 13:49:41 GMT

Hi @mustaphaben,
I would push your team hard for that CDC to be enabled. Without it, you're sort of out of luck.

In the event that CDC can't be enabled, then I would follow @Tamanchu's approach probably weekly to clean up any deleted records, however any consumers of your data would need to understand that deleted records will take a week to actually fall out of the reporting.