<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Hard delete on a large table with incremental load in Fabric in Data Engineering</title>
    <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Hard-delete-on-a-large-table-with-incremental-load-in-Fabric/m-p/5181387#M16181</link>
    <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/875336"&gt;@mustaphaben&lt;/a&gt;,&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Without CDC, Change Tracking, or a delete flag, there is no efficient way to detect hard deletes incrementally on a 140M+ table.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;The best option is to enable:&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN class=""&gt;SQL Server CDC&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN class=""&gt;or SQL Change Tracking&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Until then, the common Fabric approach is:&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN class=""&gt;keep incremental upserts for inserts/updates;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN class=""&gt;run a separate periodic reconciliation process for deletes;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN class=""&gt;compare source business keys with the Lakehouse using a LEFT ANTI JOIN;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN class=""&gt;delete orphaned rows from Silver (and Bronze if required).&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Since you already have a 4-column business key, consider creating a hash key to simplify joins and improve Spark performance.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Also avoid full-table delete checks on every incremental load. Run reconciliation separately (for example nightly or weekly) to reduce capacity usage.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Useful docs:&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;A title="About Change Tracking (SQL Server)" href="https://learn.microsoft.com/en-gb/sql/relational-databases/track-changes/about-change-tracking-sql-server?view=sql-server-ver17&amp;amp;utm_source" target="_self"&gt;About Change Tracking (SQL Server)&lt;/A&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;&amp;nbsp;&lt;A title="What is change data capture (CDC)?" href="https://learn.microsoft.com/en-gb/sql/relational-databases/track-changes/about-change-data-capture-sql-server?view=sql-server-ver17&amp;amp;utm_source" target="_self"&gt;What is change data capture (CDC)?&lt;/A&gt;&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 12 May 2026 07:52:12 GMT</pubDate>
    <dc:creator>Olufemi7</dc:creator>
    <dc:date>2026-05-12T07:52:12Z</dc:date>
    <item>
      <title>Hard delete on a large table with incremental load in Fabric</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Hard-delete-on-a-large-table-with-incremental-load-in-Fabric/m-p/5181197#M16179</link>
      <description>&lt;P&gt;Hello community,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am looking for feedback on a data deletion handling scenario in Microsoft Fabric.&lt;/P&gt;&lt;P&gt;We have a source table with more than 140 million rows, loaded incrementally to capture newly inserted and updated records.&lt;BR /&gt;The issue is that some rows are deleted in the source system through hard delete.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Context constraints:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;no soft delete column;&lt;/LI&gt;&lt;LI&gt;no CDC enabled at this stage;&lt;/LI&gt;&lt;LI&gt;no technical primary key;&lt;/LI&gt;&lt;LI&gt;rows can be identified using a combination of 4 columns;&lt;/LI&gt;&lt;LI&gt;deletions need to be propagated to the Lakehouse, across both Bronze and Silver layers;&lt;/LI&gt;&lt;LI&gt;the goal is to avoid an approach that would be too expensive in terms of capacity on very large volumes.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;While waiting for feedback from the source team regarding a possible CDC activation (current source is SQL Server, future target is Azure SQL), what approach would you recommend in Fabric to identify deleted rows and keep the target aligned with the source?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I would appreciate any concrete feedback, best practices, or patterns already used in this type of scenario.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Mon, 11 May 2026 21:55:57 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Hard-delete-on-a-large-table-with-incremental-load-in-Fabric/m-p/5181197#M16179</guid>
      <dc:creator>mustaphaben</dc:creator>
      <dc:date>2026-05-11T21:55:57Z</dc:date>
    </item>
    <item>
      <title>Re: Hard delete on a large table with incremental load in Fabric</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Hard-delete-on-a-large-table-with-incremental-load-in-Fabric/m-p/5181238#M16180</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/875336"&gt;@mustaphaben&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is a tricky one, because with a classic incremental load, inserts and updates are easy… but hard deletes are invisible unless the source explicitly tells you about them (CDC, soft delete flag, tombstone table, etc.).&lt;/P&gt;&lt;P&gt;With 140M+ rows, I would personally avoid any "compare the full table" approach because the capacity cost can become painful very quickly in Fabric.&lt;/P&gt;&lt;P&gt;What I would do as an interim pattern :&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Keep the normal incremental MERGE for inserts/updates&lt;/LI&gt;&lt;LI&gt;Build a deterministic business key from the 4 identifying columns (preferably a hash)&lt;/LI&gt;&lt;LI&gt;Extract only the source keys (not the full rows)&lt;/LI&gt;&lt;LI&gt;Run a periodic "delete sweep" using an anti-join&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Something along these lines :&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;EM&gt;source_keys = SELECT col1, col2, col3, col4 FROM source&lt;/EM&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Then:&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;EM&gt;Silver_delta&lt;/EM&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;EM&gt;ANTI JOIN&lt;/EM&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;EM&gt;source_keys&lt;/EM&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;The result gives you the rows that still exist in the Lakehouse but no longer exist in the source system.&lt;/P&gt;&lt;P&gt;Then you can propagate the delete with a Delta MERGE :&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;MERGE INTO silver t&lt;/EM&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;EM&gt;USING deleted_keys d ON t.hash_key = d.hash_key&lt;/EM&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;&lt;EM&gt;WHEN MATCHED THEN DELETE&lt;/EM&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;For the key itself, I’d probably use something deterministic like :&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;sh(concat_ws('|', col1, col2, col3, col4), 256)&lt;/EM&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;A few additional thoughts :&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;I would avoid scanning the entire table daily if possible&lt;/LI&gt;&lt;LI&gt;If the data can be partitioned (business date, ingestion window, etc.), only sweep relevant partitions*&lt;/LI&gt;&lt;LI&gt;Depending on governance/audit requirements, Bronze may stay append-only with delete tombstones (is_deleted) while physical deletes happen in Silver only&lt;/LI&gt;&lt;LI&gt;After large MERGEs/deletes, OPTIMIZE + V-Order can help keep Delta performance healthy in Fabric&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Honestly, though, I would still push strongly for CDC on the source side (SQL Server CDC / Azure SQL CDC). That’s the real long-term solution here because deletes become actual events instead of inferred differences.&lt;/P&gt;&lt;P&gt;Interesting topic by the way this is one of those scenarios where the architecture decisions matter much more than the SQL itself &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope this helps a bit &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;.&amp;nbsp;Don’t hesitate to share your feedback or the solution you finally went with.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Docs :&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-and-delta-tables" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-and-delta-tables&lt;/A&gt;&amp;nbsp;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/fabric/data-engineering/delta-optimization-and-v-order?tabs=sparksql" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/fabric/data-engineering/delta-optimization-and-v-order&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://docs.delta.io/delta-update/#upsert-into-a-table-using-merge" target="_blank" rel="noopener"&gt;https://docs.delta.io/delta-update/#upsert-into-a-table-using-merge&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/fabric/data-factory/cdc-copy-job" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/fabric/data-factory/cdc-copy-job&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://docs.delta.io/best-practices/" target="_blank" rel="noopener"&gt;https://docs.delta.io/best-practices/&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://docs.delta.io/optimizations-oss/" target="_blank" rel="noopener"&gt;https://docs.delta.io/optimizations-oss/&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Tue, 12 May 2026 01:24:37 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Hard-delete-on-a-large-table-with-incremental-load-in-Fabric/m-p/5181238#M16180</guid>
      <dc:creator>Tamanchu</dc:creator>
      <dc:date>2026-05-12T01:24:37Z</dc:date>
    </item>
    <item>
      <title>Re: Hard delete on a large table with incremental load in Fabric</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Hard-delete-on-a-large-table-with-incremental-load-in-Fabric/m-p/5181387#M16181</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/875336"&gt;@mustaphaben&lt;/a&gt;,&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Without CDC, Change Tracking, or a delete flag, there is no efficient way to detect hard deletes incrementally on a 140M+ table.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;The best option is to enable:&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN class=""&gt;SQL Server CDC&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN class=""&gt;or SQL Change Tracking&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Until then, the common Fabric approach is:&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN class=""&gt;keep incremental upserts for inserts/updates;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN class=""&gt;run a separate periodic reconciliation process for deletes;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN class=""&gt;compare source business keys with the Lakehouse using a LEFT ANTI JOIN;&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN class=""&gt;delete orphaned rows from Silver (and Bronze if required).&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Since you already have a 4-column business key, consider creating a hash key to simplify joins and improve Spark performance.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Also avoid full-table delete checks on every incremental load. Run reconciliation separately (for example nightly or weekly) to reduce capacity usage.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;Useful docs:&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;A title="About Change Tracking (SQL Server)" href="https://learn.microsoft.com/en-gb/sql/relational-databases/track-changes/about-change-tracking-sql-server?view=sql-server-ver17&amp;amp;utm_source" target="_self"&gt;About Change Tracking (SQL Server)&lt;/A&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;&amp;nbsp;&lt;A title="What is change data capture (CDC)?" href="https://learn.microsoft.com/en-gb/sql/relational-databases/track-changes/about-change-data-capture-sql-server?view=sql-server-ver17&amp;amp;utm_source" target="_self"&gt;What is change data capture (CDC)?&lt;/A&gt;&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 12 May 2026 07:52:12 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Hard-delete-on-a-large-table-with-incremental-load-in-Fabric/m-p/5181387#M16181</guid>
      <dc:creator>Olufemi7</dc:creator>
      <dc:date>2026-05-12T07:52:12Z</dc:date>
    </item>
    <item>
      <title>Re: Hard delete on a large table with incremental load in Fabric</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Hard-delete-on-a-large-table-with-incremental-load-in-Fabric/m-p/5181545#M16192</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/875336"&gt;@mustaphaben&lt;/a&gt;,&amp;nbsp;&lt;BR /&gt;I would push your team hard for that CDC to be enabled. Without it, you're sort of out of luck.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In the event that CDC can't be enabled, then I would follow&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/1357760"&gt;@Tamanchu&lt;/a&gt;'s approach probably weekly to clean up any deleted records, however any consumers of your data would need to understand that deleted records will take a week to actually fall out of the reporting.&amp;nbsp;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 12 May 2026 13:49:41 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Hard-delete-on-a-large-table-with-incremental-load-in-Fabric/m-p/5181545#M16192</guid>
      <dc:creator>tayloramy</dc:creator>
      <dc:date>2026-05-12T13:49:41Z</dc:date>
    </item>
  </channel>
</rss>

