topic Re: Update table with SQL MERGE issue duplicates: help needed with combining SQL deduplicate & u in Data Engineering

Update table with SQL MERGE issue duplicates: help needed with combining SQL deduplicate & update

Vogels — Thu, 08 Feb 2024 11:42:52 GMT

I have a really large table that I like to update with only the new or changed entries.

I have two tables (one with all data & one with the new to be updated data) and a basic code similar to (Upsert into a Delta Lake table using merge - Azure Databricks | Microsoft Learn) which ran fine for a few days. But then started to give the error: Cannot perform MERGE as multiple source rows matched and attempted to update the same target row in the Delta table.

I checked the tables and there are really no duplicates in either of them.
I completely deleted the table with the new to be updated data from the lakehouse, copied the data back into the lakehouse and then it worked again.
Therefore I think it has something to with duplicated log records, because I found this: A common ETL use case is to collect logs into Delta table by appending them to a table. However, often the sources can generate duplicate log records and downstream deduplication steps are needed to take care of them. (Data deduplication when writing into delta table - Azure Databricks | Microsoft Learn)

So I was thinking to merge our code for updating the large table with the code for deduplicating and then I should be good. However Im not so skilled in SQL yet and can't figure out how to combine them. Is someone able to help? Below our simplified code and the deduplicate example.

Simplified Code:

MERGE INTO Large_Table AS target

USING Large_Table_Previous_Days AS source

ON (target.ID = source.ID)

WHEN MATCHED THEN

UPDATE SET

target.LOAD_DATE = source.LOAD_DATE,

target.ID = source.ID,

target.Content = source.Content

WHEN NOT MATCHED THEN

INSERT (LOAD_DATE, ID, Content)

VALUES (source.LOAD_DATE, source.ID, source.Content)

Deduplicate example from microsoft learn:

MERGE INTO logs

USING newDedupedLogs

ON logs.uniqueId = newDedupedLogs.uniqueId

WHEN NOT MATCHED

THEN INSERT *

Re: Update table with SQL MERGE issue duplicates: help needed with combining SQL deduplicate & u

Anonymous — Fri, 09 Feb 2024 10:51:33 GMT

Hi @Vogels
Thanks for using Fabric Community.
You are doing right.

Instead of copying and deleting the table, perform deduplication onLarge_Table_Previous_Daysbefore merging. This ensures unique rows are fed into the MERGE statement. You can achieve this using the ROW_NUMBER function to assign a unique sequence number based on a specific order:

-- Deduplicate Large_Table_Previous_Days SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY LOAD_DATE DESC) AS row_num FROM Large_Table_Previous_Days -- Use only the first row (latest LOAD_DATE) for each ID in the merge WHERE row_num = 1

Modify your MERGE statement to use the deduplicated Large_Table_Previous_Days as the source:

-- Deduplicate Large_Table_Previous_Days SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY LOAD_DATE DESC) AS row_num FROM Large_Table_Previous_Days WHERE row_num = 1 -- Merge deduplicated data into Large_Table MERGE INTO Large_Table AS target USING ( -- Deduplicated result from the previous step ) AS source ON (target.ID = source.ID) WHEN MATCHED THEN UPDATE SET target.LOAD_DATE = source.LOAD_DATE, target.ID = source.ID, target.Content = source.Content WHEN NOT MATCHED THEN INSERT (LOAD_DATE, ID, Content) VALUES (source.LOAD_DATE, source.ID, source.Content)

Replace Large_Table, Large_Table_Previous_Days, and the column names (ID, LOAD_DATE, Content) with your actual table and column names.

This code performs the following steps:

Deduplicates Large_Table_Previous_Days:
- It assigns a unique row_num to each row within each group of the same ID, ordered by descending LOAD_DATE.
- Only the first row (latest LOAD_DATE) for each ID is kept, ensuring unique data.
Merges deduplicated data:
- The deduplicated result is used as the source in the MERGE statement.
- ON clause matches rows based on the same ID in both tables.
- WHEN MATCHED updates existing rows in Large_Table.
- WHEN NOT MATCHED inserts new rows from the source.

This combined code should prevent duplicate row errors and ensure only new or changed entries are updated in your large table.

While this is one approach , if you want to perform an upsert (update or insert) operation on a table, you can use Dataflow Gen2 here. Incrementally amassing data in a data destination requires a technique to load only new or updated data into your data destination. This technique can be done by using a query to filter the data based on the data destination. The source will be the Large_Table_Previous_Days and the target table will be Large_Table. For more information refer to this document:
https://learn.microsoft.com/en-us/fabric/data-factory/tutorial-setup-incremental-refresh-with-dataflows-gen2

Hope this helps. Please let us know if you have any further queries.

Re: Update table with SQL MERGE issue duplicates: help needed with combining SQL deduplicate & u

Anonymous — Tue, 13 Feb 2024 04:18:22 GMT

Hi @Vogels
We haven’t heard from you on the last response and was just checking back to see if you have a resolution yet. Otherwise, will respond back with the more details and we will try to help.
Thanks

Re: Update table with SQL MERGE issue duplicates: help needed with combining SQL deduplicate & u

Anonymous — Wed, 14 Feb 2024 16:52:11 GMT

Re: Update table with SQL MERGE issue duplicates: help needed with combining SQL deduplicate & u

Vogels — Mon, 19 Feb 2024 10:31:56 GMT

Hi @Anonymous,

Thank you very much for the detailed answer. It works perfectly now.