
ToddChitt
Super User

Duplicate rows via a staging table

This solution involves quite a few different technologies and services from Microsoft, but I thought this was the most appropriate forum for the question.

I have an on-premises source system that is SQL Server.

I have a Fabric Warehouse where I wish to have an exact copy of the source.

This solution was put in place long before both the Azure SQL Database in Fabric and the Copy Job. The client does NOT wish to have us re-work the solution, as it would be an extensive effort.

The current solution uses Azure Data Factory (NOT Data Factory in Fabric) and the metadata-driven copy task (wizard) to move the data. There is a Control table (in an Azure SQL DB, NOT Fabric). For a FULL load, the destination table in the Warehouse is truncated and re-loaded. For a DELTA load, changed data, according to a watermark column, is loaded into a staging table in a Fabric Lakehouse. Then a stored procedure performs a manual UPSERT: it runs an UPDATE against the destination table based on a match on the KEY column, then runs an INSERT from the staging table into the destination table, again joining on the key column, WHERE the row is not already present in the destination. Basically it looks like this:

 

ToddChitt_0-1738079908647.png
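For anyone who can't see the screenshot, here is a rough sketch of what the stored procedure does. All object names are placeholders, not the actual tables:

-- Step 1: update rows whose key already exists in the destination
UPDATE d
SET    d.Col1 = s.Col1,
       d.Col2 = s.Col2
FROM   dbo.DestinationTable AS d
JOIN   MyLakehouse.dbo.StagingTable AS s
       ON d.KeyColumn = s.KeyColumn;

-- Step 2: insert staging rows whose key is not yet present in the destination
INSERT INTO dbo.DestinationTable (KeyColumn, Col1, Col2)
SELECT s.KeyColumn, s.Col1, s.Col2
FROM   MyLakehouse.dbo.StagingTable AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.DestinationTable AS d
                   WHERE  d.KeyColumn = s.KeyColumn);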

The problem is that every once in a while (like every couple of weeks) we get several thousand duplicated rows in the destination table in the Warehouse. In this last case, there were no duplicated rows in the Lakehouse staging table. So how am I getting duplicated rows in the destination table?

 

Any ideas would be helpful.

Thanks in advance.









12 REPLIES
ToddChitt
Super User

I can say without any doubt that there is NO overlap between the FULL and DELTA loads. The former happens ONLY on Sunday mornings (and takes about 90 minutes) and the DELTA happens Monday through Saturday evenings (and takes about 20 to 30 minutes).









LewisData
New Member

I have the exact same issue occurring in my Fabric environment. In my case I am using a Fabric Warehouse, so there is no key constraint present in the source/destination. The stored procedure does largely the same thing as yours but also does a ROW_NUMBER() operation that increments when a new row is added, which happens only when the key is missing via WHERE NOT EXISTS. When I group by this key I get 2x of everything!
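A rough sketch of that pattern, using placeholder names rather than my actual warehouse objects:

-- Assign an incrementing surrogate key only to rows whose business key is not already present
INSERT INTO dbo.FactTable (SurrogateKey, BusinessKey, Amount)
SELECT  ISNULL((SELECT MAX(SurrogateKey) FROM dbo.FactTable), 0)
          + ROW_NUMBER() OVER (ORDER BY s.BusinessKey),
        s.BusinessKey,
        s.Amount
FROM    dbo.StagingTable AS s
WHERE   NOT EXISTS (SELECT 1
                    FROM   dbo.FactTable AS f
                    WHERE  f.BusinessKey = s.BusinessKey);

-- If this runs twice before the first run's rows are committed, the NOT EXISTS
-- filter passes both times and every key ends up inserted twice.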


The conclusion must surely be that the stored procedure somehow ran twice within a short time-frame and then inserted both sets of data even though they were duplicates. This is evident from the fact that this merge proc is quite heavy on that initial load and takes over 10 minutes, yet on the table the duplicated inserts are within 5 minutes of each other.

 

Notably, in my case this only happens on the initial load of the biggest table (over 100 million rows) in the Data Warehouse; after that point the process appears to behave itself and only inserts one set of data.

Upon further inspection, I think I have found the root cause of my issue at least: by an unhappy coincidence, my full load collided with a scheduled refresh. I would double- and triple-check that your FULL load isn't also running at the same time as your DELTAs, or that another scheduled process isn't being kicked off at the same time somewhere else.

 

Will do a full reload tonight and just double check that this is what caused it on my end.

ToddChitt
Super User

Hello @v-vpabbu and thank you for the support link. I will keep it in mind.

I can verify that the FULL and DELTA loads are NOT overlapping. FULL happens on Sundays, DELTA happens Monday through Saturday. Times are roughly 1 hour and 20 minutes, respectively, with multiple tables involved in each load.

 

I think the thing to do is set up an additional staging table in the Lakehouse, an additional destination table in the Warehouse, 2 additional rows in the Control Table (one for the FULL load and one for the DELTA load), and a totally new set of pipelines to handle just this one-off set of tables. That way, if I get duplicates, I can halt any additional loading until Microsoft Support can examine the Run ID of the pipelines that generated the duplicates.

If I get a resolution, I will try to post back here, but no promises.









v-vpabbu
Community Support

Hi @ToddChitt,

 

Thank you for reaching out to Microsoft Fabric Community Forum.

 

Verify whether the FULL and INCREMENTAL loads are overlapping or interfering with each other. If the FULL load takes longer than expected, an INCREMENTAL load might begin while the table is still being truncated and reloaded, potentially causing duplicates.

If there is no overlap and the issue persists, I recommend raising a Microsoft Fabric support ticket for further investigation.

How to create a Fabric and Power BI Support ticket - Power BI | Microsoft Learn

 

Regards,

Vinay Pabbu

ToddChitt
Super User

I know that in Microsoft SQL Server, NULL <> NULL (Meaning: If you have a value of NULL in BOTH corresponding fields of a JOIN, you will NOT get a match.) I am sort of assuming that Fabric follows that same rule.
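A quick illustration of that rule (assuming the default ANSI_NULLS behavior):

SELECT CASE WHEN NULL = NULL THEN 'match' ELSE 'no match' END AS result;
-- returns 'no match', because NULL = NULL evaluates to UNKNOWN, not TRUE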

But that key column is not null. Anywhere. 









I agree with you that the join will not let NULL values pass the criteria. However, it could have ended up in the target with a FULL load, which then potentially creates duplicates in the upsert logic after a new INCREMENTAL load is initiated.

No. While I do not have access to inspect the metadata of the source table, it is a SQL Server database and that column is the UNIQUE KEY for the table, so I suspect it is also defined as NOT NULL.









Perfect! Then I'm out of options for now

FabianSchut
Super User

Hi,

 

Are there any parallel tasks executed that somehow could interfere with another task that is running? For example, a variable that is overwritten? Furthermore, you mention that there are no duplicate rows. Do you check only the key field for that, or the whole row? In other words, is it possible that the row is unique but the key is duplicated? That could mess up your joins and create duplicates.

Hello @FabianSchut and thank you for your prompt reply. To answer your questions:

There are no parallel processes taking place. We have been careful to space out the triggers and schedules so that there are no overlaps. For sure there is nothing else WRITING or INSERTING into this table at that time. 

The key field is named [rowpointer] and is a GUID. If I do a GROUP BY [rowpointer] HAVING COUNT(*) > 1 on the staging table, I get zero instances. If I do the same on the destination table, I get 6K+ pairs. The  stored procedure that does the UPDATE/INSERT process has JOIN ON that [rowpointer] field. There are no other joins to any other tables during the INSERT that could potentially cause duplicated records. 

I check only the key field, not the entire row. Everything keys off that key field. It is UNIQUE in the source SQL database.
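For reference, the duplicate check described above is just this (the destination table name is a placeholder):

SELECT   rowpointer, COUNT(*) AS cnt
FROM     dbo.DestinationTable
GROUP BY rowpointer
HAVING   COUNT(*) > 1;
-- zero rows when run against the staging table, 6K+ rows against the destination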

 

The ONLY thing I can see, and this is a BIG stretch, is that the Azure Data Factory task that runs the MERGE procedure has a Retry of 1 iteration, set to a 30-second interval. I have set that to 0 for now.









And what about the possibility that a record is inserted that has a NULL value in the key column? Is there any check on the key column that it should not contain NULL values? Looking at your upsert script, that could potentially create duplicates too.
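For example, something along these lines before the upsert runs (the staging table name is a placeholder):

-- Any non-zero count here would be a red flag before the upsert proc fires
SELECT COUNT(*) AS null_key_rows
FROM   dbo.StagingTable
WHERE  rowpointer IS NULL;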
