Hi everyone,
I'm experiencing an issue with duplicate records when performing a Copy Data operation from an on-premises MSSQL database to my lakehouse in Microsoft Fabric.
I've attached an image of the Copy Data component configuration for clarity.
Has anyone encountered a similar issue or have suggestions on how to prevent record duplication?
Hi @bin_curato
May I ask if your problem has been solved? Please let me know if there are any more problems.
Regards,
Nono Chen
If this post helps, then please consider Accepting it as the solution to help the other members find it more quickly.
I'm curious, in the Table action, why is Append being used instead of Overwrite?
Is the intention to append the data, or overwrite the data?
The SQL query returns all data on or after 2009-12-31 23:59:59 each time the pipeline gets run. So if you are using Append, you will accumulate duplicate records if you run the pipeline multiple times.
However, the edp_dat_ingestion (GETDATE()) column should have a different value every time you run the pipeline, so that column means your records should not be true duplicates.
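To check whether any rows really are fully identical, you could run something like this in a notebook (just a sketch; tableName is a placeholder for your destination table):

from pyspark.sql import functions as F

# Group on every column and keep only the full rows that occur more than once (tableName is a placeholder)
df = spark.read.table("tableName")
duplicates = df.groupBy(df.columns).agg(F.count("*").alias("cnt")).filter("cnt > 1")
duplicates.show()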
To be honest, I don't have much SQL experience. I see you are using a NOLOCK hint; I haven't worked with it, but could that be a reason for getting duplicates?
"Additionally, the source table contains records with NULL values in the FIELD_DAT field used in the WHERE clause"
Do the rows with NULL values also get ingested to your destination? If yes, that sounds strange, based on the SQL query.
You could check the delta table version history of your destination table inside Fabric. This way, you could check if it has been written to multiple times.
You can use a notebook with
%%sql
DESCRIBE HISTORY tableName;
or
from delta.tables import DeltaTable

# pathToTable is the path to the Delta table in your lakehouse
deltaTable = DeltaTable.forPath(spark, pathToTable)
fullHistoryDF = deltaTable.history()  # get the full history of the table
fullHistoryDF.show(truncate=False)    # display one row per table version
to see all the versions of a table.
hi @frithjof_v ,
I use Append because, before importing data from my sources, I create the table with the data types that I need. If I use Overwrite, the component recreates the table, but the data types are not correct.
The problem seems to be that some extra records are actually read from the source and then written to the lakehouse.
Hi @bin_curato,
Late to the party, but let's see if this narrows things down for you.
Proud to be a Super User!
Hi @richbenmintz ,
Thanks for your support. I already empty the temporary table and then load the final table.
The problem seems to be that when data is read from the source, some duplicates are generated.
Hi @bin_curato,
We need to ensure that it is not the reading of the delta bronze table that is creating the duplicates.
Your delete creates a logged delete of the data; it still exists in the Delta parquet store, and the query engine is relied on to return the most up-to-date records.
Dropping the table should remove the underlying files and allow you to start from scratch
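For example (a rough sketch; tableName is a placeholder for your bronze table), you could inspect the history and then drop the table so the underlying files are removed:

# Logged DELETE operations show up as versions in the table history
spark.sql("DESCRIBE HISTORY tableName").select("version", "operation", "timestamp").show(truncate=False)

# Dropping the managed table removes the underlying Delta files so you can start from scratch
spark.sql("DROP TABLE IF EXISTS tableName")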
Proud to be a Super User!
hi @richbenmintz ,
I assure you that the table is dropped and then populated with data.
I will check your suggestion, but the problem is that, if you look at the image in yesterday's answer to Srisakthi, 74,057,363 records are read from the source and the same number is written to the table.
How often does new data enter the source table (on the SQL server)?
E.g. once per day, once per hour, once per minute?
Do you get a different row count if you run the same query from SSMS or Data Pipeline (ref. number of rows read in the copy data details)?
I see that the copy activity takes 13 hours. I don't have a lot of experience with Data Pipeline. Is 13 hours a normal duration? I think it sounds long.
I don't know much about NOLOCK, which is used in the query. Could it be causing issues in a long running pipeline?
I see. Sounds strange.
I would double check by checking the version history of the table, ref. my previous comment. Just to see if something looks strange there as well.
Of course, could be a bug.
But I would do some checks, as mentioned above, to try to understand what's happening.
If it is not possible to find a reason, then I guess a support ticket is the next option.
Hi @bin_curato ,
Can you check the monitoring of the specific pipeline run to see how many times it has run, how many records it read from the source table, and how many it transferred?
Also, enable logging and check for the root cause.
Please have a look at this thread; it describes a similar issue.
Thanks,
Srisakthi
Hi @Srisakthi ,
I am confident that my pipeline is only executed once. The number of duplicated records is not equal to the number of records extracted from the source table.
To enable the logs, do you mean enabling them in the settings of the "Copy data" component?
Hi @bin_curato ,
Yes, enable logging so that we can see the files transferred; based on that, we could find where it went wrong.
Regards,
Srisakthi
Hi @Srisakthi ,
I will try to enable the logs, but I will need some time because I must check internally, and the table takes a long time to be read and written to the lakehouse.
Hi @bin_curato ,
After enabling it, it is better to have that table empty and load only that specific table. It will help to find the issue easily, and it also makes it easy to share the pipeline run id with MS.
Regards,
Srisakthi
Hi,
Did you check one of the duplicates in your source?
Did you recreate the table and the copy activity?
Regards,
Oktay
Did I answer your question? Then please mark my post as the solution.
If I helped you, click on the Thumbs Up to give Kudos.
Hi, in my source I have no duplicates. My pipeline is part of an ingestion flow of more than 200 tables, and I have not encountered similar issues with other tables. I also tried to run my pipeline passing parameters manually, but the problem remains.
I know I ask "stupid" questions, but could it be that you are ingesting the same table twice? Check whether your parameter list includes the same table twice (or, if you use iteration, whether the same table gets iterated twice).
Hi, all questions are welcome, because they can make you think and arrive at a solution. There are no iterations once the SELECT is passed to the component. I also tried the sub-pipeline to be sure.
Hi,
Do the duplicate records have the same timestamp (edp_dat_ingestion)? If yes, something very strange is going on. If they are different (or one is null), could it be that you loaded the data initially and added the edp_dat_ingestion column afterwards?
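You could check this with something like the following in a notebook (just a sketch; tableName is a placeholder, and I'm assuming edp_dat_ingestion is the ingestion timestamp column):

from pyspark.sql import functions as F

# For each duplicated row (ignoring the ingestion timestamp), count the distinct timestamps:
# one distinct value means the duplicate was written within a single run,
# several distinct values mean the same row was loaded more than once
df = spark.read.table("tableName")
key_cols = [c for c in df.columns if c != "edp_dat_ingestion"]
(df.groupBy(key_cols)
   .agg(F.count("*").alias("cnt"),
        F.countDistinct("edp_dat_ingestion").alias("distinct_ts"))
   .filter("cnt > 1")
   .show())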
Regards,
Oktay
Did I answer your question? Then please mark my post as the solution.
If I helped you, click on the Thumbs Up to give Kudos.
Hi OktayPamuk80,
The records are exactly identical, so I cannot find an answer to my problem. I think the problem is caused by the "Copy data" component, but I would like to know if it has happened to others. My source table contains more than 74 million records, but that does not explain what happens.