bin_curato
Frequent Visitor

How to Resolve Duplicate Record Issues in Microsoft Fabric Pipeline with Copy Data Component?

Hi everyone,

I'm experiencing an issue with duplicate records when performing a Copy Data operation from an on-premises MSSQL database to my lakehouse in Microsoft Fabric.

Configuration details:

  • On-premises data gateway version: 3000.230.15 (July 2024).
  • Dynamic setup: The Copy Data component is configured with dynamic parameters, including an SQL query that selects data and adds a timestamp column for ingestion:
    SELECT *, getdate() as edp_dat_ingestion
    FROM dbo.TABLE gc WITH (NOLOCK)
    WHERE FIELD_DAT >= '2009-12-31 23:59:59'
  • Issue: Although the original source table has no duplicates, records become duplicated in the Delta Table within the lakehouse destination. Additionally, the source table contains records with NULL values in the FIELD_DAT field used in the WHERE clause.

I've attached an image of the Copy Data component configuration for clarity.

CopyData-configuration.png

Has anyone encountered a similar issue or have suggestions on how to prevent record duplication?

20 REPLIES
v-nuoc-msft
Community Support

Hi @bin_curato 

 

May I ask if your problem has been solved? Please let me know if there are any more problems.

 

Regards,

Nono Chen

If this post helps, then please consider accepting it as the solution to help the other members find it more quickly.

 

frithjof_v
Community Champion

I'm curious, in the Table action, why is Append being used instead of Overwrite?

 

Is the intention to append the data, or overwrite the data?

 

The SQL query returns all data on or after 2009-12-31 23:59:59 each time the pipeline gets run. So if you are using Append, you will accumulate duplicate records if you run the pipeline multiple times.

 

However, the edp_dat_ingestion (GETDATE()) column should be unique every time you run the pipeline. So, that column will mean that your records should not be truly duplicates.
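One way to check this is to look for rows in the destination table that are identical on every column except edp_dat_ingestion. A minimal sketch in a Fabric notebook (PySpark), where the table name is only a placeholder:

from pyspark.sql import functions as F

# Read the destination Delta table (placeholder name).
df = spark.read.table("your_lakehouse_table")

# Group on every column except the ingestion timestamp, so rows appended by
# different pipeline runs still count as duplicates of each other.
key_cols = [c for c in df.columns if c != "edp_dat_ingestion"]

(df.groupBy(key_cols)
   .count()
   .filter(F.col("count") > 1)
   .show(20, truncate=False))

If this returns rows, adding edp_dat_ingestion back into the grouping shows whether the duplicates come from different runs (different timestamps) or from a single run (identical timestamps).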

 

Tbh I don't have much SQL experience. I see you are using a NOLOCK hint. I don't have experience with it, but could that be a reason for getting duplicates?

 

"Additionally, the source table contains records with NULL values in the FIELD_DAT field used in the WHERE clause" 

 

Do the rows with NULL values also get ingested to your destination? If yes, that sounds strange, based on the SQL query.

 

You could check the delta table version history of your destination table inside Fabric. This way, you could check if it has been written to multiple times. 

 

You can use a notebook with

%%sql
DESCRIBE HISTORY tableName;

or

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, pathToTable)
fullHistoryDF = deltaTable.history()  # get the full history of the table

to see all the versions of a table.
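Building on the snippet above, a quick way to see how many times the table has been written, and how many rows each write produced, could look like this (these are the standard Delta history columns, so take it as a sketch rather than anything Fabric-specific):

# Show every operation on the table with its metrics; for write operations,
# operationMetrics contains numOutputRows, so repeated appends stand out.
(fullHistoryDF
    .select("version", "timestamp", "operation", "operationMetrics")
    .orderBy("version")
    .show(truncate=False))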

Hi @frithjof_v,

I use Append because, before importing the data from my sources, I create the table with the data types that I need; if I use Overwrite, the component recreates the table but the data types are not correct.

The problem seems to be that some extra records are read from the source and then written to the lakehouse.

Hi @bin_curato,

Late to the party, but let's see if this narrows things down for you.

  • Add a pre-copy script that drops and recreates the table for you, then perform the append operation (a minimal sketch follows below).
  • This will give you a fresh Delta sink in Bronze, hopefully removing any issues with the deletes that I think you are performing in the step before loading the data.
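For example, a notebook activity (or script) run just before the Copy Data step could drop and recreate the table with the exact data types you want, so the subsequent Append lands in a clean, correctly typed Delta table. A minimal sketch with placeholder table and column names, since the real schema isn't shown in the thread:

# Hypothetical pre-copy step in a Fabric notebook: drop the bronze table and
# recreate it empty with explicit data types before the Copy Data activity
# appends into it.
spark.sql("DROP TABLE IF EXISTS bronze_table")  # placeholder table name

spark.sql("""
    CREATE TABLE bronze_table (
        id                BIGINT,     -- placeholder columns and types
        FIELD_DAT         TIMESTAMP,
        some_value        STRING,
        edp_dat_ingestion TIMESTAMP
    ) USING DELTA
""")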


I hope this helps,
Richard

Did I answer your question? Mark my post as a solution! Kudos Appreciated!

Proud to be a Super User!


Hi @richbenmintz ,
Thanks for your support. I already empty the temporary table and then feed the final table.
The problem seems to be that when data is read from the source, some duplicates are generated.

 

Hi @bin_curato

We need to ensure that it is not the reading of the delta bronze table that is creating the duplicates.

Your delete creates a logged delete of the data; the rows still exist in the underlying Delta parquet files, and the query engine is relied on to return only the most up-to-date records.

 

Dropping the table should remove the underlying files and allow you to start from scratch.
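If you want to verify that, the table history will show the DELETE as just another logged version, and earlier versions (including the supposedly deleted rows) stay readable until the old files are vacuumed away. A rough sketch with a placeholder table name:

# The logged DELETE is simply a new version on top of the old data files,
# so earlier versions of the table can still be read...
spark.sql("SELECT COUNT(*) FROM bronze_table VERSION AS OF 0").show()

# ...until VACUUM physically removes files that are no longer referenced
# (by default only files older than the retention period are removed).
spark.sql("VACUUM bronze_table")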



I hope this helps,
Richard

Did I answer your question? Mark my post as a solution! Kudos Appreciated!

Proud to be a Super User!


Hi @richbenmintz,

I assure you that the table is dropped and then populated with data.
I will check your suggestion, but if you look at the image in yesterday's answer to Srisakthi, 74,057,363 records are read from the source and the same number is written to the table.

How often does new data enter the source table (on the SQL server)?

 

E.g. once per day, once per hour, once per minute?

 

Do you get a different row count if you run the same query from SSMS or Data Pipeline (ref. number of rows read in the copy data details)?

 

I see that the copy activity takes 13 hours. I don't have a lot of experience with Data Pipeline. Is 13 hours a normal duration? I think it sounds long.

 

I don't know much about NOLOCK, which is used in the query. Could it be causing issues in a long running pipeline?

I see. Sounds strange.

 

I would double check by checking the version history of the table, ref. my previous comment. Just to see if something looks strange there as well.

 

Of course, could be a bug.

 

But I would do some checks, as mentioned above, to try to understand what's happening.

 

If it is not possible to find a reason, then I guess support ticket is the next option.

Srisakthi
Resolver II

Hi @bin_curato ,

 

Can you check the monitoring of the specific pipeline run to see how many times it has run, how many records it read from the source table, and how many it transferred?

Also enable logging and check for the root cause.

 

Please have a look at this thread, it is similar to this issue

https://community.fabric.microsoft.com/t5/Data-Pipeline/On-Prem-Copy-Data-in-Data-Pipeline-Appending...

 

Thanks,

Srisakthi

 

Hi @Srisakthi , 

I am confident that my pipeline is only executed once. The number of duplicated records is not the same as the number of records extracted from the source table.
Screenshot 2024-11-04 094429.png
To enable the logs, do you mean enabling them in the settings of the "Copy data" component?

Hi @bin_curato ,

 

Yes, enable logging so that we can see the files transferred; based on that, we could find where it went wrong.

 

Regards,

Srisakthi

Hi @Srisakthi ,

I will try to enable the logs, but I will need some time because I must check internally, and the table takes a long time to be read and written to the lakehouse.

Hi @bin_curato ,

 

After enabling it, it's better to have that table empty and load only that specific table. It will help to find the issue more easily, and it's also easier to share the pipeline run ID with MS.

 

Regards,

Srisakthi

OktayPamuk80
Resolver II

Hi,

Did you check one of the duplicates in your source?

Did you recreate the table and the copy activity?

Regards,
Oktay

 

Did I answer your question? Then please mark my post as the solution.

If I helped you, click on the Thumbs Up to give Kudos.

Hi, in my source I have no duplicates. My pipeline is part of an ingestion flow of more than 200 tables, and I have not encountered similar issues in the other tables. I also tried to run my pipeline passing the parameters manually, but the problem remains.

I know I ask "stupid" questions, but could it be that you are ingesting the same table twice? Check whether your parameter list includes the same table two times (or, if you have an iteration, whether the same table gets iterated twice).

Hi, all questions are welcome, because they can make you think your way to a solution. There are no iterations once the SELECT is passed to the component. I also tried the sub-pipeline to be sure.

OktayPamuk80
Resolver II

Hi, 

Do the duplicate records have the same timestamp (edp_dat_ingestion)? If yes, something very strange is going on. If they are different (or one is null), could it be that you initially loaded the table and afterwards added the edp_dat_ingestion column?

Regards,
Oktay

 

Did I answer your question? Then please mark my post as the solution.

If I helped you, click on the Thumbs Up to give Kudos.

Hi OktayPamuk80,

The records are exactly identical, so I cannot find an answer to my problem. I think the problem is caused by the "Copy data" component, but I would like to know if it has happened to others. My source table contains more than 74 million records, but this does not explain what happens.
