Hi everyone,
I'm experiencing an issue with duplicate records when performing a Copy Data operation from an on-premises MSSQL database to my lakehouse in Microsoft Fabric.
I've attached an image of the Copy Data component configuration for clarity.
Has anyone encountered a similar issue or have suggestions on how to prevent record duplication?
Hi @bin_curato
May I ask if your problem has been solved? Please let me know if there are any more problems.
Regards,
Nono Chen
If this post helps, then please consider accepting it as the solution to help other members find it more quickly.
I'm curious, in the Table action, why is Append being used instead of Overwrite?
Is the intention to append the data, or overwrite the data?
The SQL query returns all data on or after 2009-12-31 23:59:59 each time the pipeline runs. So if you are using Append, you will accumulate duplicate records whenever the pipeline runs more than once.
However, the edp_dat_ingestion column (GETDATE()) should be different for every run of the pipeline, so strictly speaking the records should not be true duplicates (see the check sketched below).
To be honest, I don't have much SQL experience, but I see you are using a NOLOCK hint. From what I have read, NOLOCK means dirty reads, and on a long-running scan it can return the same row twice (or skip rows) if pages move while the scan is in progress. Could that be a reason for getting duplicates?
"Additionally, the source table contains records with NULL values in the FIELD_DAT field used in the WHERE clause"
Do the rows with NULL values also get ingested to your destination? If yes, that sounds strange, based on the SQL query.
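One way to check whether the rows are true duplicates or only differ by the ingestion timestamp is sketched below. The table name stg_my_table is only a placeholder for your destination table; edp_dat_ingestion is the column mentioned above.

from pyspark.sql import functions as F

# Placeholder name -- replace with your destination table in the lakehouse
df = spark.read.table("stg_my_table")

# All columns except the ingestion timestamp
business_cols = [c for c in df.columns if c != "edp_dat_ingestion"]

# For every set of rows that are identical on the business columns,
# count the rows and the distinct ingestion timestamps they carry
dup_check = (df.groupBy(business_cols)
    .agg(F.count("*").alias("row_count"),
         F.countDistinct("edp_dat_ingestion").alias("distinct_ingestions"))
    .filter("row_count > 1"))

dup_check.show(truncate=False)

If distinct_ingestions is greater than 1 for the duplicated groups, the copies come from separate pipeline runs; if it is 1, the duplication happened within a single run.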
You could check the delta table version history of your destination table inside Fabric. This way, you could check if it has been written to multiple times.
You can use a notebook with

%%sql
DESCRIBE HISTORY tableName;

or

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, pathToTable)  # pathToTable is the path of your destination table
fullHistoryDF = deltaTable.history()  # get the full history of the table
fullHistoryDF.show()

to see all the versions of a table.
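To dig a bit deeper, the history also carries operation metrics, so you can see how many rows each write produced. A minimal sketch, assuming a table named stg_my_table (adjust the name to your destination table):

from pyspark.sql import functions as F
from delta.tables import DeltaTable

# Example table name -- point this at your destination table
delta_table = DeltaTable.forName(spark, "stg_my_table")

# One row per table version: which operation ran, when, and how many rows it wrote
(delta_table.history()
    .select("version", "timestamp", "operation",
            F.col("operationMetrics")["numOutputRows"].alias("rows_written"))
    .orderBy("version")
    .show(truncate=False))

If there are more write versions than expected, or a single version with roughly double the expected row count, that narrows down where the duplication is introduced.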
Hi @frithjof_v,
I use Append because, before importing data from my sources, I create the table with the data types that I need. If I use Overwrite, the component recreates the table, but the data types are not correct.
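For context, pre-creating a Delta table with explicit types in a notebook looks roughly like this; the table and column names below are only illustrative, not the real schema:

# Illustrative sketch only -- the table and column names are placeholders
spark.sql("""
    CREATE TABLE IF NOT EXISTS stg_my_table (
        ID                BIGINT,
        FIELD_DAT         TIMESTAMP,
        AMOUNT            DECIMAL(18, 2),
        edp_dat_ingestion TIMESTAMP
    )
    USING DELTA
""")

With the table created up front, the Copy activity in Append mode just inserts into the existing schema instead of inferring new types.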
The problem seems to be that some extra records are read from the source and then written to the lakehouse.
Hi @bin_curato,
Late to the party, but let's see if this narrows things down for you.
Proud to be a Super User!
Hi @richbenmintz ,
thanks for your support. I already empty the temporary table first and then load the final table.
The problem seems to be that when the data is read from the source, some duplicates are generated.
Hi @bin_curato,
We need to make sure that it is not the reading of the delta bronze table that is creating the duplicates.
Your delete creates a logged delete of the data: the rows still exist in the underlying Delta parquet files, and the query engine is relied on to return only the most up-to-date records.
Dropping the table should remove the underlying files and allow you to start from scratch.
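In notebook terms, the two options look roughly like this (the table name is only an example, and these are alternatives, not steps to run in sequence):

# Logical delete: the rows disappear from queries, but the old parquet files
# remain behind as earlier versions of the Delta table
spark.sql("DELETE FROM stg_my_table")

# Dropping the table instead: for a managed lakehouse table this also removes
# the underlying files, so the next pipeline run starts from a clean slate
spark.sql("DROP TABLE IF EXISTS stg_my_table")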
Proud to be a Super User!
Hi @richbenmintz,
I assure you that the table is dropped and then populated with data.
I will check your suggestion, but the problem is that, if you look at the image in yesterday's answer to Srisakthi, 74,057,363 records are read from the source and the same number is written to the table.
How often does new data enter the source table (on the SQL server)?
E.g. once per day, once per hour, once per minute?
Do you get a different row count if you run the same query from SSMS compared to the Data Pipeline (ref. the number of rows read in the copy data details)?
I see that the copy activity takes 13 hours. I don't have a lot of experience with Data Pipeline. Is 13 hours a normal duration? I think it sounds long.
I don't know much about NOLOCK, which is used in the query. Could it be causing issues in a long running pipeline?
I see. Sounds strange.
I would double-check the version history of the table (ref. my previous comment), just to see if something looks strange there as well.
Of course, it could be a bug.
But I would do the checks mentioned above to try to understand what's happening.
If it is not possible to find a reason, then I guess a support ticket is the next option.
Hi @bin_curato ,
Can you check the monitoring of the specific pipeline run to see how many times it has run, how many records it read from the source table, and how many it transferred?
Also, enable logging and check for the root cause.
Please have a look at this thread; it describes a similar issue.
Thanks,
Srisakthi
Hi @Srisakthi ,
I am confident that my pipeline is executed only once. The number of duplicated records is not equal to the number of records extracted from the source table, so it does not look like a full second load.
To enable the logs, do you mean enabling them in the settings of the "Copy data" component?
Hi @bin_curato ,
Yes, enable logging so that we can see the files transferred; based on that, we can find where it went wrong.
Regards,
Srisakthi
Hi @Srisakthi ,
I will try to enable the logs, but I will need some time because I have to check internally first, and the table takes a long time to be read and written to the lakehouse.
Hi @bin_curato ,
After enabling it, it is better to have that table empty and to load only that specific table. That will make it easier to find the issue, and also easy to share the pipeline run ID with MS.
Regards,
Srisakthi
Hi,
Did you check one of the duplicates in your source?
Did you recreate the table and the copy activity?
Regards,
Oktay
Did I answer your question? Then please mark my post as the solution.
If I helped you, click on the Thumbs Up to give Kudos.
Hi, in my source I have no duplicates. My pipeline is part of an ingestion flow of more than 200 tables, and I have not encountered similar issues in the other tables. I also tried to run my pipeline passing the parameters manually, but the problem remains.
I know I ask "stupid" questions, but could it be that you are ingesting the same table twice? Check whether your parameter list includes the same table two times (or, if you iterate over tables, whether the same table gets iterated twice).
Hi, all questions are welcome, because they can help you think your way to a solution. There are no iterations once the SELECT is passed to the component. I also tried a sub-pipeline to be sure.
Hi,
Do the duplicate records have the same timestamp (edp_dat_ingestion)? If yes, something very strange is going on. If they are different (or one is null), could it be that you initially loaded the data and afterwards added the edp_dat_ingestion column?
Regards,
Oktay
Did I answer your question? Then please mark my post as the solution.
If I helped you, click on the Thumbs Up to give Kudos.
Hi OktayPamuk80,
The records are exactly identical, so I cannot find an answer to my problem. I think the problem is caused by the "Copy data" component, but I would like to know if it has happened to others. My source table contains more than 74 million records, but that alone does not explain what is happening.
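For anyone who hits the same thing: while the root cause is being investigated, one hedged way to clean the destination is to materialize a de-duplicated copy next to the original and compare the counts (the table names below are placeholders):

# Placeholder names -- adjust to your lakehouse table
# Materialize a de-duplicated copy next to the original, then compare counts
spark.sql("""
    CREATE TABLE stg_my_table_dedup
    USING DELTA
    AS SELECT DISTINCT * FROM stg_my_table
""")

spark.sql("SELECT COUNT(*) AS rows_before FROM stg_my_table").show()
spark.sql("SELECT COUNT(*) AS rows_after FROM stg_my_table_dedup").show()

Since the duplicates are reported as exactly identical, SELECT DISTINCT removes them; on 74+ million rows this rewrite is heavy, so verify the counts before swapping anything over.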