alozovoy
Advocate II

Duplicated Rows In Tables Built By Notebook

I was seeing incorrect numbers after joining two tables together. I have isolated the issue as shown in the following example:

 

I take the existing delta table named Rainbow_Accounting_Adjustments, and I save it as a dataframe using a temporary view. I then save the dataframe as a delta table named Business_Unit_Adjustment_Totals.

 

[Screenshot: notebook code copying Rainbow_Accounting_Adjustments into Business_Unit_Adjustment_Totals]
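For reference, this is roughly what the notebook cell in the screenshot does, as a minimal sketch (the temporary view name here is just a placeholder):

# `spark` is the session provided by the Fabric notebook
df = spark.read.table("Rainbow_Accounting_Adjustments")  # table loaded by Dataflow Gen 2
df.createOrReplaceTempView("rainbow_adjustments_vw")     # placeholder view name

copy_df = spark.sql("SELECT * FROM rainbow_adjustments_vw")
copy_df.write.mode("overwrite").format("delta").saveAsTable("Business_Unit_Adjustment_Totals")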

 

Business_Unit_Adjustment_Totals should be a direct copy of Rainbow_Accounting_Adjustments. Instead, there is a duplicate of each row.

 

[Screenshot: query of Business_Unit_Adjustment_Totals showing each row duplicated]

 

Rainbow_Accounting_Adjustments is updated by Dataflow Gen 2. I have noticed similar problems with other tables created/updated by Dataflow Gen 2. Sometimes it is not an exact duplicate, but shows old data from a previous import.

 

Any advice on how to resolve this issue would be appreciated!

frithjof_v
Super User

I created an idea: https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=cea60fc6-93f2-ee11-a73e-6045bd7cb2b6 

 

Please vote if you want this issue to be fixed

amaaiia
Super User

Same issue, any solution?

In my case the issue resolved itself when the dataflow re-ran. It looks like (in my case at least) the dataflow failed to "version" the Delta table in the lakehouse, and for some reason only Notebooks then return "multiple versions" of the table (the new data AND the old data).

 

Re-running the dataflow might resolve your issue, or if all else fails, delete the table in the lakehouse manually and re-run the dataflow.
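Before dropping anything, it can also be worth checking the table's Delta history from a notebook, to see whether the dataflow's last "replace" actually committed a new version (a sketch using the table name from the original post):

# Each Dataflow Gen2 "replace" should appear here as a new table version
spark.sql("DESCRIBE HISTORY Rainbow_Accounting_Adjustments") \
    .select("version", "timestamp", "operation") \
    .show(truncate=False)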

 

MS was not able to provide me with a solution or explanation, since by the time they got in touch with me the issue had resolved itself (the next day).

On Feb 19th I got the same issue with another table (check my post), and what I did is what you're describing: delete it manually and start again. But as this has happened again, I think there should be a fix from MS's side. In my case I didn't have any issue when overwriting the table through DFg2 as you mentioned you had with your 6 tables; all my tables were successfully loaded with no errors during DFg2 publishing, so I can't understand why this is happening.

 

I guess I'm creating a new support ticket...

Rather than manually deleting, you can set up a notebook to run a delete script each time before your dataflow executes. Here's an example I have that runs nightly before my dataflows:

 

[Screenshot: notebook delete script that drops the destination tables before the dataflows run]
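In case the screenshot doesn't come through: the idea is simply a cell like the sketch below, scheduled before the dataflows (the table list is a placeholder for whatever your dataflow loads):

# Drop the destination tables so the Dataflow Gen2 refresh re-creates them from scratch
tables_to_reset = ["Rainbow_Accounting_Adjustments"]  # list every table your dataflow writes to

for table_name in tables_to_reset:
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")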

 

I'm curious, will this work if you are using this table in a relationship in a Semantic model?

 

I would assume any relationships would get broken? (Or that the table cannot be dropped if there is a relationship)

 

Or if there is a OneLake shortcut pointing to this table, I would assume that any shortcut would get broken when dropping the table?

 

Does anyone know / has anyone tried this?

GregMarbais
Kudo Collector

I'm seeing the same thing

frithjof_v
Super User

I was able to reproduce a similar issue.
I created a table (TestData) in a Lakehouse by using a dataflow gen2.

Afterwards, I used the same dataflow gen2 to update the data in this table (TestData).
I also created another dataflow gen2 which I also used to write some data to the same table (TestData).
I also switched between append and replace mode.

The last few refreshes have been only replace mode.

(Switching between append and replace, and having two Dataflow Gen2s updating the same table, may be an edge case 😊 It's probably not something I am going to be doing a lot.)

 

If I query my table now, it looks like this from the various interfaces:

SQL Analytics Endpoint:

[Screenshot: SQL Analytics Endpoint query result for TestData]

 

 

Lakehouse explorer:

 

[Screenshot: Lakehouse explorer preview of TestData]

 

 

Notebook: 

(Here, surprisingly, some extra rows are present. These rows were inserted in a previous write to the table, and should not be included in the current version of the table.)

 

[Screenshot: Notebook query result for TestData showing the extra rows]
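One thing that might be worth trying from the notebook side (just a diagnostic sketch, not a confirmed fix) is refreshing the table's cached metadata before counting, to rule out the notebook reading a stale file listing:

# Invalidate any cached metadata / file listing for the table, then re-read it
spark.catalog.refreshTable("TestData")
print(spark.read.table("TestData").count())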

 

Yes, looks like you were able to reproduce it!

 

I am in contact with Microsoft support, so this is very helpful to have a second example from a different user!

Hi @alozovoy 

 

Thanks for using Microsoft Fabric Community.

Apologies for the delay in response.

Could you please provide the support ticket number, as it would help us track it for more information.

Just checking back to see if you have a resolution yet. If you do, please share it with the community, as it can be helpful to others.
Otherwise, we will respond back with more details and try to help.


Thanks.

The support ticket is #2401240010003437. It has not yet been resolved.

I'm experiencing a similar issue:
My dataflow Gen2 is storing data in a lakehouse. Once in a while the 'replace' table setting in the dataflow Gen2 doesn't seem to work, and it results in the same data being copied twice. It only seems to affect smaller tables.

I believe I have identified the root cause: in some cases my Dataflow Gen2 would fail on one table. For 6 of my tables the replace load completed without issues in the same flow. However, at the 'Write' step for the last table it would fail, causing the full Dataflow Gen2 run to fail (with 6 tables completely loaded and 1 failed).

In the above scenario those 6 tables would somehow have persisted duplicates when I queried them via the Lakehouse. I could only get rid of them by dropping the tables and reloading the dataflow. As long as my Dataflow Gen2 doesn't fail, I don't seem to get the duplicate issue.

However, as alozovoy pointed out, running a notebook that drops all the tables prior to the replace seems to be the most robust approach for now, until Microsoft fixes this quirk.

frithjof_v
Super User

If you run this code in Notebook:

 

%%sql

SELECT COUNT(*) FROM Rainbow_Accounting_Adjustments

 

 

And this code in SQL Analytics Endpoint:

 

SELECT COUNT(*) FROM Rainbow_Accounting_Adjustments

 

 

Do you get double the number of rows in the Notebook?

 

I've created a workaround by using a separate notebook to drop the table each time before the Dataflow runs, which then creates a new table. Because of this I can't reproduce the problem at the moment.

frithjof_v
Super User

You mention that you use dataflow gen2 to update the data in the table.

 

Do you use replace or append in the Dataflow gen2 destination settings?

 

If you are using append, then it would be expected to see extra rows.

 

If you are using replace, then I agree there should not be duplicates in the destination table. In that case I would create a support ticket.

It is set to replace.

 

Regardless, that should only impact the creation of the first table (Rainbow_Accounting_Adjustments). As the first query shows, this table appears to import correctly (at least when using the SQL endpoint).

 

However when I use this table in a notebook it has duplicate rows.

 

Just to clarify:

Excel   ----(Dataflow Gen2)--->  Rainbow_Accounting_Adjustments  ---(Notebook)--->  Business_Unit_Adjustment_Totals

I see

 

So if you query Rainbow_Accounting_Adjustments from the SQL analytics endpoint, you get the right number of rows?

 

And if you query Rainbow_Accounting_Adjustments from a Notebook, you get double the number of rows?

 

I think I experienced the same thing myself, if I remember correctly. But I deleted my Lakehouse and dataflow gen2 afterwards. If you still have your Lakehouse and dataflow gen2, I encourage you to raise a support ticket so that support can have a look at it

I have opened a Support Ticket.

 

Thank You!

Hi @alozovoy ,

 

Any chance you got a resolution on your ticket? I am having the same issue: the SQL Endpoint is giving me the correct number of rows, but Notebooks return multiple versions of records (this started for me yesterday).

 

Thank you
