DataflowGen2 in a Data Pipeline is intermittently inserting data twice despite uniqueness measures

kallens · ‎12-13-2024

Hello,

I am building a Data Pipeline to update my table in a data warehouse with a rolling 31 days of data.

The Data Pipeline deletes the last 31 days of data.
It then calls a Dataflow Gen2 from Power Query to bring in 31 days of new data and append it to the table where the data was just deleted.
- The flow has an anti-join in place that checks for dates already in the table and doesn't insert data if the date is already present.

The problem I am having is that sometimes the Data Pipeline inserts double the amount of data that the Dataflow Gen2 should bring in, but this does not happen consistently.
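The intended logic described above can be sketched in plain Python. This is only an illustration of the delete-then-anti-join-append idea, not the actual pipeline or Power Query code; the row shape (`date`, `value`) and the helper names `delete_window` and `anti_join_append` are made up for the example.

```python
from datetime import date, timedelta

def delete_window(table, today, days=31):
    """Drop rows dated within the last `days` days (the pipeline's delete step)."""
    cutoff = today - timedelta(days=days)
    return [row for row in table if row["date"] < cutoff]

def anti_join_append(table, incoming):
    """Append only incoming rows whose date is not already in the table."""
    existing_dates = {row["date"] for row in table}
    return table + [row for row in incoming if row["date"] not in existing_dates]

today = date(2024, 12, 13)
# Hypothetical data: one row per day for the last 40 days.
table = [{"date": today - timedelta(days=d), "value": d} for d in range(1, 41)]
incoming = [{"date": today - timedelta(days=d), "value": d} for d in range(1, 32)]

after_delete = delete_window(table, today)
after_append = anti_join_append(after_delete, incoming)
rerun = anti_join_append(after_append, incoming)  # a second, sequential append
```

Under these assumptions a second append that starts after the first one has finished adds nothing, which is the behaviour the anti-join is meant to guarantee.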

For example, the flow ran last night on a schedule and inserted 75,123 rows of data (green arrow), which is double what it should be. When I reran the flow this morning, it correctly inserted 38,171 rows of new data (pink arrow).

EDIT TO ADD: the previous night, the data inserted just fine, not duplicated!

This only happens when the Dataflow Gen2 is run within the Data Pipeline. When I run the flow manually in Power Query, it works fine.

I want to use the Data Pipeline so that it only inserts data when data has successfully been deleted. Any idea as to why this is happening?

kallens · ‎01-28-2025

Update: I was never able to work out why this was happening, so I rebuilt the pipeline differently. Instead of appending directly from the Power Query that makes the API call, the flow now does a replacement on a separate table in the warehouse, and the data is appended from there. I have not received duplicate data with this method. I think it had something to do with the Append statement.
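A rough sketch of that staging pattern, using Python's built-in `sqlite3` as a stand-in for the Fabric warehouse. The table names (`staging`, `fact`), the columns, and the `refresh` helper are hypothetical, and the real T-SQL and pipeline activities would look different. Whether the rebuilt pipeline still deletes the window before appending is also an assumption here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (report_date TEXT, metric INTEGER);
    CREATE TABLE fact (report_date TEXT, metric INTEGER);
    INSERT INTO fact VALUES ('2024-12-01', 10), ('2024-12-02', 20);
""")

def refresh(conn, fresh_rows, window_start):
    """Replace staging, then swap the window in the fact table in one transaction."""
    with conn:  # commits on success, rolls back on error
        # Step 1: the dataflow replaces the staging table's contents.
        conn.execute("DELETE FROM staging")
        conn.executemany("INSERT INTO staging VALUES (?, ?)", fresh_rows)
        # Step 2: delete the rolling window, then append from staging.
        conn.execute("DELETE FROM fact WHERE report_date >= ?", (window_start,))
        conn.execute("INSERT INTO fact SELECT report_date, metric FROM staging")

fresh = [("2024-12-02", 21), ("2024-12-03", 30)]
refresh(conn, fresh, "2024-12-02")
refresh(conn, fresh, "2024-12-02")  # re-running does not add rows
rows = conn.execute("SELECT * FROM fact ORDER BY report_date").fetchall()
```

Because the staging table is fully replaced on every run and the append happens inside the warehouse, a repeated run leaves the same rows in place rather than stacking a second copy.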

Anonymous · ‎12-26-2024

Hi @kallens ,

What are the results of your API calls to import data into Power Query?

If the problem persists, please provide the relevant screenshot information with a description and I'll get back to you as soon as possible.

Best regards,

Adamk Kong

kallens · ‎01-07-2025

Thanks Adamk. Right now the results continue to be inconsistent in Fabric via the Pipeline and the Power Query, and when I check my API call it seems to produce the correct number of rows, i.e. the unduplicated count. I am using Supermetrics to create and generate the API query and results. Here are some screenshots with supporting documentation.

- When I run the queries in Supermetrics Query Manager, the results are typically ~40k rows and ~190 rows respectively (see slides 5 & 6)
- When I put it into Power Query and run it there to get a row count, it's the same, ~40k rows (slide 2)
- In my Power Query I have a left anti-join that checks for dates already in my destination table so that it doesn't add any dates that already exist (this is more of a failsafe)
- It's set to append to a table I have set up in my Microsoft Fabric data warehouse
- I also have a Data Pipeline in place that deletes the last 31 days' worth of data from the table each night and then, upon that success, calls the Power Query dataflow to insert the latest 31 days of data into the table (slide 7)
- I currently have the Data Pipeline set to run at 1 AM Pacific Time each night
- The problem I am experiencing: sometimes (intermittently, not every time) the data added to my table from the Power Query flow is exactly duplicated

For example, last night the row count added by the Power Query flow was 86,262, which I suspected was duplicated.
When I re-ran the Data Pipeline manually this morning, it successfully deleted those ~86k rows and then inserted the correct number of rows, which is 43,131: exactly half of what was inserted last night.
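The "exactly half" observation can be checked directly, and the same idea extends to checking whether every row in the table appears exactly twice. The sketch below uses the two row counts quoted above; the sample rows and their column layout (date, campaign, clicks) are invented purely for illustration.

```python
from collections import Counter

# Row counts reported in the post above.
scheduled_rows = 86_262
manual_rows = 43_131
is_exact_double = scheduled_rows == 2 * manual_rows

def copies_per_row(rows):
    """Return the set of multiplicities seen across distinct rows."""
    return set(Counter(rows).values())

# Hypothetical rows as (date, campaign, clicks) tuples, each loaded twice.
sample = [("2024-12-10", "a", 5), ("2024-12-11", "b", 7)] * 2
multiplicities = copies_per_row(sample)
```

If a check like this returned only `{2}` on the real table, it would suggest the whole batch was written twice rather than some subset of rows being repeated.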

I know that my pipeline or flow isn't running twice or concurrently because I have the left anti-join set up to not insert any dates that already exist in the table.
I also tested having the Power Query run on its own every night, independent of the Data Pipeline, and still got duplicate results.
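One thing worth testing is whether the anti-join really rules out a double load. The sketch below shows two hypothetical scenarios in which a date-based anti-join against the destination would still let duplicates through. Nothing in this thread confirms that either one is the actual cause; the data and the `anti_join` helper are made up for the example.

```python
from datetime import date

def anti_join(incoming, existing_dates):
    """Keep incoming rows whose date is absent from the destination snapshot."""
    return [row for row in incoming if row["date"] not in existing_dates]

# Hypothetical batch of three rows for dates missing from the destination.
batch = [{"date": date(2024, 12, d), "value": d} for d in (10, 11, 12)]

# Case 1: the source itself returns every row twice. No date exists in the
# destination yet, so the anti-join passes both copies.
destination = []
doubled_batch = batch + batch
destination += anti_join(doubled_batch, {r["date"] for r in destination})
case1_rows = len(destination)

# Case 2: two overlapping runs both read the destination before either writes.
destination = []
snapshot_a = {r["date"] for r in destination}
snapshot_b = {r["date"] for r in destination}
destination += anti_join(batch, snapshot_a)
destination += anti_join(batch, snapshot_b)
case2_rows = len(destination)
```

In both cases the destination ends up with six rows instead of three, so an anti-join on dates only protects against a second load that starts after the first has been committed.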

So far I haven't been able to recreate this duplication when I manually run the query or pipeline myself during the day. It seems to only happen on my overnight schedules. Is there something about the time at which I am running it that could cause the data to come through twice?

Do you have any other hypotheses as to why this could be happening? I have set my Data Pipeline to run again today on a schedule to see if the time of day affects it, and to see whether it only happens on a schedule or whether I can get it to duplicate when I trigger it manually.

I appreciate your help and time and anything you can suggest for me to try and test!

kallens · ‎12-17-2024

I stand corrected on my claim that the data only duplicates when the Power Query Dataflow Gen2 runs inside the pipeline. I have had it running independently on a schedule outside the Data Pipeline and got duplicated results the last 2 nights as well. 😞

I have a measure in place to not insert data into the table if the date already exists, so it shouldn't be loading in twice. I might need to check the results coming from the API call I am using to get the data into the Power Query.

lbendlin · ‎12-13-2024

What do you consider "today" and "night"? I wonder if there are timezones other than UTC involved.

kallens · ‎12-13-2024

I have the data pipeline set to run at 1 AM Pacific Time Zone (that is the 'night') and then 'today' was when I ran it around 10:23 AM Pacific Time.

What is strange is that the previous night, on its scheduled run, the Data Pipeline ran as expected and inserted the correct amount of data, with no duplicates, and there were no material changes to the pipeline between those times.
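As a quick check of the timezone question, the two run times mentioned above can be converted to UTC to see whether they fall on different calendar dates. This sketch assumes both runs happened on the post date (12-13-2024) and uses a fixed UTC-8 offset for Pacific Standard Time, which holds in December; the `window_start` helper is a hypothetical stand-in for however the flow computes its 31-day window.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-8 offset for Pacific Standard Time, valid in December.
PST = timezone(timedelta(hours=-8), "PST")

scheduled = datetime(2024, 12, 13, 1, 0, tzinfo=PST)    # nightly 1 AM run
manual = datetime(2024, 12, 13, 10, 23, tzinfo=PST)     # manual rerun

scheduled_utc = scheduled.astimezone(timezone.utc)
manual_utc = manual.astimezone(timezone.utc)

def window_start(run_time, days=31):
    """First date of a rolling window ending on the run's calendar date."""
    return run_time.date() - timedelta(days=days)

same_window = window_start(scheduled_utc) == window_start(manual_utc)
```

Under these assumptions both runs land on the same UTC date (09:00 and 18:23 UTC), so a UTC-based window would be identical for the two. A run between 4 PM and midnight Pacific would cross into the next UTC day, which is where a timezone mismatch could start to matter.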
