This morning I was contacted by the business advising me that a report had not refreshed. I dived in and checked all the run records. Everything ran on time. No errors.
Upon further investigation, the data had been fetched from the source system (an on-premises SQL Server) into my Bronze lakehouse and had made it to the Silver lakehouse, but the refresh into the Gold lakehouse had not picked up the new data. It had run and reported no errors, yet the Silver lakehouse had somehow not been up to date at the time the extract into Gold executed.
It was as if a transaction had not committed.
By the time I dived in, the Silver-to-Gold dataflow was seeing the correct data, so all I had to do was refresh a few dataflows, but I do not understand how this could happen.
To be clear, these are Dataflow Gen2 steps against lakehouses in a single data pipeline. Each step runs on success of the previous one. So, for example, the Bronze-to-Silver Dataflow Gen2 runs, works and completes with success, but the subsequent Silver-to-Gold Dataflow Gen2, on this occasion, did not see the up-to-date data. It did see it some minutes later.
How can this be? How can the pipeline initiate the subsequent step on success before the data is "settled", for want of a better term, in the lakehouse?
This is an infrequent issue but enough to worry the business.
Do I have a fundamental misunderstanding of how these artefacts work?
Thanks, but my M code looks like your earlier example: "LakehouseId" ..., ItemKind = "Table", etc.
These are extremely small amounts of data, looking at jobs booked for a day (low hundreds) and which vehicles or drivers may be unavailable (less than 10), so I'm confident I'm not stressing anything.
I don't have other suggestions for what may be the cause.
I haven't got a lot of experience with data pipelines running multiple Dataflow Gen2 activities in series.
But I think it sounds like a normal way to orchestrate the data transformations through the medallion structure.
So I share your concern about this.
Have you been able to reproduce the issue or has it just been a one-off? Maybe it could be solved by using a wait activity in the pipeline. Then again, I don't know how many seconds/minutes to wait.
Hope someone else can help on this topic!
It isn't a one-off but I can't systematically reproduce it. It is infrequent.
How long to wait would indeed be the question.
I'm very grateful for your input and like you I hope we see some more opinions on this.
Btw - and I'm not sure if this has anything to do with it - but is the dataflow making any schema changes to your table (adding, renaming or removing columns), or is it only updating the data?
And what settings do you have here, in the dataflow's data destination options?
Maybe some of these settings impact how long it takes before the table is ready for querying in the Lakehouse.
(Anyway, I think the data pipeline should not mark an activity as completed before the table is ready to be queried by the next step.)
I'm not making any schema changes in these dataflows. The settings are "Auto mapping and allow for schema change", but the columns are static. I have noticed, during development, that metadata changes can take some persuasion to make it to all the parts they need to reach, so I have avoided being that clever in normal production thus far.
I see.
So this one: https://learn.microsoft.com/en-us/fabric/data-factory/dataflow-gen2-data-destinations-and-managed-se...
The docs say that the table is being dropped and recreated on each dataflow refresh.
I did a test of that and I don't think the table is actually being dropped and recreated. I described it here: Re: Dataflow Gen2 - Table not getting dropped and ... - Microsoft Fabric Community
Have you tried using a Notebook to check the version history of your table, to see what time the ReplaceTable and Update operations took place? If there are any deviations in the timestamps around the time your dataflow refresh ran, they might help in understanding the issue that occurred.
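For reference, a minimal sketch of what that check could look like in a Fabric notebook with PySpark, using the notebook's built-in spark session; "silver_jobs" is a placeholder, so point it at your actual Silver table:

```python
# Minimal sketch: list the recent commits on a Silver Delta table so their
# timestamps can be compared against the pipeline's activity run times.
# "silver_jobs" is a placeholder table name; substitute your own Silver table.
from pyspark.sql import functions as F

history = spark.sql("DESCRIBE HISTORY silver_jobs")

(history
    .select("version", "timestamp", "operation", "operationMetrics")
    .orderBy(F.col("version").desc())
    .show(20, truncate=False))
```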
Thanks for that. Very useful.
More than one pipeline "failed" in this way today.
I have realised that one of the misbehaving operations is actually a notebook that deletes clashing incoming IDs from the target in Silver, followed by a simple append using a pipeline Copy activity from Bronze to Silver, so the table wouldn't be recreated. In any case, the timestamp indicates that the update finished just under a minute before the subsequent dataflow ran and didn't find the data in Silver.
Another one is a Dataflow Gen2 performing a replace. I have checked both using the SQL history command you kindly pointed out, and both writes finished just under one minute before the subsequent step that didn't read the new data. Both wrote into the Silver lakehouse in their different ways. The simpler dataflow replace operation was the second to run, but both sets of new data (in different tables, one appended and one replaced) were unseen by the dataflows that subsequently ran to pick up from Silver and move to Gold.
It is as if the Silver lakehouse just didn't want to be rushed into giving up its new data quite so soon.
Interesting!
Then I don't understand why the SilverToGold Dataflow Gen2 fails to pick up the new data. According to the timings which you mention, the new data should already be in Silver almost one minute before running the SilverToGold Dataflow Gen2, right?
Does the SilverToGold Dataflow Gen2 replace or append data into Gold?
In the "failed" runs, does the SilverToGold Dataflow Gen2 pick up old data from the Silver table, or does it not pick up any data at all, i.e. does it see an empty table?
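If that is hard to tell after the fact, Delta time travel might help: query the Silver table as it existed at the moment the SilverToGold dataflow started and see whether it held the old rows or nothing at all. A rough notebook sketch (PySpark), where the table name and the timestamp are placeholders:

```python
# Rough sketch: read the Silver table as of the time the SilverToGold dataflow
# started, to distinguish "saw old data" from "saw an empty table".
# "silver_jobs" and the timestamp are placeholders; substitute your own values.
snapshot = spark.sql("""
    SELECT *
    FROM silver_jobs TIMESTAMP AS OF '2025-11-24 07:30:00'
""")

print(snapshot.count())           # 0 here would suggest the table looked empty
snapshot.show(5, truncate=False)  # otherwise, inspect which rows were visible
```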
Anyway, I hope someone else can make sense of this and explain why this is happening.
Or I guess you will need to create a support ticket.
If you have thought about a workaround, i.e. inserting a Wait activity, I guess I would have tried that.
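As a variation on a fixed Wait, one idea could be a small notebook activity between the steps that polls the Silver table's Delta history and only completes once a commit newer than a cut-off time has appeared. A rough sketch under those assumptions; the table name, cut-off and timings are placeholders, and it only proves the commit is in the Delta log, not that every downstream reader sees it yet:

```python
# Rough sketch: wait until the Silver table has a Delta commit newer than a
# cut-off time, then let the pipeline continue to the SilverToGold step.
# Table name, cut-off and timings are placeholders; timestamps assumed UTC.
import time
from datetime import datetime

TABLE = "silver_jobs"                      # placeholder Silver table name
CUTOFF = datetime(2025, 11, 24, 7, 0, 0)   # e.g. the pipeline trigger time
MAX_ATTEMPTS, SLEEP_SECONDS = 20, 15

for _ in range(MAX_ATTEMPTS):
    # DESCRIBE HISTORY returns the most recent commit first.
    latest = spark.sql(f"DESCRIBE HISTORY {TABLE} LIMIT 1").collect()[0]
    if latest["timestamp"] >= CUTOFF:
        print(f"Found commit {latest['version']} at {latest['timestamp']}; continuing.")
        break
    time.sleep(SLEEP_SECONDS)
else:
    raise TimeoutError(f"No new commit on {TABLE} within the polling window.")
```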
The dataflows that populate Gold do so on a replace basis. This morning they definitely found old data rather than empty tables: the report, and the Gold lakehouse, showed data from Friday, when the pipeline last ran.