EcurieArgyll
Helper I

Lakehouse data not keeping pace with pipeline steps

This morning the business contacted me to advise that a report had not refreshed. I dived in and checked all the run records: everything ran on time, no errors.

Upon further investigation, the data had been fetched from the source system (on-premises SQL Server) into my Bronze lakehouse and had made it to the Silver Lakehouse, but the refresh into the Gold lakehouse had not picked up the new data. It had run and reported no errors, but the Silver Lakehouse had somehow not been up to date at the time that the extract into Gold had executed.

It was as if a transaction had not committed.

By the time I dived in, the Silver-to-Gold dataflow was seeing the correct data, so all I had to do was refresh a few dataflows, but I do not understand how this could happen.

To be clear, these are Dataflow Gen2 steps against Lakehouses in a single data pipeline. Each step runs on success of the previous one. So, for example, the Bronze-to-Silver Dataflow Gen2 runs, works, and completes with success, but the subsequent Silver-to-Gold Dataflow Gen2, on this occasion, did not see the up-to-date data. It did see it some minutes later.

How can this be? How can the pipeline initiate the execution of the subsequent step on-success before the data is "settled", for want of a better term, in the lakehouse?

This is an infrequent issue but enough to worry the business.

Do I have a fundamental misunderstanding of how these artefacts work?

23 REPLIES
smeetsh
Resolver I

Hi All,

 

An update from my side. This problem is still ongoing, and I get the feeling it won't be fixed anytime soon. I have now incorporated a 5-minute (!) wait before the next step; even with our limited datasets, 2 minutes wasn't always enough.

After months of being bounced around by first-line support, I finally got to speak to someone in what was described to me as "the product team". This person seemed to understand the issue and surmised that the data itself was being written promptly, but that the delta log, i.e. the metadata that tells reading processes which files/versions to read, was the source of the delay.
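
To make that concrete: the delta log is just a folder of numbered JSON commit files sitting alongside the table's Parquet data. A minimal notebook sketch to look at it, assuming a default lakehouse is attached to the notebook and using a hypothetical table name (mssparkutils is pre-installed in Fabric notebooks):

from notebookutils import mssparkutils

# Hypothetical table; "Tables/<table>/_delta_log" is resolved against the
# default lakehouse attached to the notebook.
delta_log_path = "Tables/Dim_Date/_delta_log"

# Each numbered .json file is one committed transaction; readers use these
# to decide which Parquet files/versions make up the current table state.
for f in mssparkutils.fs.ls(delta_log_path):
    print(f.name, f.size)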

I left that Teams call with a sense of hope, but after a few days I got a request from first-line support that felt a little like support tennis. I haven't given up yet.

 

Anecdotally (and probably as a way of cursing the next run), the issue has been far less prevalent for me in the last couple of weeks.

 

The other long-standing support call I have open, where writing handfuls of records consumes mountains of CU, is still ongoing. The current position is that an upgrade to the lakehouse writing process is due to be released soon, which I am hoping could positively affect this issue too, but I have little or no evidence to support that hope.

I have had the pleasure of talking to "the product team"; they were very understanding and aware of the issue ... but that was well over half a year ago, and the September fix never materialised. I wish there was an easy way, using a notebook, to just refresh the SQL endpoint, but I have yet to find one. The 5-minute manual delay is not a big thing, but I determined it by trial and error, and who knows what happens if the datasets get larger and/or more complex.
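
One idea I have been toying with, instead of a blind wait: have a notebook activity poll the SQL endpoint until the new data is actually visible, and only then let the pipeline continue. A rough, hypothetical sketch: the connection string, database, table and load_ts watermark column are all placeholders, it assumes pyodbc and ODBC Driver 18 are available where it runs, and the authentication method would need to match your environment:

import time
from datetime import datetime, timedelta
import pyodbc

# Placeholder connection to the Silver lakehouse's SQL analytics endpoint.
CONN_STR = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<sql-connection-string>.datawarehouse.fabric.microsoft.com;"
    "Database=SilverLakehouse;"
    "Authentication=ActiveDirectoryInteractive;"  # placeholder auth method
    "Encrypt=yes;"
)

# The timestamp the latest load should have reached (load_ts assumed to be UTC).
EXPECTED_WATERMARK = datetime.utcnow() - timedelta(minutes=30)

def endpoint_is_current() -> bool:
    # load_ts is a hypothetical load-timestamp column on the Silver table.
    with pyodbc.connect(CONN_STR) as conn:
        row = conn.execute("SELECT MAX(load_ts) FROM dbo.fact_jobs").fetchone()
        return row[0] is not None and row[0] >= EXPECTED_WATERMARK

deadline = time.monotonic() + 10 * 60       # give up after 10 minutes
while not endpoint_is_current():
    if time.monotonic() > deadline:
        raise TimeoutError("SQL endpoint is still not showing the new data")
    time.sleep(30)                          # poll every 30 seconds

print("New data is visible via the SQL endpoint; safe to run the next step")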

smeetsh
Resolver I

I am not sure if this is related, but at the moment there is a delay between data entering a lakehouse and it being represented in its SQL endpoint. (https://learn.microsoft.com/en-us/fabric/data-warehouse/sql-analytics-endpoint-performance)

This seems to have been introduced in last month's update. It can take up to a minute or more for changes to be reflected.

Thanks for this. On the face of it, I believe I am reading directly from the Lakehouse and not the endpoint, but if, under the hood, DataFlow Gen2 does indeed read via the SQL endpoint or at least suffers in the same way, then I guess that is an explanation.

I am still worried about this. I have an incredibly small amount of data. The whole workspace is less than 20 GB (last time I looked), including all three lakehouses, so the actual working data is far less than that.

The business generates hundreds or low thousands of records a day, and yet I seem to be up against issues related to load.

 

I have implemented Wait activities (75 seconds) between the writes and the associated reads, and the problem has not recurred. My support case with Microsoft is ongoing, but no explanation has been offered to date.

I am having an identical experience to the one you describe in this thread. Before finding it, I had added a 60-second delay in my pipeline; the issue still happens, but less frequently now.

Did you find a better solution, or has the 75-second delay solved it for you?

Apologies for not responding sooner. Further apologies for not having much positive to say. This issue is ongoing for me. I have had an open support call with Microsoft for many months now; they recently said that they were able to reproduce the issue, but then it all went quiet (again).

I keep increasing the Wait periods, with some of them now in excess of 2 minutes. 

One that exhibited the problem this morning had a period of 99 seconds, so I have now extended that.

I have another long-running support issue with Microsoft where my CU consumption is suspiciously high (e.g. writing 20 records into a table consumes about half the CU of an F2 SKU). They have suggested that this is due to having too many versions of the table and that a fix is due. As a result, I am regularly performing the manual maintenance tasks on the affected tables. This slightly improves the CU consumption, and I believe (anecdotal feeling only) that the issues are related; i.e. if I am on top of the table maintenance process (including the VACUUM option) then the uncommitted-write issue seems less frequent.
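
For reference, the maintenance I run is roughly the following in a notebook. OPTIMIZE and VACUUM are standard Delta Lake operations; the table name here is hypothetical and the 168-hour retention shown is simply the Delta default, not a recommendation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact many small Parquet files into fewer, larger ones for one table
# ("silver_lakehouse.fact_jobs" is a hypothetical name).
spark.sql("OPTIMIZE silver_lakehouse.fact_jobs")

# Remove files that are no longer referenced by the delta log and are older
# than the retention window (168 hours = 7 days, the Delta default).
spark.sql("VACUUM silver_lakehouse.fact_jobs RETAIN 168 HOURS")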

I assume the inbound fix is a fairly significant change to how lakehouse tables are written to, so I am holding out a little hope that it has an effect on this issue too.

"but if, under the hood, DataFlow Gen2 does indeed read via the SQL endpoint or at least suffers in the same way, then I guess that is an explanation."

 

I have now tested this by creating a Dataflow Gen2 which reads from a Lakehouse table.

 

I can then find the query generated by the Dataflow Gen2 in the SQL Analytics Endpoint's queryinsights.exec_requests_history view.

 

So I'm quite convinced that the Dataflow Gen2 is actually using the SQL Analytics Endpoint under the hood.
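
If anyone wants to repeat the check, here is a rough sketch of querying queryinsights from a client; the connection details are placeholders, it assumes pyodbc and ODBC Driver 18, and the start_time and command column names are taken from the queryinsights documentation as I recall it:

import pyodbc

# Placeholder connection to the Lakehouse's SQL analytics endpoint.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<sql-connection-string>.datawarehouse.fabric.microsoft.com;"
    "Database=TestLakehouse;"
    "Authentication=ActiveDirectoryInteractive;"  # placeholder auth method
    "Encrypt=yes;"
)

# Look for recent statements that touched the table the dataflow reads
# (Dim_Date, the table used in the M samples below).
rows = conn.execute(
    "SELECT TOP (20) start_time, command "
    "FROM queryinsights.exec_requests_history "
    "WHERE command LIKE '%Dim_Date%' "
    "ORDER BY start_time DESC"
).fetchall()

for start_time, command in rows:
    print(start_time, command[:120])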

 

Another option is to use the ADLS connector. However, I don't think query folding is supported in that case.

 

I'm not sure whether any ADLS read transaction costs would be incurred when using the ADLS connector. I'm guessing not, because the data is logically stored in OneLake, but I don't know 100% for sure.

chinmayphadke
New Member

Dataflow Gen2 is giving us sleepless nights due to too many bugs. We have faced the scenario you mentioned, in which a table created and populated by one Dataflow Gen2 is not recognized by another. We have also seen that changes made by a notebook are not recognized by a Dataflow Gen2 immediately (it sometimes takes 30 minutes), and vice versa.

 

Dataflow Gen2, which is a UI wizard for transformations, is far from mature. For new users, my suggestion is to learn SQL and/or PySpark and use notebooks too. You may need some initial time to learn to code, but the effort is worth it compared with relying on Dataflow Gen2.
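
For illustration, a minimal sketch of what a notebook-based Bronze-to-Silver step could look like; the table names and the transformation are hypothetical, and it assumes the relevant lakehouses are attached to the notebook:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the raw Bronze table (hypothetical name).
bronze = spark.read.table("bronze_lakehouse.jobs_raw")

# A trivial example transformation: deduplicate and stamp the load time.
silver = (
    bronze
    .dropDuplicates(["job_id"])
    .withColumn("load_ts", F.current_timestamp())
)

# Overwrite the Silver table; saveAsTable returns only after the delta
# commit completes, so the notebook activity finishes after the write is durable.
silver.write.mode("overwrite").saveAsTable("silver_lakehouse.jobs")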

Thanks. I agree that PySpark and SQL are probably a more reliable route, but I am trying to make this project as low-code as possible. I have swallowed the Microsoft Kool-Aid on this one and have gone all draggy-droppy, Lakehouse-only, Gen2, Direct Lake, in order to make it as maintainable as possible by anyone who might follow me with less coding experience. (I am an old client-server SQL dev dinosaur!)

I'm going to persist with this and hope that Microsoft keep improving the product and my patience with work-arounds doesn't run out.

frithjof_v
Super User

Is your Silver-to-Gold dataflow gen2 sourcing its data from the SQL Analytics Endpoint of the Silver Lakehouse?

 

There could be a delay between data entering the Lakehouse and it becoming available in the SQL Analytics Endpoint:

Solved: Delayed data refresh in SQL Analytical Endpoint - Microsoft Fabric Community

Solved: SQL Endpoint Slow To Reflect Changes In Lakehouse - Microsoft Fabric Community 

No... as far as I can tell. I will have created these all going "route 1" with the wizard, selecting sources of type Lakehouse as per the image here. I certainly didn't end up pasting in a URL or anything like that. Is that how I would go via the endpoint (out of interest)?

 

[image: selecting a source of type Lakehouse in the Dataflow Gen2 wizard]

Hm... I guess that's not the reason, then.

 

If the first lines in the M code (Advanced editor) of your query look like this:

 

 

 

let
  Source = Lakehouse.Contents(null){[workspaceId = "..................."]}[Data]{[lakehouseId = "............."]}[Data],
  #"Navigation 1" = Source{[Id = "Dim_Date", ItemKind = "Table"]}[Data],
...............
...............
...............

 

 

 

or this:

 

 

 

let
  Source = Lakehouse.Contents([]),
  #"Navigation 1" = Source{[workspaceId = "................................"]}[Data],
  #"Navigation 2" = #"Navigation 1"{[lakehouseId = "................................."]}[Data],
  #"Navigation 3" = #"Navigation 2"{[Id = "Dim_Date", ItemKind = "Table"]}[Data],
...............
...............
...............
...............

 

 

 

then I don't think you are using the SQL Analytics Endpoint. [Edit: I just learned that this function also uses the SQL Analytics Endpoint. See next comment. However I have not updated the rest of this comment or other comments in this thread.]

 

 

If you had used the SQL Analytics Endpoint, I think it would have looked like this:

 

 

 

let
  Source = Sql.Databases("<sql-connection-string>.datawarehouse.fabric.microsoft.com"),
  #"Navigation 1" = Source{[Name = "TestLakehouse"]}[Data],
  #"Navigation 2" = #"Navigation 1"{[Schema = "dbo", Item = "Dim_Date"]}[Data],
.............
.............
.............
.............

 

 

 

 

I tried to find out how to connect to the Lakehouse SQL Analytics Endpoint from a Dataflow Gen2, and it seems you would have had to use the SQL Server database or Azure SQL database connector to do that. So it's unlikely that would happen unless you tried to connect to the SQL Analytics Endpoint on purpose. (In Power BI Desktop, which I am used to, the user interface for connecting to SQL Analytics Endpoint is a bit different and more similar to connecting to the Lakehouse tables.)

 

That being said, I am not 100% sure that the first two samples here don't connect to the SQL Analytics Endpoint.

But I am at least sure that the third sample does connect to the SQL Analytics Endpoint.

 

Anyway, perhaps the cause of your issue is something else.

Update: I just learned that Lakehouse.Contents() does actually use the SQL Analytics Endpoint.

 

https://www.reddit.com/r/MicrosoftFabric/s/BgkVtXCemD

 

https://www.reddit.com/r/MicrosoftFabric/s/p1dXKa4Tf0

Thanks, but my M code looks like your earlier examples: lakehouseId = "...", ItemKind = "Table", etc.

These are extremely small amounts of data, looking at jobs booked for a day (low hundreds) and which vehicles or drivers may be unavailable (less than 10), so I'm confident I'm not stressing anything. 

 

I don't have other suggestions for what may be the cause.

 

I haven't got a lot of experience with data pipelines with multiple dataflows gen2 in series.

But I think it sounds like a normal way to orchestrate the data transformations through the medallion structure.

So I share your concern about this.

 

Have you been able to reproduce the issue or has it just been a one-off? Maybe it could be solved by using a wait activity in the pipeline. Then again, I don't know how many seconds/minutes to wait.

 

Hope someone else can help on this topic! 

It isn't a one-off but I can't systematically reproduce it. It is infrequent.

How long to wait would indeed be the question.

I'm very grateful for your input and like you I hope we see some more opinions on this.

Btw - and I'm not sure if this has anything to do with it - but is the dataflow making any schema changes to your table (adding, renaming or removing columns), or is it only updating the data?

 

And what settings do you have here:

 

[screenshot: data destination settings in the dataflow]

 

 

[screenshot: data destination settings in the dataflow, continued]

 

 

Maybe some of these settings impact how long it takes before the table is ready for querying in the Lakehouse.

 

(However, I think the data pipeline should not mark an activity as completed before the table is ready to be queried by the next step.)

 

 

I'm not making any schema changes in these dataflows. The settings are "Auto mapping and allow for schema change", but the columns are static. I have noticed, during development, that metadata changes can take some persuasion to make it to all the parts they need to reach, so I have avoided being that clever in normal production thus far.

 

I see.

 

So this one: https://learn.microsoft.com/en-us/fabric/data-factory/dataflow-gen2-data-destinations-and-managed-se...

 

[screenshot: data destination managed settings, from the docs]

 

The docs say that the table is being dropped and recreated on each dataflow refresh.

I did a test of that, and I don't think the table is actually being dropped and recreated. I described it here: Re: Dataflow Gen2 - Table not getting dropped and ... - Microsoft Fabric Community

 

Have you tried using a Notebook to check the version history of your table, to see what time the ReplaceTable and Update operations took place? If there are any deviations in the timestamps around the time your dataflow refresh ran, they might help in understanding the issue that occurred.

[screenshot: table version history output]
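
For example, something along these lines in a notebook; the table name is hypothetical, and it assumes the Gold lakehouse is attached:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Gold table; replace with the table your dataflow writes to.
history = spark.sql("DESCRIBE HISTORY gold_lakehouse.fact_jobs")

# Each row is one commit: version, timestamp and operation (e.g. ReplaceTable
# or Update, as in the screenshot above), which you can line up against the
# pipeline's activity start/end times.
history.select("version", "timestamp", "operation").show(20, truncate=False)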

 

 
