This morning I was contacted by the business advising me that a report had not refreshed. I dived in and checked all the run records. Everything ran on time. No errors.
Upon further investigation, the data had been fetched from the source system (an on-premises SQL Server) into my Bronze lakehouse and had made it to the Silver lakehouse, but the refresh into the Gold lakehouse had not picked up the new data. That step had run and reported no errors, yet the Silver lakehouse had somehow not been up to date at the time the extract into Gold executed.
It was as if a transaction had not committed.
By the time I dived in, the Silver-to-Gold dataflow was seeing the correct data, so all I had to do was refresh a few dataflows, but I do not understand how this could have happened.
To be clear, these are Dataflow Gen2 steps against Lakehouses in a single data pipeline. Each step runs on success of the previous one. So, for example, the Bronze-to-Silver Dataflow Gen2 runs, works and completes with success, but the subsequent Silver-to-Gold Dataflow Gen2, on this occasion, did not see the up-to-date data. It did see it some minutes later.
How can this be? How can the pipeline initiate the execution of the subsequent step on-success before the data is "settled" for want of a better term, in the lakehouse?
This is an infrequent issue but enough to worry the business.
Do I have a fundamental misunderstanding of how these artefacts work?
I am not sure if this is related, but at the moment there is a delay between data entering a lakehouse and it being represented in its SQL endpoint. (https://learn.microsoft.com/en-us/fabric/data-warehouse/sql-analytics-endpoint-performance)
This seems to have been introduced in last month's update. It can take up to a minute or more for changes to be reflected.
Thanks for this. On the face of it, I believe I am reading directly from the Lakehouse and not the endpoint, but if, under the hood, DataFlow Gen2 does indeed read via the SQL endpoint or at least suffers in the same way, then I guess that is an explanation.
I am still worried about this. I have an incredibly small amount of data. The whole workspace is less than 20 GB (last time I looked), including all three lakehouses, so the actual working data is far less than that.
The business generates hundreds or low thousands of records a day and yet I seem to be up against issues related to load.
I have implemented Wait activities (75 seconds) between the writes and the associated reads, and the problem has not recurred since. My support case with Microsoft is ongoing, but no explanation has been offered to date.
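If a fixed Wait ever stops being enough, one less arbitrary option is a short notebook step between the write and the read that polls the Silver table's Delta log until a fresh commit appears. This is only a sketch under assumptions: the table name, timeout and pipeline start time are made up, `spark` is the Fabric notebook's built-in session, and the check only proves the Delta commit has landed; if the downstream dataflow reads via the SQL analytics endpoint, that endpoint's metadata sync could still lag behind.

# Hypothetical sketch: a notebook activity placed between the Silver write and the
# Silver-to-Gold dataflow. It polls the Delta log of a Silver table until it sees a
# commit made after the pipeline started, instead of sleeping a fixed 75 seconds.
import time
from datetime import datetime, timezone
from delta.tables import DeltaTable

TABLE_NAME = "silver_jobs"                                        # placeholder Silver table
pipeline_start = datetime(2025, 1, 6, 6, 0, tzinfo=timezone.utc)  # placeholder; pass in as a notebook parameter
timeout_s, poll_s = 300, 15

deadline = time.time() + timeout_s
while True:
    # Most recent entry in the table's Delta transaction log.
    last_commit = DeltaTable.forName(spark, TABLE_NAME).history(1).collect()[0]["timestamp"]
    if last_commit.replace(tzinfo=timezone.utc) >= pipeline_start:   # assumes the session time zone is UTC
        print(f"Commit at {last_commit} observed; continuing.")
        break
    if time.time() > deadline:
        raise TimeoutError(f"No new commit on {TABLE_NAME} within {timeout_s}s")
    time.sleep(poll_s)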
I am having an identical experience to the one you describe in this thread. Before finding this thread, I added a 60-second delay in my pipeline, but the issue still happens, just less frequently now.
Did you find a better solution or has the 75 seconds delay solved it for you?
"but if, under the hood, DataFlow Gen2 does indeed read via the SQL endpoint or at least suffers in the same way, then I guess that is an explanation."
I tested now to create a Dataflow Gen2 which reads from a Lakehouse table.
I can then find the query generated by the Dataflow Gen2 in the SQL Analytics Endpoint's queryinsights.exec_requests_history view.
So I'm quite convinced that the Dataflow Gen2 is actually using the SQL Analytics Endpoint under the hood.
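For anyone who wants to repeat that check, here is a rough Python sketch using pyodbc against the SQL analytics endpoint. Treat the connection-string format and the queryinsights column names as assumptions from memory of the docs rather than gospel; the <sql-connection-string> host and database name are placeholders you would copy from the endpoint's settings page.

# Hypothetical sketch: list the most recent statements executed by the SQL analytics
# endpoint, to see whether a Dataflow Gen2 refresh shows up in queryinsights.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<sql-connection-string>.datawarehouse.fabric.microsoft.com;"  # placeholder host
    "Database=SilverLakehouse;"                                           # placeholder endpoint name
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

rows = conn.cursor().execute(
    """
    SELECT TOP 20 start_time, status, command
    FROM queryinsights.exec_requests_history
    ORDER BY start_time DESC
    """
).fetchall()

for start_time, status, command in rows:
    print(start_time, status, command[:120])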
Another option is to use the ADLS connector. However, I don't think query folding is supported in that case.
I'm not sure whether any ADLS read transaction costs would be incurred when using the ADLS connector. I'm guessing not, because the data is logically stored in OneLake, but I don't know for sure.
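For comparison, the Spark-side way to bypass the endpoint entirely is to read the Delta files straight from OneLake. A minimal sketch, with placeholder workspace, lakehouse and table names (the abfss URI pattern is the documented OneLake one):

# Hypothetical sketch: read a Lakehouse table directly from its Delta files in OneLake,
# without going through the SQL analytics endpoint at all.
path = (
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
    "SilverLakehouse.Lakehouse/Tables/Dim_Date"   # placeholder workspace/lakehouse/table
)
df = spark.read.format("delta").load(path)        # `spark` is the Fabric notebook session
df.show(10)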
Dataflow Gen2 is giving us sleepless nights due to far too many bugs. We have faced the scenario you mention, in which a table created and populated by one Dataflow Gen2 is not recognized by another. We have also seen that changes made by a notebook are not recognized by a Dataflow Gen2 immediately (sometimes it takes 30 minutes), and vice versa.
Dataflow Gen2, which is a UI wizard for transformations, is far from mature. For new users, my suggestion is to learn SQL and/or PySpark and use notebooks instead. You may need some initial time to learn to code, but the effort is worth it compared to relying on Dataflow Gen2.
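To make that suggestion concrete, here is a minimal sketch of what a Silver-to-Gold step could look like as a PySpark notebook instead of a Dataflow Gen2. The table and column names are invented for illustration only:

# Hypothetical sketch of a Silver-to-Gold transformation done in a notebook.
from pyspark.sql import functions as F

silver = spark.read.table("silver_jobs")   # placeholder Silver table in the attached lakehouse

gold = (
    silver
    .filter(F.col("status") == "Booked")
    .groupBy("booking_date", "vehicle_id")
    .agg(F.count("*").alias("jobs_booked"))
)

# Replace the Gold table on each run, mirroring a dataflow's "replace" behaviour.
gold.write.mode("overwrite").saveAsTable("gold_jobs_per_day")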
Thanks. I agree that PySpark and SQL are probably a more reliable route, but I am trying to make this project as low-code as possible. I have swallowed the Microsoft Kool-Aid on this one and have gone all draggy-droppy, Lakehouse-only, Gen2, Direct Lake, in order to make it as maintainable as possible by anyone who might follow me with less coding experience. (I am an old client-server SQL dev dinosaur!)
I'm going to persist with this and hope that Microsoft keeps improving the product and my patience with workarounds doesn't run out.
Is your Silver-to-Gold dataflow gen2 sourcing its data from the SQL Analytics Endpoint of the Silver Lakehouse?
There could be a delay between when data enters the Lakehouse and when it is available in the SQL Analytics Endpoint:
Solved: Delayed data refresh in SQL Analytical Endpoint - Microsoft Fabric Community
Solved: SQL Endpoint Slow To Reflect Changes In Lakehouse - Microsoft Fabric Community
No... as far as I can tell. I will have created these all going "route 1" with the wizard, selecting sources of type Lakehouse as per the image here. I certainly didn't end up pasting in a URL or anything like that. Is that how I would go via the endpoint (out of interest)?
Hm... I guess that's not the reason, then.
If the first lines in the M code (Advanced editor) of your query look like this:
let
    Source = Lakehouse.Contents(null){[workspaceId = "..................."]}[Data]{[lakehouseId = "............."]}[Data],
    #"Navigation 1" = Source{[Id = "Dim_Date", ItemKind = "Table"]}[Data],
    ...
or this:
let
    Source = Lakehouse.Contents([]),
    #"Navigation 1" = Source{[workspaceId = "................................"]}[Data],
    #"Navigation 2" = #"Navigation 1"{[lakehouseId = "................................."]}[Data],
    #"Navigation 3" = #"Navigation 2"{[Id = "Dim_Date", ItemKind = "Table"]}[Data],
    ...
then I don't think you are using the SQL Analytics Endpoint. [Edit: I just learned that this function also uses the SQL Analytics Endpoint. See next comment. However I have not updated the rest of this comment or other comments in this thread.]
If you had used the SQL Analytics Endpoint, I think it would have looked like this:
let
    Source = Sql.Databases("<sql-connection-string>.datawarehouse.fabric.microsoft.com"),
    #"Navigation 1" = Source{[Name = "TestLakehouse"]}[Data],
    #"Navigation 2" = #"Navigation 1"{[Schema = "dbo", Item = "Dim_Date"]}[Data],
    ...
I tried to find out how to connect to the Lakehouse SQL Analytics Endpoint from a Dataflow Gen2, and it seems you would have had to use the SQL Server database or Azure SQL database connector to do that. So it's unlikely that would happen unless you tried to connect to the SQL Analytics Endpoint on purpose. (In Power BI Desktop, which I am used to, the user interface for connecting to SQL Analytics Endpoint is a bit different and more similar to connecting to the Lakehouse tables.)
That being said, I am not 100% sure that the first two samples here don't connect to the SQL Analytics Endpoint.
But I am at least sure that the third sample does connect to the SQL Analytics Endpoint.
Anyway, perhaps the cause of your issue is something else.
Update: I just learned that the Lakehouse.Contents() does actually use the SQL Analytics Endpoint.
https://www.reddit.com/r/MicrosoftFabric/s/BgkVtXCemD
Thanks, but my M code looks like your earlier examples: lakehouseId = "...", ItemKind = "Table", etc.
These are extremely small amounts of data, looking at jobs booked for a day (low hundreds) and which vehicles or drivers may be unavailable (less than 10), so I'm confident I'm not stressing anything.
I don't have other suggestions for what may be the cause.
I haven't got a lot of experience with data pipelines with multiple dataflows gen2 in series.
But I think it sounds like a normal way to orchestrate the data transformations through the medallion structure.
So I share your concern about this.
Have you been able to reproduce the issue or has it just been a one-off? Maybe it could be solved by using a wait activity in the pipeline. Then again, I don't know how many seconds/minutes to wait.
Hope someone else can help on this topic!
It isn't a one-off but I can't systematically reproduce it. It is infrequent.
How long to wait would indeed be the question.
I'm very grateful for your input and like you I hope we see some more opinions on this.
Btw - and I'm not sure if this has anything to do with it - but is the dataflow doing any table schema changes on your table (adding, renaming or removing columns), or is it only updating the data?
And what settings do you have here:
Maybe some of these settings impact how long it takes before the table is ready for querying in the Lakehouse.
(However, I still think the data pipeline should not mark an activity as completed before the table is ready to be queried by the next step.)
I'm not making any schema changes in these dataflows. The settings are "Auto mapping and allow for schema change", but the columns are static. I have noticed, during development, that metadata changes can take some persuasion to make it to all the parts they need to reach so I have avoided being that clever in normal production thus far.
I see.
So this one: https://learn.microsoft.com/en-us/fabric/data-factory/dataflow-gen2-data-destinations-and-managed-se...
The docs say that the table is being dropped and recreated on each dataflow refresh.
I did a test about that and I don't think the table is actually being dropped and recreated. I described it here: Re: Dataflow Gen2 - Table not getting dropped and ... - Microsoft Fabric Community
Have you tried using a Notebook to check the version history of your table, to see what time the ReplaceTable and Update operations took place? If there are any deviations in the timestamps around the time your dataflow refresh ran, they might help in understanding the issue that occurred.
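A minimal sketch of that check, assuming the Silver lakehouse is attached as the notebook's default lakehouse and using a made-up table name:

# Hypothetical sketch: inspect the Delta transaction log of a Silver table to see
# when each ReplaceTable / Write / Delete operation was committed.
history = spark.sql("DESCRIBE HISTORY silver_jobs")   # placeholder table name

(history
 .select("version", "timestamp", "operation", "operationMetrics")
 .orderBy("version", ascending=False)
 .show(20, truncate=False))

Comparing those timestamps with the pipeline activity start times is exactly the deviation check described above.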
Thanks for that. Very useful.
More than one pipeline "failed" in this way today.
I have realised that one of the misbehaving operations is actually a notebook that deletes clashing incoming IDs from the target in Silver, followed by a simple append from Bronze to Silver using a pipeline Copy activity, so the table wouldn't be recreated. In any case, the timestamp indicates that the update finished just under a minute before the subsequent dataflow ran and didn't find the data in Silver.
Another one is a Dataflow Gen2 performing a replace. I have checked both using the history command you kindly pointed out, and in both cases the write finished just under one minute before the subsequent step that didn't read the new data. Both wrote into the Silver lakehouse in their different ways. The simpler dataflow replace operation was the second to run, but both sets of new data (in different tables, one appended and one replaced) were unseen by the dataflows that subsequently ran to pick up from Silver and move to Gold.
It is as if the Silver lakehouse just didn't want to be rushed into giving up its new data quite so soon.
Interesting!
Then I don't understand why the SilverToGold Dataflow Gen2 fails to pick up the new data. According to the timings which you mention, the new data should already be in Silver almost one minute before running the SilverToGold Dataflow Gen2, right?
Does the SilverToGold Dataflow Gen2 replace or append data into Gold?
In the "failed" runs, does the SilverToGold Dataflow Gen2 pick up data (i.e. old data) from the silver table, or doesn't it pick up any data from the silver table (i.e. it thinks the table is empty)?
I.e. does it seem like the SilverToGold dataflow sees the old data in the Silver table, or does it seem like the SilverToGold sees an empty table when it looks at the Silver table?
Anyway, I hope someone else can make sense of this and explain why this is happening.
Or I guess you will need to create a support ticket.
If you are thinking about a workaround, such as inserting a Wait activity, I guess I would try that.
The dataflows that populate Gold do so on a replace basis. This morning they definitely found old data rather than empty tables: the report (and the Gold lakehouse) showed data from Friday, when the pipeline last ran.