I'm having an unusual problem with dataflow GEN2.
I have several tables that are configured with "enable staging".
After the individual tables are loaded, they are combined into a final table which is where our problem appears.
I have not configured a target destination for the dataflow. I'm assuming that the default storage format is being used internally (parquet or similar).
Here are the tables that are successfully loaded with "enable staging" (four of them). The behavior in the preview (whether nulls are presented from a table) is the same as the behavior elsewhere. The four tables are good and work as expected....
... however the final step is to combine all of these together into one final mega-table. That is the table that becomes corrupted. (Red underlines show nulls; the nulls are NOT found in any of the source tables.)
Here is the code that combines the tables together.
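The screenshot of the step isn't reproduced here, so below is a minimal sketch of what this kind of append step looks like in M. The query names Y1 through Y4 are illustrative placeholders for the four staged tables, not the actual names in the dataflow.

```
// Illustrative sketch only: append the four staged yearly queries into one table.
// Y1, Y2, Y3 and Y4 stand in for the actual staged query names in this dataflow.
let
    Source = Table.Combine({Y1, Y2, Y3, Y4})
in
    Source
```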
Please let me know if anyone has seen this behavior where nulls are generated when combining data together. The four individual tables have about 2 million rows each. The final combined version of the data should be 4x that size.
Solved! Go to Solution.
For folks who want to avoid a table full of nulls, I asked the dataflow PG how to prevent data from getting stored in a DW table.
The way you do this is by using the function "Table.StopFolding" as the last step in a GEN2 dataflow, when preparing an entity.
That trick will force the data to remain in the "DataflowsStagingLakehouse" (parquet), and will never live in a DW table. It avoids all bugs related to DW tables. It may also be less expensive to use this for long-term storage and infrequent access. If you are only using the data for performing full table scans (filtered on one or two columns), then performance may not be that degraded as compared to a DW table. I am convinced that DW tables are probably overkill for most use-cases related to dataflows. And they are certainly not worth the risk of being presented with a bunch of null data!
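As a rough sketch (the Y1..Y4 query names are placeholders, not the real entity names), the last step of the entity ends up looking like this:

```
// Append the staged queries, then call Table.StopFolding as the final step so the
// result stays in the DataflowsStagingLakehouse instead of a staging DW table.
let
    Combined = Table.Combine({Y1, Y2, Y3, Y4}),
    Output = Table.StopFolding(Combined)
in
    Output
```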
Below you can see that the same dataflow which used to send a table to the DW is now keeping everything inside the LH.
I hope this helps anyone who encounters a table full of nulls in Fabric.
You might also refer to the ICM attached to this SR: TrackingID#2504100040007682
In that ICM I was able to reach out to at least two of the Microsoft PGs and explained that customers are being impacted in a consistent and repeating way. I think they were already aware to some degree, but perhaps they have not introduced any telemetry or monitoring for the problem, so they don't know how serious it is yet.
I am pretty content with the workaround, given that I don't actually interact with Fabric DW directly for my own solutions. I only encountered this bug indirectly by way of the internal implementation of dataflow GEN2. If others aren't very happy with the workaround, then I encourage you to spend your own time with Mindtree & Microsoft. I already did my fair share of work on this over the past couple of weeks... I think Microsoft will be more inspired to fix this if others would reach out as well. Please expect at least 20 hours of effort, and let them know there is already an ICM (that will save you another 20 hours).
A known issue was finally published for this bug with nulls being inserted into DW tables.
... according to the article it is a rare issue. However in the reddit community there are currently a half dozen customers who are discussing this problem in several different threads. I'm sure those individuals represent only about 3 to 5% of the people who are impacted.
Hi @dbeavon3,
May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will be helpful for other community members who have similar problems to solve it faster.
Thank you.
Hi @dbeavon3,
Thanks for reaching out to the Microsoft fabric community forum.
One potential cause of null values appearing during the table merge is a data type or schema mismatch across the source tables. If there's a discrepancy in data types (for example, one table has a column as text while another has it as a number), or if a column exists in one table but not in another, Power Query will still perform the Table.Combine, but it will fill in null for the missing columns in each respective table. To prevent this, ensure that all corresponding columns across the source tables have consistent data types and a consistent schema before performing the merge. It’s recommended to explicitly define the data types during the transformation steps to maintain schema uniformity and avoid unexpected results.
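To illustrate this with toy data (these are not the poster's actual tables), a simple column-name mismatch is enough to make Table.Combine pad one side with nulls, and aligning the schema first removes them:

```
let
    // Two small tables whose schemas disagree: the second names its amount column "Amt".
    T1 = #table(type table [Id = Int64.Type, Amount = Currency.Type], {{1, 10.0}, {2, 20.0}}),
    T2 = #table(type table [Id = Int64.Type, Amt = Currency.Type], {{3, 30.0}, {4, 40.0}}),

    // Here "Amount" is null for T2's rows and "Amt" is null for T1's rows.
    Mismatched = Table.Combine({T1, T2}),

    // Renaming (and keeping types consistent) before combining avoids the padded nulls.
    T2Aligned = Table.RenameColumns(T2, {{"Amt", "Amount"}}),
    Combined = Table.Combine({T1, T2Aligned})
in
    Combined
```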
If you find this post helpful, please mark it as an "Accept as Solution" and consider giving a KUDOS. Feel free to reach out if you need further assistance.
Thanks and Regards
Hi @v-nmadadi-msft
I appreciate the comment. We are working with Mindtree on an SR (Tracking ID #2504100040007682). Unfortunately, I do NOT think Microsoft has an ICM for this issue yet, but hopefully the bug will be reported to Microsoft soon. I'm working on that.
I don't think there are any type mismatches... although perhaps it is still possible that the SQL Analytics Engine might be confused about my data format. I suspect you are right that if there is some sort of internal datatype or synch issue, then it originates with the "SQL Analytics Engine". It is not intuitive to the user, and the source of our problem is buried very deeply in the inner workings of the dataflow.
Again here is the step to combine data tables:
This step is similar to what is described in the blog:
https://blog.fabric.microsoft.com/en-us/blog/data-factory-spotlight-dataflows-gen2?ft=All
You can see the red box below, which is where I suspect our data is being replaced by nulls.
Interestingly I think this may be a timing issue.
... If I perform the Table.Combine at a much later time, in a totally different dataflow, then I'm able to get the results appended together in the expected way. In the past I've heard of bugs in the SQL Analytics Engine that are referred to as "synch bugs". While I know that users are often expected to implement workarounds in their python notebooks, I did NOT believe there were cases where dataflow developers were expected to perform additional steps to avoid the "synch bugs".
>> "It’s recommended to explicitly define the data types during the transformation steps to maintain schema uniformity and avoid unexpected results."
I have not ever seen any recommendation like this. Do you have a link? Remember that a large part of what PQ accomplishes is to determine the proper schema that would describe our data. In fact, whenever we publish a dataflow, we must wait a very long time for the "evaluation" phase to execute. It can be a very long wait, and its only goal is to reflect upon the schema that would allow the data to be persisted in storage. If it was the user's job to define the schema, then an "evaluation" phase would be unnecessary.
Hi @miguel or @SidJay
I'm waiting for the MT folks to create an ICM for you. They have agreed that something unexpected/buggy is happening in the SQL Analytics Endpoints for DF GEN2 in the East US region.
Please see the diagram in the discussion above, which shows the type of activity that is failing. We gather large quantities of data by year (Y1, Y2, Y3, Y4) and then combine them together. But the combination always results in nulls. Everything happens sequentially in the same dataflow. It is almost exactly what is described in the blog post linked earlier: https://blog.fabric.microsoft.com/en-us/blog/data-factory-spotlight-dataflows-gen2?ft=All
Everything falls over when appending the transactions together via Table.Combine. The SQL AE engine seems to be confused, or is not properly synched. Maybe the "DataflowsStagingLakehouse" is buggy, or having a hard time keeping up, or is suffering from a recent regression?
If we have trivially small amounts of data then things work. But when we get above 5 million rows or so, then the SQL AE engine (enhanced compute) seems to be overwhelmed.
We are using "default" storage destinations, and keeping things as simple as possible (no fast copy, no concurrent PQ queries). But things are still not working. It may be a timing issue related to hidden work happening internally. Or perhaps there is some sort of maintenance operation that is neglected, like vacuuming.
I would really like to use a profiler tool to see what is going wrong, but I don't think there is any interface to observe the "dataflow staging" resources used for our GEN2 dataflows (in DataflowsStagingLakehouse ) .
I'd really love to get support via MT. In the meantime here are my next steps in this trial-and-error journey:
- Use Fast Copy - I've tried to avoid this since it seems even more complex and introduces even more moving parts. I saw that the feature relies on ADF functionality, and I am even more wary of software components from the ADF team than from Power BI.
- Require Fast Copy - maybe this will help?
- Sprinkle "function.invokeafter" into my dataflows, and make them delay so that the SQL AE engine can keep up. Maybe ten minute delays would help. Unfortunately Microsoft charges us CU's for GEN2 dataflows using wall-time, even when we aren't doing real work. So I want to limit the use of this workaround if possible.
- Stop using the default destination, and start managing my own lakehouse tables. (This is not really my preference. If Microsoft has trouble managing its own "DataflowsStagingLakehouse", then I'm probably not going to do much better if I independently manage my own lakehouse. There is no reason for me to believe it would permanently avoid the same bug.) Perhaps this would be useful for troubleshooting, if I can recreate the bug on a custom lakehouse, and attach a sql profiler tool or run health-monitoring queries, or something.
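For what it's worth, here is a minimal sketch of the Function.InvokeAfter idea from the list above (the ten-minute duration and the Y1..Y4 query names are just placeholders):

```
// Delay the append step so the staging/SQL endpoint has time to catch up.
// Function.InvokeAfter evaluates the supplied function only after the duration elapses.
let
    DelayedCombine = Function.InvokeAfter(
        () => Table.Combine({Y1, Y2, Y3, Y4}),
        #duration(0, 0, 10, 0) // days, hours, minutes, seconds => a ten minute wait
    )
in
    DelayedCombine
```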
I'm a bit surprised there aren't more posts about this issue where nulls appear in the post-staging data for a GEN2 dataflow. I only found a couple other users who mentioned it. But I've had consistent and reproducible issues in the past week or so.
The ICM is created now. It should be attached to the SR:
Tracking ID #2504100040007682
This is a serious issue for us. Maybe it is region-specific? In addition, it also seems to be a regression since things worked fine for many, many months.
Thank you for raising this, @dbeavon3 !
I can see that the issue has reached the PG today as an IcM. Let's continue the investigation and communication over the IcM.
We are monitoring the different channels to see if other users are experiencing anything similar, but the investigation for your specific case is ongoing according to the IcM.
Thank you for your patience!
Previously I suspected that the PG had forgotten to synch the SQL Analytics Endpoint between each staged table that was created in the lakehouse. However, I turned on the "additional logs" in the gateway and it shows the synch happening. They are definitely interacting with the lakehouse to do the required maintenance of the SQL endpoint. At least when things are working properly, we should expect that the SQL endpoint will be kept healthy as well.
The final action to poll for the metadata refresh is like so:
"Action":"Engine/Module/Lakehouse/IO/MetadataRefresh/Poll"
... And it clearly specifies that the SQL Sync State is Success:
"sqlSyncState":"Success"
At least this gives me a bit more confidence. I was nervous that the LH staging operation was not aware of the need to update the SQL AE metadata.
Another update.
I found out that data is internally sent to different locations. Things work fine when data is being stored in "DataflowsStagingLakehouse". However, one of our largest tables in the dataflow holds the combined data from multiple LH tables, and for some reason the data in that table is stored in "DataflowsStagingWarehouse".
I would guess that there is a recent regression. Perhaps the PG recently decided to use the "warehouse" for some types of scenarios, and that is causing problems when large amounts of data need to be loaded. I don't have problems when there are a smaller number of rows. I'm not sure where I crossed a threshold that causes things to stop working properly.
Here is the SSMS presentation that shows some of my tables landing in the lakehouse and others in the warehouse:
It is quite frustrating that the data in my GEN2 dataflow is handled in such different ways, for reasons that aren't explained to us. There is nothing in the GEN2 online editor that indicates these tables will be sent into different types of internal storage.
I suppose that developers who actually care about the storage implementation are expected to configure/customize the default destination. I guess that will be my next step ... especially as long as this bug remains.
Hi dbeavon3,
The product group is actively investigating your issue.
The use of the staging warehouse is not new - it has always existed for combining staged data (both Append and Merge) as well as downstream transformations in general.
We are investigating whether the SQL endpoints or other parts of the stack may have changed behavior and will be sure to keep you posted.
Thanks
Thanks @SidJay
>> The use of the staging warehouse is not new
The bug is not new either. You can find another customer who reported it three months ago. In their case the problem seems to have been a one-time occurrence and they brushed it off a lot more quickly than I would have.
I think the thing that is new is a change in the factors that cause the bug to surface.
As-of this past week, we were observing it in a repeated way. I hoped the nulls would go away, but each day the problem kept repeating!
FYI, as you may know, there is little transparency when working through your Mindtree organization. Those engineers are paralyzed from sharing anything useful about Microsoft's bugs (big or small). I almost end up getting less help than I would by remaining in this community (and Reddit). Sometimes I simply try to find workarounds for Fabric bugs. This may be how your other customers feel as well, and it may explain why so many Fabric bugs don't make their way back to the PG. The support structure itself is counterproductive. Customers would rather feel the pain of the bugs than the pain of this CSS support structure.
@v-nmadadi-msft
Based on your user name, I cannot tell if you are from Microsoft. Are you with Microsoft or Mindtree or another vendor/partner? I'm eager to hear your input about what might be going wrong. Especially if you are an authoritative representative of Microsoft and have first-hand experience with the platform. I believe there is some sort of unexpected regression going on, and I'm hoping you would be able to confirm one way or another.
I am hopeful that the null issue will go away. It might be a rare glitch.
I really would prefer that the dataflow would fail rather than generating bad data. But it will be hard to create a repro if this issue doesn't arise very often.
I found one other post in the community that seems to describe the same problem. A user got a bunch of nulls generated from a GEN2 dataflow, and then the problem went away as suddenly as it appeared.