jurgenp
Regular Visitor

What are the 'DataflowsStagingLakehouse' SQL endpoint and Dataset?

Hi,

 

Every time I create the first Dataflow Gen2 inside a workspace, I also see a SQL endpoint and dataset named 'DataflowsStagingLakehouse' being created. If I delete them, my Dataflows start to fail.

 

I couldn't find any mention of these in the documentation, so I was wondering what their purpose is.

 

Jurgen

 

 

1 ACCEPTED SOLUTION
Anonymous
Not applicable

Hello @jurgenp,

 

July 11th update: We recently posted a new blog detailing Dataflow Gen2 architecture, which covers how these artifacts are used by Dataflows Gen2.

 

Dataflows Gen2, like Dataflows Gen1, leverage staging storage to store the data produced by refreshing queries that are load enabled (the default option). This storage location for Dataflows Gen2 is the "DataflowStagingLakehouse" artifact you mentioned. 

 

Dataflows Gen2, like Dataflows Gen1, also leverage an enhanced compute engine to improve performance of data reads and transformations. This is the DataflowStagingWarehouse. It is leveraged in a few scenarios:

  1. When you get data from a dataflow via the Dataflow connector in another dataflow, Power BI Desktop, Excel, and other Power Query experiences where the Dataflow connector is present.
  2. When you reference a "load enabled" query in the dataflow. The dataflow engine will first stage the data from the first query, and then leverage the compute capabilities of the DataflowStagingWarehouse to transform the data at scale.

 

These artifacts should not be removed. They are essential to Dataflow Gen2's operation.

 

A few comments:

  1. For any workspace with Dataflows Gen2, only a single DataflowStagingLakehouse and DataflowStagingWarehouse artifact will be created; i.e., if you create 10 dataflows, there will still be only one of each.
  2. At the moment, as a side effect, creating the DataflowStagingLakehouse and DataflowStagingWarehouse artifacts provisions a few more artifacts (one SQL endpoint and two datasets; see the sketch after this list for one way to inspect them). This will eventually be addressed.
  3. You can configure a query to not load. An example of where you may use this feature is when you have a small dataset you want to load to an output destination. In that case, you may want to ETL in memory; i.e., the dataflow will extract data from the source, transform it in memory, and load it directly to the data destination. At the moment, at least one query in the dataflow has to be load enabled, but we are planning to remove that restriction.
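
To make comments 1 and 2 concrete, here is a minimal sketch that lists a workspace's items through the Fabric REST API and picks out the staging artifacts. The List Items endpoint is documented, but the workspace ID and token below are placeholders, token acquisition is omitted, and filtering on item display names is an assumption; treat this as an illustration rather than official guidance.

import requests

# Placeholders: supply your own workspace ID and an Azure AD access token
# scoped to https://api.fabric.microsoft.com/.default (acquisition omitted).
WORKSPACE_ID = "<your-workspace-id>"
ACCESS_TOKEN = "<your-access-token>"

# List Items is a documented Fabric REST API endpoint.
url = f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/items"
resp = requests.get(url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
resp.raise_for_status()

# Filtering on "Staging" in the display name is an assumption; if comment 1
# holds, at most one staging lakehouse and one staging warehouse will match.
for item in resp.json().get("value", []):
    if "Staging" in item.get("displayName", ""):
        print(item["type"], item["displayName"])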

 

I want to reiterate the importance of not deleting these items. They are required for dataflows to operate and should be viewed as system artifacts. We are planning to make changes in the future to both improve the customer experience and prevent their deletion.

 

I am planning to add documentation with more details about how Dataflows Gen2 work, their use of staging storage and compute, and best practices to get the most out of the Dataflows Gen2 architecture.

 

Thank you,

Ben

 


7 REPLIES
R1k91
Solution Supplier

I'd expect them to be hidden; they're system objects that only create confusion.


--
Riccardo Perico
BI & Power BI Engineer @ Lucient Italia

Blog | GitHub

If this post helps, then please consider accepting it as the solution to help other members find it more quickly.

Hi Ben,

 

Thanks for sending through your thorough response. I had two questions

 

Question 1:

In your first comment you say that, regardless of the number of dataflows created, there will only be one warehouse and one lakehouse. I seem to be getting multiple. Am I interpreting this incorrectly?

 


 

Question 2 (sorry, there are actually multiple questions in this one):

One of my dataflows just does some pretty basic transformations on Excel files and probably does not need to stage the data in a lakehouse/warehouse. From this I get the following questions:

  1. When you say "load enabled", are you talking about the "Enable Staging" button when you right-click a source?
  2. And so if I did not want the staging lakehouse to be produced, I should set this off on all of the sources?
  3. However, there is a limitation where one has to be set to true?
  4. When the release comes out allowing me to set all sources to have staging disabled, if I do so for a dataflow that already has an associated lakehouse, will it be smart enough to remove that lakehouse?


Thanks in advance for the response. Cheers.

Question 1: what will always be one of each are "DataflowStagingLakehouse" and "DataflowStagingWarehouse": there will be at most one per type per workspace, and each Dataflow Gen2 will stage data in them (if required). The lineage view sometimes duplicates "boxes" even when they refer to the same object (it's been a bug there for ages, IMHO).


Question 2:

 

  1. When you say "load enabled", are you talking about the "Enable Staging" button when you right-click a source? Yes.
  2. And so if I did not want the staging lakehouse to be produced, I should set this off on all of the sources? It will be produced in any case; it's in the backend even if you don't use it.
  3. However, there is a limitation where one has to be set to true? Some Power Query transformations may perform better if data is staged in the backend staging area (typically joins).
  4. When the release comes out allowing me to set all sources to have staging disabled, if I do so for a dataflow that already has an associated lakehouse, will it be smart enough to remove that lakehouse? It will be produced in any case; it's in the backend even if you don't use it. If you don't need it, disable staging on all queries and set a destination for each query you need; the data will then be streamed from the source directly to the destination, without being staged, thanks to the Mashup engine.

 

I personally understood much more about this strange architecture when I first connected to the SQL endpoint of a lakehouse or warehouse via SSMS. You then realize that the endpoint is unique per workspace, and that every lakehouse, warehouse, KQL database, and staging database lives there (even if invisible in the UI).
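
Here's a minimal sketch of that exploration, assuming pyodbc with the ODBC Driver 18 for SQL Server and interactive Azure AD sign-in (the server name is a placeholder; copy it from any lakehouse or warehouse SQL endpoint in the workspace):

import pyodbc

# Placeholder: copy the SQL connection string from any lakehouse or warehouse
# in the workspace; they all point at the same per-workspace endpoint.
SERVER = "<your-endpoint>.datawarehouse.fabric.microsoft.com"

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    f"Server={SERVER};"
    "Authentication=ActiveDirectoryInteractive;"  # prompts for Azure AD sign-in
    "Encrypt=yes;"
)

# Every lakehouse, warehouse, and staging database in the workspace shows up
# as a database on this single endpoint, staging artifacts included.
for (name,) in conn.execute("SELECT name FROM sys.databases ORDER BY name;"):
    print(name)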


--
Riccardo Perico
BI & Power BI Engineer @ Lucient Italia

Blog | GitHub

If this post helps, then please consider accepting it as the solution to help other members find it more quickly.

Leaving aside the question of deleting these objects (it makes sense that this would be a bad idea), would you also recommend that we not store data there (e.g., as an output destination from a Gen2 dataflow)?

Anonymous
Not applicable

Yes. While it's not prevented today, these artifacts should not be used outside the dataflow experience.

BryanCarmichael
Advocate I

They seem to be internal artifacts that are used as part of the process of creating Gen2 dataflows and populating your lakehouse.

 

The general assumption I have heard is that when this goes into production these will be hidden.

 

The most important thing is to not change or delete anything in there, as it does nasty things. In short, pretend that they do not exist.
