smpa01
Super User

DF Staging Q

I haven't worked with DF Gen2 at all, hence the following questions.

 

If I don't publish my DF table to a data destination and don't enable staging, the table (authored with PQ) is not visible to Power BI's dataflow connector (the connector depends on the DF's internal storage, much like Gen1).

 

However, if I want to publish my DF to a data destination (Lakehouse), should I enable staging or not?

 

Without enabling staging, will the data still publish to a destination (e.g. Lakehouse)? If it gets published, will it still update in the destination upon DF refresh with staging disabled?

With staging enabled, does the data published in the Lakehouse have any chance of duplication?

 

I am seeking best-practice advice: when publishing to a destination, what should I do about staging so that it does not result in data duplication in the destination?

 

Also, for Update = Append, should I enable or disable staging (does enabling/disabling staging matter at all)?

 

Also, upon publishing the table to LH, I see that this weirdly named table got published to the LH as well. Why does that happen, and what should I do with it? It is a replica of the same dim_propchange without the headers. dim_propchange is a table authored in DFGen2, with sources coming from SP and published to LH with Update=Replace and Staging Disabled.

 

smpa01_0-1722994213624.png

 

 

@frithjof_v 

Did I answer your question? Mark my post as a solution!
Proud to be a Super User!
My custom visualization projects
Plotting Live Sound: Viz1
Beautiful News:Viz1, Viz2, Viz3
Visual Capitalist: Working Hrs
4 REPLIES
v-zhangtin-msft
Community Support

Hi, @smpa01 

 

When publishing your Dataflow (DF) table to a Data destination like Lakehouse, it is generally recommended to disable staging to improve performance. When staging is enabled, ingestion will take more time. By default, staging is disabled when loading data into the Lakehouse or other non-warehouse destinations. This means that the data is directly written to the data destination without using staging.

Data Factory Spotlight: Dataflow Gen2 | Microsoft Fabric Blog | Microsoft Fabric

 

Best Regards,

Community Support Team _Charlotte

If this post helps, then please consider accepting it as the solution to help other members find it more quickly.

There is some information in this comment which is new to me:

 

"However, the data might not be as clean or organized as it would be with staging enabled."

 

"However, there might be a risk of inconsistencies or incomplete data updates since staging helps in managing incremental changes and ensuring data integrity."

 

"Enabling staging helps in managing and organizing the data before it is published to the destination. This process reduces the risk of data duplication and ensures that the data is clean and consistent."

 

"In summary, enabling staging is generally recommended to ensure data integrity, avoid duplication, and manage incremental updates effectively."

 

Where did you find this information?

 

This information is surprising to me and I would like to get more information about this. Can you please explain more about why there is a risk of inconsistencies or incomplete data when the enable staging option is disabled?

 

Disable staging is, after all, the default setting in Dataflows Gen2 when loading to Lakehouse: https://learn.microsoft.com/en-us/fabric/data-factory/dataflow-gen2-data-destinations-and-managed-se...

 

I thought the primary purpose of enable/disable staging was related to performance optimization of the Dataflow Gen2.

Thanks for this, @frithjof_v. Yes, the default is Disable staging.

 

Dataflow Gen2 is not even comparable to a notebook in terms of performance. I am not surprised by that, and I don't expect PQ to be faster than a notebook's distributed processing.

However, there are situations when I don't have an option other than to rely on DFGen2 (e.g. SharePoint). In future, if I can procure access to the Graph API, I can discard this option.

 

@v-zhangtin-msft @miguel can you please validate the comments from @frithjof_v?

 

There are two things I care about most if I have to rely on DF Gen2, in order of priority:

 

a. What do I need to do to ensure that data is not duplicated in the destination (e.g. Lakehouse) and that Incremental Refresh works (I am willing to overlook performance)? Enable staging or disable staging?

 

b. If (a) is satisfied, what performance tuning options are available?

 

frithjof_v
Resident Rockstar

I haven't studied the topic of Enable / Disable staging very much.

 

I hope this helps: https://learn.microsoft.com/en-us/fabric/data-factory/dataflow-gen2-data-destinations-and-managed-se...

 

I don't think staging/no staging should have any impact on duplication of data in the destination.
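
To illustrate the reasoning, here is a toy Python sketch. It is not Fabric or Dataflow Gen2 code, and `refresh` and its parameters are made-up names. It assumes that staging only changes the route the rows take on their way to the destination, and that the update method (Replace or Append) alone decides what happens to rows already in the table. Under that assumption, duplication across refreshes depends on the update method, not on the staging setting.

```python
# Toy model only: NOT Fabric / Dataflow Gen2 code.
# refresh() and its parameters are hypothetical names for illustration.

def refresh(destination, source_rows, update_method, staging_enabled):
    """Simulate one refresh and return the new destination table."""
    rows = list(source_rows)
    if staging_enabled:
        # Assumption: staging is just an intermediate copy of the same rows.
        staged = list(rows)
        rows = staged
    if update_method == "replace":
        # Replace: existing destination rows are discarded.
        return rows
    if update_method == "append":
        # Append: new rows are added after the existing ones.
        return destination + rows
    raise ValueError(f"unknown update method: {update_method}")


source = [("p1", "A"), ("p2", "B")]  # unchanged between refreshes
results = {}
for method in ("replace", "append"):
    for staging in (False, True):
        table = []
        for _ in range(2):  # two refreshes of the same source
            table = refresh(table, source, method, staging)
        results[(method, staging)] = len(table)
        print(f"update={method:7} staging={staging!s:5} rows={len(table)}")
```

In this model, Replace leaves 2 rows after two refreshes and Append leaves 4, with or without staging. So if the source returns the same rows on every refresh, Append is what produces duplicates.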

 

Regarding that strange table name, I have not seen or heard about that before.
