Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Find everything you need to get certified on Fabric—skills challenges, live sessions, exam prep, role guidance, and more. Get started

Reply
iBusinessBI
Kudo Collector
Kudo Collector

Dataflow vs. Dataflow Gen2 vs. Warehouse

I have the following scenario:

1. Step 1 - I want to connect to Salesforce and get an Opportinities table AS-IS

2. I want to connect to the table from Step 1 and do all kinds of complex transformations with Power Query.

 

My task is to make an API request JUST ONCE to Salesforce. All the rest load should be done inside Microsoft capacity. So I want to separate the two steps most efficiently.

(When I do everything inside ONE Dataflow - the second step is sending API requests again and again and again to Salesforce so it is not an option)

 

I have created 2 dataflows and the second one refers to the first (it created a "linked" entity with enhances mode, watever it means). And I've created a Power Automate flow so the second dataflow will refresh after the first is finished.

But it still takes a long time for the second one to run. And I wonder whether it still propagates the API calls to Salesforce because of "linked entity".

 

I am thinking to test another alternative:

Create a Gen2 Dataflow, save the initial "AS-IS" table in a Warehouse/or Lakehouse.
Then, connect the second Dataflow to it. And schedule everything with Pipeline.

 

Should there be differences in this approach?

Or is it more or less the same as just separating two Gen1 dataflows?

 

What does the theory tell?

1 ACCEPTED SOLUTION

Thanks for the follow up! 

 

  1. The output can become anything you want! I believe you needed to trigger an API in salesforce before the refresh starts, if that is not the case you can ignore this activity in the pipeline and straight away run the first dataflow to consume from salesforce and load it into your lakehouse
  2. Some advantages compared to gen1 are:
    1. You can decide where to save the data with data destinations feature in gen2, which can easily be accessed by other means. So you are not restricted to the dataflow connector anylonger as you were with dataflows gen1.
    2. Bringing in the data raw into your lakehouse allows you to build different dataflows on top of it for experimenting or other business needs without putting load on your salesforce instance multiple times. 
    3. Gen2 compared to Gen1 the performance for complex transformations will be better as we try to leverage the compute endpoints of the staging lakehouse and warehouse which in many cases can be way faster than the native PowerQuery engine when working with large datasets. 

Hope this clarifies 😁

View solution in original post

7 REPLIES 7
v-cboorla-msft
Community Support
Community Support

Hi @iBusinessBI 

 

Glad that your query got resolved. Please continue using Fabric Community for any help regarding your queries.

LuitwielerMSFT
Microsoft Employee
Microsoft Employee

Hi BusinessBI! 

 

My name is Jeroen Luitwieler and a Senior Product Manager on the Dataflow/dataintegration team. 

 

Looking at your case you may want to consider the following: 

LuitwielerMSFT_0-1701965571001.png

Steps:

  • Create a lakehouse in your workspace
  • Create a pipeline in your fabric workspace
    • Add a Web activity to sent a http request to salesforce
    • On success, run the first Dataflow Gen2 that:
      • Raw ingest all the tables you are interested in
      • Use data destination to load the tables into the lakehouse
      • Disable staging for all tables to improve performance
    • On success of that Dataflow, run another dataflow that:
      • Consumes the ingested data from the lakehouse
      • Apply the transformations to the data
      • Write the results of your transformations into your data destination of your liking
  • Apply the schedule to the data pipeline that fits your business needs

Would this cover your usecase? do you have any questions or concerns with this solution? 

Thanks, @LuitwielerMSFT  Jeroen,

That's interesting.... Some thoughts:

1. Where does the Web activity saves its output to? (What is the source for the "Ingest to lakehouse" DF?)
2. What is the advantage of saving the intermediate results to a Lakehouse?
I could just build 2 regular Gen1 Dataflows:

- the first connects to the Salesforce and brings all the table in the raw format as-is.

- the second DF connects to the first and does all the necessary transformation.

So what advantage I get working with the Lakehouse and Gen2 DF, instead of just 2 regular Gen1 DF?
Thanks 

Thanks for the follow up! 

 

  1. The output can become anything you want! I believe you needed to trigger an API in salesforce before the refresh starts, if that is not the case you can ignore this activity in the pipeline and straight away run the first dataflow to consume from salesforce and load it into your lakehouse
  2. Some advantages compared to gen1 are:
    1. You can decide where to save the data with data destinations feature in gen2, which can easily be accessed by other means. So you are not restricted to the dataflow connector anylonger as you were with dataflows gen1.
    2. Bringing in the data raw into your lakehouse allows you to build different dataflows on top of it for experimenting or other business needs without putting load on your salesforce instance multiple times. 
    3. Gen2 compared to Gen1 the performance for complex transformations will be better as we try to leverage the compute endpoints of the staging lakehouse and warehouse which in many cases can be way faster than the native PowerQuery engine when working with large datasets. 

Hope this clarifies 😁

Jonan
New Member

The theory tells you that the 2nd way of working is good. It is common for data to be loaded into a DWH as is, and only to be transformed afterwards. The goal of this approach is often to not disturb the connection with the source for too long and having an easier time debugging.

But isn't it exactly what two separate Dataflows Gen1 can achieve? Without need for DWH?

v-cboorla-msft
Community Support
Community Support

Hi @iBusinessBI 

 

Thanks for using Fabric Community.
At this time, we are reaching out to the internal team to get some help on this.
We will update you once we hear back from them.


Thanks

Helpful resources

Announcements
Sept Fabric Carousel

Fabric Monthly Update - September 2024

Check out the September 2024 Fabric update to learn about new features.

Expanding the Data Factory Forums

New forum boards available in Data Factory

Ask questions in Apache Airflow Job and Mirroring.

September Hackathon Carousel

Microsoft Fabric & AI Learning Hackathon

Learn from experts, get hands-on experience, and win awesome prizes.