Solved: Dataflow vs. Dataflow Gen2 vs. Warehouse

iBusinessBI · ‎12-06-2023

I have the following scenario:

1. Step 1 - I want to connect to Salesforce and get an Opportinities table AS-IS

2. I want to connect to the table from Step 1 and do all kinds of complex transformations with Power Query.

My task is to make an API request JUST ONCE to Salesforce. All the rest load should be done inside Microsoft capacity. So I want to separate the two steps most efficiently.

(When I do everything inside ONE Dataflow - the second step is sending API requests again and again and again to Salesforce so it is not an option)

I have created 2 dataflows and the second one refers to the first (it created a "linked" entity with enhances mode, watever it means). And I've created a Power Automate flow so the second dataflow will refresh after the first is finished.

But it still takes a long time for the second one to run. And I wonder whether it still propagates the API calls to Salesforce because of "linked entity".

I am thinking to test another alternative:

Create a Gen2 Dataflow, save the initial "AS-IS" table in a Warehouse/or Lakehouse.
Then, connect the second Dataflow to it. And schedule everything with Pipeline.

Should there be differences in this approach?

Or is it more or less the same as just separating two Gen1 dataflows?

What does the theory tell?

LuitwielerMSFT · ‎12-07-2023

Thanks for the follow up!

The output can become anything you want! I believe you needed to trigger an API in salesforce before the refresh starts, if that is not the case you can ignore this activity in the pipeline and straight away run the first dataflow to consume from salesforce and load it into your lakehouse
Some advantages compared to gen1 are:
1. You can decide where to save the data with data destinations feature in gen2, which can easily be accessed by other means. So you are not restricted to the dataflow connector anylonger as you were with dataflows gen1.
2. Bringing in the data raw into your lakehouse allows you to build different dataflows on top of it for experimenting or other business needs without putting load on your salesforce instance multiple times.
3. Gen2 compared to Gen1 the performance for complex transformations will be better as we try to leverage the compute endpoints of the staging lakehouse and warehouse which in many cases can be way faster than the native PowerQuery engine when working with large datasets.

Hope this clarifies 😁

View solution in original post

v-cboorla-msft · ‎12-13-2023

Hi @iBusinessBI

Glad that your query got resolved. Please continue using Fabric Community for any help regarding your queries.

LuitwielerMSFT · ‎12-07-2023

Hi BusinessBI!

My name is Jeroen Luitwieler and a Senior Product Manager on the Dataflow/dataintegration team.

Looking at your case you may want to consider the following:

Steps:

Create a lakehouse in your workspace
Create a pipeline in your fabric workspace
- Add a Web activity to sent a http request to salesforce
- On success, run the first Dataflow Gen2 that:
  - Raw ingest all the tables you are interested in
  - Use data destination to load the tables into the lakehouse
  - Disable staging for all tables to improve performance
- On success of that Dataflow, run another dataflow that:
  - Consumes the ingested data from the lakehouse
  - Apply the transformations to the data
  - Write the results of your transformations into your data destination of your liking
Apply the schedule to the data pipeline that fits your business needs

Would this cover your usecase? do you have any questions or concerns with this solution?

iBusinessBI · ‎12-07-2023

Thanks, @LuitwielerMSFT Jeroen,

That's interesting.... Some thoughts:

1. Where does the Web activity saves its output to? (What is the source for the "Ingest to lakehouse" DF?)
2. What is the advantage of saving the intermediate results to a Lakehouse?
I could just build 2 regular Gen1 Dataflows:

- the first connects to the Salesforce and brings all the table in the raw format as-is.

- the second DF connects to the first and does all the necessary transformation.

So what advantage I get working with the Lakehouse and Gen2 DF, instead of just 2 regular Gen1 DF?
Thanks

LuitwielerMSFT · ‎12-07-2023

Thanks for the follow up!

The output can become anything you want! I believe you needed to trigger an API in salesforce before the refresh starts, if that is not the case you can ignore this activity in the pipeline and straight away run the first dataflow to consume from salesforce and load it into your lakehouse
Some advantages compared to gen1 are:
1. You can decide where to save the data with data destinations feature in gen2, which can easily be accessed by other means. So you are not restricted to the dataflow connector anylonger as you were with dataflows gen1.
2. Bringing in the data raw into your lakehouse allows you to build different dataflows on top of it for experimenting or other business needs without putting load on your salesforce instance multiple times.
3. Gen2 compared to Gen1 the performance for complex transformations will be better as we try to leverage the compute endpoints of the staging lakehouse and warehouse which in many cases can be way faster than the native PowerQuery engine when working with large datasets.

Hope this clarifies 😁

Anonymous · ‎12-07-2023

The theory tells you that the 2nd way of working is good. It is common for data to be loaded into a DWH as is, and only to be transformed afterwards. The goal of this approach is often to not disturb the connection with the source for too long and having an easier time debugging.

iBusinessBI · ‎12-07-2023

But isn't it exactly what two separate Dataflows Gen1 can achieve? Without need for DWH?

v-cboorla-msft · ‎12-07-2023

Hi @iBusinessBI

Thanks for using Fabric Community.
At this time, we are reaching out to the internal team to get some help on this.
We will update you once we hear back from them.

Thanks