Scott_Powell
Advocate II

Dataflow Gen 2 - how to join tables and load results, without loading initial tables?

Hi - I have what I think should be a simple thing to do with Dataflows Gen2 (it could easily be done in Gen1 dataflows), but I can't figure out how. What I'd like to do: load all records for the current fiscal year from an on-prem data warehouse fact table into Fabric.

 

With Gen1 dataflows this was easy - I'd add our Date dim and the fact table from the SQL Server table list, and set both of those tables to NOT load. Then I'd create a reference to the date table, filter it to the current fiscal year (using a flag on the date table), and "merge" it with the fact table. The resulting table would then be loaded into a dataset or dataflow.
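In M terms, that pattern looked roughly like the sketch below (the query names, the DateKey join column, and the IsCurrentFiscalYear flag are all made up for illustration):

```
// Reference the load-disabled DateDim query and keep the current fiscal year only
let
    CurrentFY = Table.SelectRows(DateDim, each [IsCurrentFiscalYear] = true),
    // Inner-merge against the load-disabled fact query: only matching fact rows survive
    Merged = Table.NestedJoin(CurrentFY, {"DateKey"}, FactSales, {"DateKey"}, "Fact", JoinKind.Inner),
    // Expand just the fact columns needed downstream
    Result = Table.ExpandTableColumn(Merged, "Fact", {"DepartmentKey", "Amount"})
in
    Result
```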

 

But with Gen2 dataflows, there's no longer an option to "disable" load - it's been replaced by the "Use staging" option or whatever it's called. If I pull in my date dim and fact table as before, I can still reference them and create an output table that's appropriately filtered, but I don't seem to have a way to prevent the original two tables from loading - meaning the ENTIRE fact and dim tables will load, in addition to the resulting table I actually want.

 

One potential alternative that might be suggested but is not acceptable: hard-coding a date range and directly filtering the fact table without joining to the dimension table. This is unacceptable for two reasons: 1) our date table contains logic for things like "current fiscal year", and I want that to dynamically drive which rows get returned; 2) what if I want to (for example) filter down to a specific department instead of filtering by date? We use surrogate keys, so the ONLY way to filter the fact table by department is to do the join.
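For the department case, the only workable shape is a join like this (again a sketch - DepartmentDim, DepartmentName, and the key names are invented):

```
let
    // Filter the dimension on a business attribute...
    Depts = Table.SelectRows(DepartmentDim, each [DepartmentName] = "Cardiology"),
    // ...then inner-join the fact to it on the surrogate key; the join IS the filter
    Joined = Table.NestedJoin(FactSales, {"DepartmentKey"}, Depts, {"DepartmentKey"}, "Dept", JoinKind.Inner),
    // Drop the nested column; it was only there to restrict the fact rows
    Result = Table.RemoveColumns(Joined, {"Dept"})
in
    Result
```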

 

Appreciate any ideas on this - it seems like we've lost basic functionality, but maybe there's a workaround I don't know about.

 

Thanks!

Scott

1 ACCEPTED SOLUTION
DennesTorres
Responsive Resident

Hi,

I understand you are trying to load the tables from on-prem to a lakehouse/data warehouse on Fabric.

The disable-load option we had before was very useful because a Gen1 dataflow was a starting point for modeling. Once created, the datasets built over the dataflow would "see" all the loaded tables, and many times we didn't want that. But this doesn't mean the tables weren't loaded into memory during refresh to produce the new one you "modelled".

 

Dataflow Gen2, on the other hand, is used to load data into the lake; it's not a modeling starting point. So you don't need to worry about "loading" a table. For each table you want to land in the data lake, you set a destination.

So, after merging the tables the way you want, just set a destination on the resulting table and ignore the others; only the resulting table will be loaded into the lake, where the modeling will actually happen. The "work tables" created in the middle of the ETL will not land in the lake.

But nothing changes with respect to how heavy the processing is. If your concern is the processing cost, you may want to consider incremental load options.
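For example, the usual Power Query incremental pattern filters the fact on the RangeStart/RangeEnd datetime parameters from the standard incremental refresh convention (the server, database, and column names below are placeholders):

```
let
    Source = Sql.Database("onprem-server", "DW"),
    Fact = Source{[Schema = "dbo", Item = "FactSales"]}[Data],
    // Only rows inside the current refresh window are requested; the filter can fold to SQL
    Windowed = Table.SelectRows(Fact, each [LoadDateTime] >= RangeStart and [LoadDateTime] < RangeEnd)
in
    Windowed
```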

Kind Regards,

 

Dennes


5 REPLIES
Scott_Powell
Advocate II

Hi all - sorry, I've been tied up and unable to respond in a timely manner. My original post maybe wasn't clear - this topic is NOT about staging vs. not staging. It's about being able to fold a query down to the on-prem server and bring back slices of data instead of entire tables.

 

I need to be able to filter a dimension table (date, department, etc.) to a subset of values, and then ONLY bring back the matching rows from the fact table.

 

In Dataflow Gen1 this was easy - you set the source dimension table and fact table to "disable load", then create references to them (with load enabled) where you apply the filters and bring back only the rows you need. That no longer seems possible - the "source" copies can't be stopped from loading, so it ends up pulling down billions of rows instead of just the small subset I want.

 

I'll continue playing and see if I can come up with a workaround.

 

Thanks,

Scott

Hi, Scott,

 

It took me a while to understand what you mean.

 

In summary, you are referring to the fact that sometimes, when you disabled the load of some dimensions, the query for the fact had the opportunity to fold, becoming a single query sent to the source.
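Roughly, a filtered merge like the sketch below (hypothetical names) could be translated into one native query when folding succeeds:

```
let
    Dates = Table.SelectRows(DateDim, each [IsCurrentFiscalYear] = true),
    Merged = Table.NestedJoin(FactTable, {"DateKey"}, Dates, {"DateKey"}, "D", JoinKind.Inner),
    Result = Table.RemoveColumns(Merged, {"D"})
in
    Result
// When folding succeeds, the source receives approximately:
//   SELECT f.* FROM FactTable f
//   INNER JOIN DateDim d ON f.DateKey = d.DateKey
//   WHERE d.IsCurrentFiscalYear = 1
```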

 

I'm guessing a lot here, but I think the folding opportunity was a side effect of disable load. Disable load was mainly intended to keep the table out of the result; whether it allowed the query to fold was a side effect.

 

The problem is that you are right: it was a very nice side effect, and we are missing it.

 

I suggest registering this as an issue at https://community.fabric.microsoft.com/t5/Issues/idb-p/Issues . But this is a complex one - you will need a very good example and step-by-step instructions to illustrate it as an issue.

 

Kind Regards,

 

Dennes

Hi @DennesTorres , I don't think in my case it was a side effect - it's literally critical for anyone who wants or needs to load only slices of data instead of entire tables. Note that I'm documenting a very simple example below, mostly so I can link to this post when I enter the issue as you suggested (which is the proper way to go - thanks!).

 

Below is a very simple example of a dataset (I work for a hospital system). The steps I go through to build any dataset like this are (note - see the picture at the bottom for the final result):

 

Step 1: Bring in SQL Server table definitions for all tables that will be in use (our DW is on-prem SQL Server). Set all of those tables to DISABLE LOAD and store them in a "sources" group.
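Each "sources" query is just a plain table reference, something like this (server and database names made up):

```
// One query per source table; in Gen1 each was set to "disable load"
let
    Source = Sql.Database("onprem-sql01", "EDW"),
    DateDim = Source{[Schema = "dbo", Item = "DateDim"]}[Data]
in
    DateDim
```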

 

Step 2: Create a "fact driver" staging table that focuses on the set of data I'm interested in. In this case I start with the DateDim source table, filter it to the set of dates needed, and then join it to the hospital admissions fact source table. This lets me subset the original fact table - functionality we seem to have lost. It also completely folds to SQL Server, so ONLY the records I want are pulled across. Note that the staging tables also have load disabled (to support step 3c below).
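The fact driver is roughly the following (column names are illustrative; starting from the fact side is equivalent for an inner join):

```
let
    // Filter the date source to the window of interest...
    Dates = Table.SelectRows(DateDimSrc, each [IsCurrentFiscalYear] = true),
    // ...and inner-join the admissions fact to it, so only in-scope rows come across
    Merged = Table.NestedJoin(FactAdmissionsSrc, {"DateKey"}, Dates, {"DateKey"}, "D", JoinKind.Inner),
    FactDriver = Table.RemoveColumns(Merged, {"D"})
in
    FactDriver
```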

 

Step 3a: Create the "final model" date dim by referencing the original Date Dim and applying the same filtering logic used in the staging fact table.

 

Step 3b: Create the final model fact table - really just a reference to the staging fact table, but with load ENABLED

 

Step 3c: Create the remaining "non-driver" dimension tables, containing only the values that actually appear in the staging fact table. For example, with PatientDim I start with the staging fact table, remove all columns except PatientKey, deduplicate it (because a patient may have more than one hospital admission), then join it to the source PatientDim table. The result is a final model PatientDim containing only the subset of patients who actually had a hospital admission. This also cleanly folds back to SQL Server, so it's very fast. In our case the source patient dim has 14.6 million rows (type 2 dim), but the final Patient Dim has only about 600,000 rows. We seem to have lost this capability as well.
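In M terms, step 3c is roughly (names illustrative again):

```
let
    // Distinct patient keys that actually appear in the staging fact...
    Keys = Table.Distinct(Table.SelectColumns(FactDriver, {"PatientKey"})),
    // ...inner-joined back to the full patient dimension to keep only those patients
    Merged = Table.NestedJoin(PatientDimSrc, {"PatientKey"}, Keys, {"PatientKey"}, "K", JoinKind.Inner),
    PatientDim = Table.RemoveColumns(Merged, {"K"})
in
    PatientDim
```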

 

I'm interested in whether this seems clear, or whether there's a better way to phrase it to make it easier for the issues team to understand.

 

Thank you!

Scott

 

Picture of final dataset: [screenshot of the final query layout]

 

 

miguel
Community Admin

Hi!

 

I'd highly recommend checking out the blog post we have on Dataflows Gen2 to better understand when to use the "Enable Staging" feature and how it works:

https://blog.fabric.microsoft.com/blog/data-factory-spotlight-dataflows-gen2

