Hi,
I have a SQL source from which, for this example, I want to load Table A and Table B, but I also want to add some columns from Table B to Table A via a merge.
First workflow I tried:
1. Staging dataflow that loads Table A and Table B without further transformation (except for some 'remove other columns' steps)
2. Set up Incremental Refresh on both tables, with partitioning on month level and detect data changes.
3. New Dataflow that gets tables from first staging dataflow as linked tables, Enhanced Compute Engine set to On.
4. Merge Table A and Table B into a new calculated table; I believe this uses the SQL endpoint of the staging dataflow (a sketch of this step follows the list)
5. Set up Incremental Refresh on the new Calculated Table, with partitioning on month level and detect data changes.
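For reference, a minimal Power Query M sketch of the merge step (step 4), assuming the linked tables come through as TableA and TableB and that the shared key and the extra columns are called KeyB, Col1 and Col2 (placeholder names, not my actual schema):

```
// Hedged sketch of step 4: merge the linked tables and expand the
// extra columns from Table B. All column names are placeholders.
let
    Merged = Table.NestedJoin(
        TableA, {"KeyB"},
        TableB, {"KeyB"},
        "TableB", JoinKind.LeftOuter
    ),
    Expanded = Table.ExpandTableColumn(
        Merged, "TableB",
        {"Col1", "Col2"},
        {"B.Col1", "B.Col2"}
    )
in
    Expanded
```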
I figured this should be the fastest approach, since the staging dataflow keeps both tables up to date and the merge is then done by the enhanced compute engine on the dataflow instead of at the source. But it turns out to be slower than the following workflow:
1. Single dataflow that loads Table A and Table B, but with load disabled on Table B
2. Merge Table B into Table A
3. Reference Table B again in a separate query, this time with load enabled
4. Set up incremental refresh on Table A and Table B with detect data changes
In this case the merge is queried at the source. This is much faster, but on its own it does not account for changes in Table B. If I simply enable load on Table B, the merged table becomes a calculated table on a linked table again and the same thing happens as in the first workflow. That's why I reference Table B twice: once with load disabled (used in the merge) and once with load enabled.
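Roughly what that looks like in M, assuming a SQL source accessed via Sql.Database (server, database, table and column names below are placeholders, not the real ones):

```
// Query "TableB_ForMerge" - Enable Load switched OFF in the dataflow.
// It only exists so the merge below can be evaluated at the SQL source.
let
    Source = Sql.Database("myserver", "mydb"),
    B = Source{[Schema = "dbo", Item = "TableB"]}[Data]
in
    B

// Query "Table A" - merges in the non-loaded reference to Table B,
// so the join is pushed to the source rather than run in the dataflow.
let
    Source = Sql.Database("myserver", "mydb"),
    A = Source{[Schema = "dbo", Item = "TableA"]}[Data],
    Merged = Table.NestedJoin(A, {"KeyB"}, TableB_ForMerge, {"KeyB"}, "B", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Merged, "B", {"Col1", "Col2"}, {"B.Col1", "B.Col2"})
in
    Expanded

// Query "Table B" - the same source table again, this time with
// Enable Load ON, so it is materialized and gets incremental refresh.
let
    Source = Sql.Database("myserver", "mydb"),
    B = Source{[Schema = "dbo", Item = "TableB"]}[Data]
in
    B
```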
When dealing with merges in combination with incremental refresh and detect data changes, what is the proper approach? I want to minimize the data load while keeping both tables up to date. The documentation recommends a staging dataflow followed by a transformation dataflow for any further ETL such as joins, but as stated that is 2 to 3 times slower for me.
The proper workflow for this is not to do the merge. Let the data model do the work for you.
Fair point, and that's usually what I do, so this is an exception. That's why I wanted to do it as far upstream as possible (I can't change the source views, so the dataflow is where it ends) rather than as a calculated column in DAX.
In my personal opinion a dataflow is only useful to shield the developers (NOT the report users) from a slow data source. If your data source is not slow then eliminate the dataflow and use a dataset with incremental refresh.
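For context, if you go the dataset route, incremental refresh is driven by filtering the query on the reserved RangeStart/RangeEnd datetime parameters. A minimal sketch, assuming the table has a ModifiedDate column (placeholder name):

```
// Hedged sketch of a dataset query set up for incremental refresh.
// RangeStart and RangeEnd are the datetime parameters Power BI
// substitutes per partition; ModifiedDate is a placeholder column.
let
    Source = Sql.Database("myserver", "mydb"),
    A = Source{[Schema = "dbo", Item = "TableA"]}[Data],
    Filtered = Table.SelectRows(
        A,
        each [ModifiedDate] >= RangeStart and [ModifiedDate] < RangeEnd
    )
in
    Filtered
```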
I plan on using the dataflow in multiple datasets; that's the reason I chose a dataflow.