March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early bird discount ends December 31.
Register NowBe one of the first to start using Fabric Databases. View on-demand sessions with database experts and the Microsoft product team to learn just how easy it is to get started. Watch now
Working on the dataflows and have encountered a couple of challenges that I would like to seek your advice on.
Current Setup:
Initial Dataflow: I constructed a dataflow to load necessary tables via the On-Premises Gateway. Every day, we truncate and reload the tables, which is consuming significant time and capacity in Microsoft Fabric.
Incremental Dataflow: To address the load issue, I developed a second dataflow to capture only updated information based on the max date column from the Lakehouse table. This data is appended to the existing tables in the Lakehouse.
Problems:
Duplicate Rows: Despite implementing incremental loading, I am ending up with duplicate rows in the Lakehouse tables, which I am currently managing using a notebook for deduplication.
Deleted Rows: I need a way to track and handle deleted rows from the source, ensuring that the same deletions are reflected in the destination tables in the Lakehouse.
Proposed Solution:
I am considering setting up a third dataflow to extract only the primary keys from the source, loading them into a staging table in the Lakehouse. I would then handle the deletion logic inside a notebook by comparing the staging table against the existing data.
Request for Feedback:
Could you please provide your thoughts on the following:
Is this approach of using a third dataflow and notebook-based deletion tracking an optimal solution?
What are the best practices or more efficient ways to reduce computational overhead when handling updates and deletions in such scenarios?
Any guidance or suggestions on how to streamline this process would be highly appreciated.
Thank you for your time and support.
Solved! Go to Solution.
Hi @anusha_2023
Thank you very much frithjof_v and lbendlin for your prompt reply.
Your plan for handling stream updates and deletions is comprehensive.
Some suggestions for reducing computing overhead:
Make sure your Lakehouse table has a primary key constraint, which will prevent duplicate rows from being inserted.
Deduplication logic can be implemented directly in the data flow. Use Power Query to remove duplicates based on the primary key before loading the data into Lakehouse.
The separation of ETL processes into staging and transforming data streams can help optimize refresh times and reduce computational overhead.
Periodically monitor the performance of data streams and optimize queries to ensure efficient data processing.
Best practices for creating a dimensional model using dataflows - Power Query | Microsoft Learn
Regards,
Nono Chen
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Hi @anusha_2023
Perhaps you can consider configuring incremental refreshes.
Make sure your data extraction process is incremental, meaning it only gets new records or changed records since the last load.
Regards,
Nono Chen
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Hi @anusha_2023
Perhaps you can consider configuring incremental refreshes.
Make sure your data extraction process is incremental, meaning it only gets new records or changed records since the last load.
Regards,
Nono Chen
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Hi @anusha_2023
Thank you very much frithjof_v and lbendlin for your prompt reply.
Your plan for handling stream updates and deletions is comprehensive.
Some suggestions for reducing computing overhead:
Make sure your Lakehouse table has a primary key constraint, which will prevent duplicate rows from being inserted.
Deduplication logic can be implemented directly in the data flow. Use Power Query to remove duplicates based on the primary key before loading the data into Lakehouse.
The separation of ETL processes into staging and transforming data streams can help optimize refresh times and reduce computational overhead.
Periodically monitor the performance of data streams and optimize queries to ensure efficient data processing.
Best practices for creating a dimensional model using dataflows - Power Query | Microsoft Learn
Regards,
Nono Chen
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Hi,
Thanks for the input. I end up with three elements in my pipeline.
The first dataflow extracts the latest transactions and appends them to the table in the Lakhouse.
The second dataflow is getting the deleted transactions IDs or transactions in the source table and not in the destination lakehouse and saving them in the staging Lakehouse.
In the third step, Notebook is cleaning the deleted id's first and then deleting the duplicated rows based on the latest date field for the primarykey.
I thought of changing the first dataflow, before appending the new transactions into Lakhouse check for the duplication, but I cannot find the solution. Let me know if you find any further improvement steps in this process.
Thank you!
I think in the long run it would be better to use Parquet time travel, or proper CDC.
Deleted Rows: I need a way to track and handle deleted rows from the source, ensuring that the same deletions are reflected in the destination tables in the Lakehouse.
Power BI cannot modify individual rows. The lowest level available is the partition. You need to process the entire partition even if there is a change only in a single column/single row.
This is not inside Power BI. Loading data inside the Fabric environment workspace using dataflow gen2 through on-premises gateway connection.
Can you use a Notebook and do Upsert (Merge) instead of using Dataflow Gen2?
I need to get the current state of source primary keys. In order to get the data I do have only option connect through dataflow gen2 beacuse of on-premises data in IFS BI tool through on-premises gateway.
Could you copy the data into a staging table by using Data Pipeline Copy Activity, and then use the Notebook to do the upsert?
Hopefully Data Pipeline supports on-prem gateway.
Or you could try to use the Dataflow Gen2 to write to a destination staging table, and then use the Notebook to do the upsert from the staging table into the final destination table.
Anyway, for doing upserts and handling deletes for a Lakehouse table, you will need to use Notebook. I think you are on the right track.
Dataflow and Data Pipeline can only do full overwrite or append. So for your need, you will need to include a Notebook (if full overwrite is not an option for you).
About the performance / resource utilization - I have no idea. I would just try it out and try to make it as efficient as possible.
March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!
Arun Ulag shares exciting details about the Microsoft Fabric Conference 2025, which will be held in Las Vegas, NV.
User | Count |
---|---|
3 | |
2 | |
2 | |
2 | |
1 |