anusha_2023
Helper III

Loading data with the append option using a dataflow, and a problem with deleted rows

I am working on dataflows and have encountered a couple of challenges on which I would like your advice.

Current Setup:
Initial Dataflow: I constructed a dataflow to load the necessary tables via the On-Premises Gateway. Every day we truncate and reload the tables, which consumes significant time and capacity in Microsoft Fabric.
Incremental Dataflow: To address the load issue, I developed a second dataflow to capture only updated information based on the max date column from the Lakehouse table. This data is appended to the existing tables in the Lakehouse.
Problems:
Duplicate Rows: Despite implementing incremental loading, I am ending up with duplicate rows in the Lakehouse tables, which I am currently managing using a notebook for deduplication.

Deleted Rows: I need a way to track and handle deleted rows from the source, ensuring that the same deletions are reflected in the destination tables in the Lakehouse.

Proposed Solution:
I am considering setting up a third dataflow to extract only the primary keys from the source, loading them into a staging table in the Lakehouse. I would then handle the deletion logic inside a notebook by comparing the staging table against the existing data.
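For reference, this is roughly the notebook deletion logic I have in mind (a minimal sketch only; the table names staging_keys and orders and the key column Id are placeholders, not the real schema):

```python
# Sketch only: compare the staged source keys with the Lakehouse table and delete
# rows whose key no longer exists at the source. "staging_keys", "orders" and "Id"
# are placeholder names.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

source_keys = spark.read.table("staging_keys").select("Id")  # keys loaded by the third dataflow
target_df = spark.read.table("orders")                       # existing Lakehouse table

# Keys present in the destination but missing from the source key snapshot.
to_delete = target_df.join(source_keys, on="Id", how="left_anti").select("Id")

# Delete exactly those rows from the Delta table via a merge.
orders = DeltaTable.forName(spark, "orders")
(orders.alias("t")
       .merge(to_delete.alias("d"), "t.Id = d.Id")
       .whenMatchedDelete()
       .execute())
```

The anti-join identifies keys that exist in the Lakehouse but no longer at the source, and the merge deletes only those rows.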

Request for Feedback:
Could you please provide your thoughts on the following:

Is this approach of using a third dataflow and notebook-based deletion tracking an optimal solution?
What are the best practices or more efficient ways to reduce computational overhead when handling updates and deletions in such scenarios?
Any guidance or suggestions on how to streamline this process would be highly appreciated.

Thank you for your time and support.


9 REPLIES
v-nuoc-msft
Community Support

Hi @anusha_2023 

 

Perhaps you can consider configuring incremental refreshes.

 

Make sure your data extraction process is incremental, meaning it only gets new records or changed records since the last load.
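For illustration, a minimal sketch of how the watermark for such an incremental load could be derived in a notebook (the table name transactions and the column ModifiedDate are placeholders):

```python
# Sketch only: derive the watermark (latest loaded date) from the existing Lakehouse table.
# The table name "transactions" and the column "ModifiedDate" are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

watermark = (
    spark.read.table("transactions")
         .agg(F.max("ModifiedDate").alias("max_date"))
         .first()["max_date"]
)

# Filter the source extraction on this value so only rows newer than the last load are pulled.
print(watermark)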

 

Regards,

Nono Chen

If this post helps, please consider accepting it as the solution to help other members find it more quickly.

v-nuoc-msft
Community Support

Hi @anusha_2023 

 

Thank you very much, frithjof_v and lbendlin, for your prompt replies.

 

Your plan for handling updates and deletions is comprehensive.

 

Some suggestions for reducing computing overhead:

 

Make sure your Lakehouse table has a primary key constraint, which will prevent duplicate rows from being inserted.

 

Deduplication logic can be implemented directly in the dataflow. Use Power Query to remove duplicates based on the primary key before loading the data into the Lakehouse.

 

Separating the ETL process into staging and transformation dataflows can help optimize refresh times and reduce computational overhead.

 

Periodically monitor the performance of your dataflows and optimize queries to ensure efficient data processing.

 

Best practices for creating a dimensional model using dataflows - Power Query | Microsoft Learn

 

Regards,

Nono Chen

If this post helps, please consider accepting it as the solution to help other members find it more quickly.

Hi,

 

Thanks for the input. I ended up with three elements in my pipeline.

(screenshot of the pipeline: anusha_2023_0-1725935998657.png)

The first dataflow extracts the latest transactions and appends them to the table in the Lakehouse.

The second dataflow identifies the deleted transaction IDs, comparing the source table against the destination Lakehouse, and saves them to the staging Lakehouse.

In the third step, a notebook first removes the rows with deleted IDs and then removes duplicate rows, keeping the latest row per primary key based on the date field.
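For illustration, a minimal sketch of that notebook step (table and column names such as transactions, staging_deleted_ids, TransactionId and ModifiedDate are placeholders):

```python
# Sketch only: remove rows flagged as deleted, then keep the latest row per key.
# "transactions", "staging_deleted_ids", "TransactionId" and "ModifiedDate" are placeholders.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

main = spark.read.table("transactions")             # table the first dataflow appends to
deleted = spark.read.table("staging_deleted_ids")   # ids produced by the second dataflow

# 1) Drop rows whose primary key appears in the deleted-id staging table.
cleaned = main.join(deleted.select("TransactionId"), on="TransactionId", how="left_anti")

# 2) Keep only the latest row per primary key, based on the date field.
w = Window.partitionBy("TransactionId").orderBy(F.col("ModifiedDate").desc())
deduped = (cleaned.withColumn("rn", F.row_number().over(w))
                  .filter(F.col("rn") == 1)
                  .drop("rn"))

# Delta's snapshot isolation allows overwriting the table that was read above.
deduped.write.format("delta").mode("overwrite").saveAsTable("transactions")
```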

 

I thought of changing the first dataflow to check for duplicates before appending the new transactions into the Lakehouse, but I could not find a solution. Let me know if you see any further improvements to this process.

 

Thank you!

 

 

I think in the long run it would be better to use Delta (Parquet) time travel, or proper change data capture (CDC).
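For illustration, a minimal sketch of the time-travel idea on a Delta table, assuming a runtime that supports the VERSION AS OF syntax (the table name, key column and version number are placeholders):

```python
# Sketch only: use Delta time travel to see which keys disappeared between versions.
# "transactions", "TransactionId" and the version number 10 are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

current = spark.read.table("transactions")
previous = spark.sql("SELECT * FROM transactions VERSION AS OF 10")  # an earlier snapshot

# Keys present in the earlier version but not in the current one, i.e. deleted since then.
deleted_keys = previous.select("TransactionId").subtract(current.select("TransactionId"))
deleted_keys.show()
```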

lbendlin
Super User

Deleted Rows: I need a way to track and handle deleted rows from the source, ensuring that the same deletions are reflected in the destination tables in the Lakehouse.

Power BI cannot modify individual rows. The lowest level available is the partition. You need to process the entire partition even if there is a change only in a single column/single row.

This is not inside Power BI. I am loading data into the Fabric workspace using Dataflow Gen2 through an on-premises gateway connection.

Can you use a Notebook and do Upsert (Merge) instead of using Dataflow Gen2?

 

https://delta.io/blog/2023-02-14-delta-lake-merge/
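For illustration, a minimal sketch of such a merge with the Delta Lake Python API (table names and the key column are placeholders):

```python
# Sketch only: upsert (merge) the staged increment into the destination Delta table.
# "staging_increment", "transactions" and "TransactionId" are placeholder names.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

increment = spark.read.table("staging_increment")    # new/changed rows landed by the dataflow
target = DeltaTable.forName(spark, "transactions")   # final Lakehouse table

(target.alias("t")
       .merge(increment.alias("s"), "t.TransactionId = s.TransactionId")
       .whenMatchedUpdateAll()      # existing keys: update the row
       .whenNotMatchedInsertAll()   # new keys: insert the row
       .execute())
```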

I need to get the current state of the source primary keys. My only option for getting the data is to connect through Dataflow Gen2, because the on-premises data lives in the IFS BI tool and is reached through the on-premises gateway.

Could you copy the data into a staging table by using Data Pipeline Copy Activity, and then use the Notebook to do the upsert?

 

Hopefully Data Pipeline supports on-prem gateway.

 

Or you could try to use the Dataflow Gen2 to write to a destination staging table, and then use the Notebook to do the upsert from the staging table into the final destination table.

 

Anyway, for doing upserts and handling deletes on a Lakehouse table, you will need to use a Notebook. I think you are on the right track.

 

Dataflow and Data Pipeline can only do a full overwrite or an append. So for your needs, you will need to include a Notebook (if a full overwrite is not an option for you).

 

About the performance / resource utilization - I have no idea. I would just try it out and try to make it as efficient as possible.
