anusha_2023
Helper III

Loading data with the append option using a dataflow, and a problem with deleted rows

I am working on dataflows and have encountered a couple of challenges on which I would like your advice.

Current Setup:
Initial Dataflow: I constructed a dataflow to load the necessary tables via the On-Premises Gateway. Every day we truncate and reload the tables, which consumes significant time and capacity in Microsoft Fabric.
Incremental Dataflow: To address the load issue, I developed a second dataflow to capture only updated information based on the max date column from the Lakehouse table. This data is appended to the existing tables in the Lakehouse.
Problems:
Duplicate Rows: Despite implementing incremental loading, I am ending up with duplicate rows in the Lakehouse tables, which I am currently managing using a notebook for deduplication.

Deleted Rows: I need a way to track and handle deleted rows from the source, ensuring that the same deletions are reflected in the destination tables in the Lakehouse.

Proposed Solution:
I am considering setting up a third dataflow to extract only the primary keys from the source, loading them into a staging table in the Lakehouse. I would then handle the deletion logic inside a notebook by comparing the staging table against the existing data.
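For reference, this is roughly the notebook deletion logic I have in mind (a minimal sketch only; the table names staging_keys and orders and the key column Id are placeholders, not the real schema):

```python
# Sketch only: compare the staged source keys with the Lakehouse table and delete
# rows whose key no longer exists at the source. "staging_keys", "orders" and "Id"
# are placeholder names.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

source_keys = spark.read.table("staging_keys").select("Id")  # keys loaded by the third dataflow
target_df = spark.read.table("orders")                       # existing Lakehouse table

# Keys present in the destination but missing from the source key snapshot.
to_delete = target_df.join(source_keys, on="Id", how="left_anti").select("Id")

# Delete exactly those rows from the Delta table via a merge.
orders = DeltaTable.forName(spark, "orders")
(orders.alias("t")
       .merge(to_delete.alias("d"), "t.Id = d.Id")
       .whenMatchedDelete()
       .execute())
```

The anti-join identifies keys that exist in the Lakehouse but no longer at the source, and the merge deletes only those rows.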

Request for Feedback:
Could you please provide your thoughts on the following:

Is this approach of using a third dataflow and notebook-based deletion tracking an optimal solution?
What are the best practices or more efficient ways to reduce computational overhead when handling updates and deletions in such scenarios?
Any guidance or suggestions on how to streamline this process would be highly appreciated.

Thank you for your time and support.


9 REPLIES
v-nuoc-msft
Community Support

Hi @anusha_2023 

 

Perhaps you can consider configuring incremental refreshes.

 

Make sure your data extraction process is incremental, meaning it only gets new records or changed records since the last load.
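For illustration, a minimal sketch of how the watermark for such an incremental load could be derived in a notebook (the table name transactions and the column ModifiedDate are placeholders):

```python
# Sketch only: derive the watermark (latest loaded date) from the existing Lakehouse table.
# The table name "transactions" and the column "ModifiedDate" are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

watermark = (
    spark.read.table("transactions")
         .agg(F.max("ModifiedDate").alias("max_date"))
         .first()["max_date"]
)

# Filter the source extraction on this value so only rows newer than the last load are pulled.
print(watermark)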

 

Regards,

Nono Chen

If this post helps, please consider accepting it as the solution to help other members find it more quickly.

v-nuoc-msft
Community Support

Hi @anusha_2023 

 

Thank you very much, frithjof_v and lbendlin, for your prompt replies.

 

Your plan for handling updates and deletions is comprehensive.

 

Some suggestions for reducing computing overhead:

 

Make sure your Lakehouse table has a primary key constraint, which will prevent duplicate rows from being inserted.

 

Deduplication logic can be implemented directly in the dataflow. Use Power Query to remove duplicates based on the primary key before loading the data into the Lakehouse.

 

Separating the ETL process into staging and transformation dataflows can help optimize refresh times and reduce computational overhead.

 

Periodically monitor the performance of your dataflows and optimize queries to ensure efficient data processing.

 

Best practices for creating a dimensional model using dataflows - Power Query | Microsoft Learn

 

Regards,

Nono Chen

If this post helps, please consider accepting it as the solution to help other members find it more quickly.

Hi,

 

Thanks for the input. I ended up with three elements in my pipeline.

(screenshot of the pipeline: anusha_2023_0-1725935998657.png)

The first dataflow extracts the latest transactions and appends them to the table in the Lakehouse.

The second dataflow identifies the deleted transaction IDs, comparing the source table against the destination Lakehouse, and saves them to the staging Lakehouse.

In the third step, a notebook first removes the rows with deleted IDs and then removes duplicate rows, keeping the latest row per primary key based on the date field.
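For illustration, a minimal sketch of that notebook step (table and column names such as transactions, staging_deleted_ids, TransactionId and ModifiedDate are placeholders):

```python
# Sketch only: remove rows flagged as deleted, then keep the latest row per key.
# "transactions", "staging_deleted_ids", "TransactionId" and "ModifiedDate" are placeholders.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

main = spark.read.table("transactions")             # table the first dataflow appends to
deleted = spark.read.table("staging_deleted_ids")   # ids produced by the second dataflow

# 1) Drop rows whose primary key appears in the deleted-id staging table.
cleaned = main.join(deleted.select("TransactionId"), on="TransactionId", how="left_anti")

# 2) Keep only the latest row per primary key, based on the date field.
w = Window.partitionBy("TransactionId").orderBy(F.col("ModifiedDate").desc())
deduped = (cleaned.withColumn("rn", F.row_number().over(w))
                  .filter(F.col("rn") == 1)
                  .drop("rn"))

# Delta's snapshot isolation allows overwriting the table that was read above.
deduped.write.format("delta").mode("overwrite").saveAsTable("transactions")
```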

 

I thought of changing the first dataflow to check for duplicates before appending the new transactions into the Lakehouse, but I could not find a solution. Let me know if you see any further improvements to this process.

 

Thank you!

 

 

I think in the long run it would be better to use Delta (Parquet) time travel, or proper change data capture (CDC).
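For illustration, a minimal sketch of the time-travel idea on a Delta table, assuming a runtime that supports the VERSION AS OF syntax (the table name, key column and version number are placeholders):

```python
# Sketch only: use Delta time travel to see which keys disappeared between versions.
# "transactions", "TransactionId" and the version number 10 are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

current = spark.read.table("transactions")
previous = spark.sql("SELECT * FROM transactions VERSION AS OF 10")  # an earlier snapshot

# Keys present in the earlier version but not in the current one, i.e. deleted since then.
deleted_keys = previous.select("TransactionId").subtract(current.select("TransactionId"))
deleted_keys.show()
```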

lbendlin
Super User

Deleted Rows: I need a way to track and handle deleted rows from the source, ensuring that the same deletions are reflected in the destination tables in the Lakehouse.

Power BI cannot modify individual rows. The lowest level available is the partition. You need to process the entire partition even if there is a change only in a single column/single row.

This is not inside Power BI. I am loading data into the Fabric workspace using Dataflow Gen2 through an on-premises gateway connection.

Can you use a Notebook and do Upsert (Merge) instead of using Dataflow Gen2?

 

https://delta.io/blog/2023-02-14-delta-lake-merge/
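For illustration, a minimal sketch of such a merge with the Delta Lake Python API (table names and the key column are placeholders):

```python
# Sketch only: upsert (merge) the staged increment into the destination Delta table.
# "staging_increment", "transactions" and "TransactionId" are placeholder names.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

increment = spark.read.table("staging_increment")    # new/changed rows landed by the dataflow
target = DeltaTable.forName(spark, "transactions")   # final Lakehouse table

(target.alias("t")
       .merge(increment.alias("s"), "t.TransactionId = s.TransactionId")
       .whenMatchedUpdateAll()      # existing keys: update the row
       .whenNotMatchedInsertAll()   # new keys: insert the row
       .execute())
```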

I need to get the current state of the source primary keys. My only option for getting the data is to connect through Dataflow Gen2, because the on-premises data lives in the IFS BI tool and is reached through the on-premises gateway.

Could you copy the data into a staging table by using Data Pipeline Copy Activity, and then use the Notebook to do the upsert?

 

Hopefully Data Pipeline supports on-prem gateway.

 

Or you could try to use the Dataflow Gen2 to write to a destination staging table, and then use the Notebook to do the upsert from the staging table into the final destination table.

 

Anyway, for doing upserts and handling deletes on a Lakehouse table, you will need to use a Notebook. I think you are on the right track.

 

Dataflow and Data Pipeline can only do a full overwrite or an append. So for your needs, you will need to include a Notebook (if a full overwrite is not an option for you).

 

About the performance / resource utilization - I have no idea. I would just try it out and try to make it as efficient as possible.
