Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Did you hear? There's a new SQL AI Developer certification (DP-800). Start preparing now and be one of the first to get certified. Register now

Reply
Naveen96
Frequent Visitor

Incremental Refresh on Dataflow Gen 2 when we are using Replace at Data Destination in Lakehosue

Hi All,

I have a transaction table in a Dataflow Gen 2 on which I'm trying to enable incremental refresh. I first did a full refresh, then I set up the incremental refresh, but when I run it, it creates duplicate records for the date range of the incremental refresh.

My transaction table has Single CreatedDate Columns: Neither of these values ever change after the record is created. Daily basis we get new data but I don't want to reload all the data just I need to load only new data arrived. 

I have set up the incremental refresh as follows:

- Column to filter by: CreatedDate

- Extract data from the past: 1 days

- Bucket size: days

- Only extract new data when the max value in this column changes: CreatedDate

am writing the data to a Lakehouse, and the method is "Replace".

 

but I can't figure out how to get this to work without duplicating records. I assume that perhaps the issue is with Bucket where our data is getting duplicated for last day data if refresh is run mutliple times.

Below Configuration we are using for Incremental Refresh.

image.png

 

2 ACCEPTED SOLUTIONS
carlosmartins
Regular Visitor

Hi @Naveen96,

 

I think that you don't actually need the built-in incremental refresh feature at all. Microsoft has an official pattern called "incrementally amass data" that does exactly what you want, entirely within Dataflow Gen2, using Append mode. Here is how to set it up for your scenario 

 

Step 1: Make sure your full load is done

You mentioned you already did a full refresh with Replace. That is your baseline. If your Lakehouse table already has all the historical data, you're good to move to step 2.

 

Step 2: Disable the built-in incremental refresh

Open your dataflow, right click the query and turn off incremental refresh. You won't need it anymore.

 

Step 3: Create a helper query to get the max CreatedDate from your Lakehouse

Still inside the same dataflow, click Get Data and connect to your Lakehouse. Select the transaction table that already has your data. Rename this new query to something like MaxCreatedDate. Then right click the CreatedDate column in the data preview and select Drill Down. After that, go to the ribbon and click List Tools, then Statistics, then Maximum. This gives you a single value: the most recent CreatedDate in your Lakehouse. Finally, right click this query and select Disable Staging (this is important because it's a helper query, not a data load).

 

Step 4: Add a filter to your main source query

Go back to your main query (the one that pulls from your transaction source). Add a filter step on CreatedDate with the operation set to "is greater than" and the value set to the parameter MaxCreatedDate (your helper query). When prompted to allow combining data from the source and the Lakehouse, confirm it.

 

Step 5: Change the data destination to Append

Edit your data destination settings and switch the update method from Replace to Append. This works perfectly fine here because you are not using the built-in incremental refresh toggle anymore.

 

Step 6: Publish and run

Each time the dataflow runs, it reads the max CreatedDate from your Lakehouse, pulls only records with a CreatedDate greater than that value from your source, and appends them. No duplicates, no full reloads.

 

How to do a full reload if you ever need one

Since you're restricted to Dataflow Gen2, just edit the dataflow, remove the filter step you added in Step 4, change the destination back to Replace, publish and run it once. After that, re-edit the dataflow to add the filter back and switch to Append again.

 

One thing to keep in mind

This pattern works best if you run the dataflow once a day after all your daily data has landed. The filter uses "greater than" so it picks up records with a CreatedDate strictly newer than what's already in the Lakehouse. If your CreatedDate is a date without a time component and you run the dataflow multiple times on the same day, records from that same day that arrive after the first run would be missed. If that is a concern and your table has a unique auto-incrementing ID column, you could use that column instead of CreatedDate for the filter, which is actually what Microsoft uses in their tutorial.

(test on a dev environment first of course!)

 

The full tutorial from Microsoft is here: Pattern to incrementally amass data with Dataflow Gen2

 

Hope this helps!

 

Best regards,

Carlos

View solution in original post

Hi @Naveen96 ,

Thanks for the update.

I was able to reproduce the same behavior in my environment. Initially, I performed a full refresh which loaded all the data into the Lakehouse. After that, I enabled incremental refresh with a rolling window (past 1 day).

On teh next run, the same 1 day bucket was processed again, and since that data was already loaded during the full refresh, it resulted in duplicates.

vveshwaramsft_0-1777287500879.png

 

So this is not caused by the bucket itself, but by doing a full refresh first and then enabling incremental refresh with an overlapping window.

 

To avoid this, incremental refresh needs to be enabled from the beginning so that the data is loaded without overlap.

 

Also to clarify, when incremental refresh is enabled from the beginning, it does not load full historical data automatically. It only loads data based on the configured range, and the bucket size (for example, Day) just defines how that data is partitioned and refreshed.

 

So if your requirement is to have both historical data and incremental loads, you can configure it in one go using incremental refresh itself. Set “Extract data from the past” to cover your required history (for example, 1 year or more), keep the appropriate bucket size (for example, Day), and run the dataflow. The first run will load that historical range, and subsequent runs will continue incrementally within that range.

 

The duplication happens specifically when a full refresh is done first and then incremental refresh is enabled with a rolling window that overlaps with already loaded data.

 

Hope this helps. Please reach out for further assistance.
Thank you.

View solution in original post

10 REPLIES 10
carlosmartins
Regular Visitor

Hi @Naveen96,

 

I think that you don't actually need the built-in incremental refresh feature at all. Microsoft has an official pattern called "incrementally amass data" that does exactly what you want, entirely within Dataflow Gen2, using Append mode. Here is how to set it up for your scenario 

 

Step 1: Make sure your full load is done

You mentioned you already did a full refresh with Replace. That is your baseline. If your Lakehouse table already has all the historical data, you're good to move to step 2.

 

Step 2: Disable the built-in incremental refresh

Open your dataflow, right click the query and turn off incremental refresh. You won't need it anymore.

 

Step 3: Create a helper query to get the max CreatedDate from your Lakehouse

Still inside the same dataflow, click Get Data and connect to your Lakehouse. Select the transaction table that already has your data. Rename this new query to something like MaxCreatedDate. Then right click the CreatedDate column in the data preview and select Drill Down. After that, go to the ribbon and click List Tools, then Statistics, then Maximum. This gives you a single value: the most recent CreatedDate in your Lakehouse. Finally, right click this query and select Disable Staging (this is important because it's a helper query, not a data load).

 

Step 4: Add a filter to your main source query

Go back to your main query (the one that pulls from your transaction source). Add a filter step on CreatedDate with the operation set to "is greater than" and the value set to the parameter MaxCreatedDate (your helper query). When prompted to allow combining data from the source and the Lakehouse, confirm it.

 

Step 5: Change the data destination to Append

Edit your data destination settings and switch the update method from Replace to Append. This works perfectly fine here because you are not using the built-in incremental refresh toggle anymore.

 

Step 6: Publish and run

Each time the dataflow runs, it reads the max CreatedDate from your Lakehouse, pulls only records with a CreatedDate greater than that value from your source, and appends them. No duplicates, no full reloads.

 

How to do a full reload if you ever need one

Since you're restricted to Dataflow Gen2, just edit the dataflow, remove the filter step you added in Step 4, change the destination back to Replace, publish and run it once. After that, re-edit the dataflow to add the filter back and switch to Append again.

 

One thing to keep in mind

This pattern works best if you run the dataflow once a day after all your daily data has landed. The filter uses "greater than" so it picks up records with a CreatedDate strictly newer than what's already in the Lakehouse. If your CreatedDate is a date without a time component and you run the dataflow multiple times on the same day, records from that same day that arrive after the first run would be missed. If that is a concern and your table has a unique auto-incrementing ID column, you could use that column instead of CreatedDate for the filter, which is actually what Microsoft uses in their tutorial.

(test on a dev environment first of course!)

 

The full tutorial from Microsoft is here: Pattern to incrementally amass data with Dataflow Gen2

 

Hope this helps!

 

Best regards,

Carlos

v-veshwara-msft
Community Support
Community Support

Hi @Naveen96 ,

Thanks for reaching out to Microsoft Fabric Community.

The duplication is not caused by the bucket itself. With your current setup (past 1 day and daily bucket), each refresh reprocesses the same 1 day window.

 

Also, since CreatedDate is used for both filtering and change detection, the entire bucket gets refreshed whenever the max value changes. This can lead to duplicated data, especially if the same bucket is evaluated multiple times or data shifts between buckets.

Reference: Incremental refresh in Dataflow Gen2 - Microsoft Fabric | Microsoft Learn

Incremental refresh in Dataflow Gen2 - Microsoft Fabric | Microsoft Learn

 

To avoid this, use a different column for change detection if available (for example, a LastModifiedDate) and keep CreatedDate only for filtering. If such a column is not available, adjust the incremental range so the same time window is not reprocessed across runs.

 

Hope this helps. Please reach out for further assistance.
Thank you.

 

Hi @v-veshwara-msft 

 

In my Case I don't have other Different Columns like LastModifiedDate. did you have Step by step guide to adjust the incremental range so the same time window is not reprocessed across runs in dataflow gen2. 

Hi @Naveen96 ,

Thanks for the update.

I was able to reproduce the same behavior in my environment. Initially, I performed a full refresh which loaded all the data into the Lakehouse. After that, I enabled incremental refresh with a rolling window (past 1 day).

On teh next run, the same 1 day bucket was processed again, and since that data was already loaded during the full refresh, it resulted in duplicates.

vveshwaramsft_0-1777287500879.png

 

So this is not caused by the bucket itself, but by doing a full refresh first and then enabling incremental refresh with an overlapping window.

 

To avoid this, incremental refresh needs to be enabled from the beginning so that the data is loaded without overlap.

 

Also to clarify, when incremental refresh is enabled from the beginning, it does not load full historical data automatically. It only loads data based on the configured range, and the bucket size (for example, Day) just defines how that data is partitioned and refreshed.

 

So if your requirement is to have both historical data and incremental loads, you can configure it in one go using incremental refresh itself. Set “Extract data from the past” to cover your required history (for example, 1 year or more), keep the appropriate bucket size (for example, Day), and run the dataflow. The first run will load that historical range, and subsequent runs will continue incrementally within that range.

 

The duplication happens specifically when a full refresh is done first and then incremental refresh is enabled with a rolling window that overlaps with already loaded data.

 

Hope this helps. Please reach out for further assistance.
Thank you.

Hi @Naveen96 ,

Just wanted to check if the response provided was helpful. If further assistance is needed, please reach out.
Thank you.

Thanks, @v-veshwara-msft it was helpful.

Hi @Naveen96 ,

Thanks for the update. Could you please consider marking the helpful reply as Accepted solution as that would help others find the solution quickly.

If you need any further assistance, please reach out.
Thank you.

Tamanchu
Continued Contributor
Continued Contributor

Hi @Naveen96,

The issue is that Replace mode and Incremental Refresh are fundamentally incompatible in Dataflow Gen 2.

Here's what's happening under the hood :

  • Incremental refresh only pulls data matching your filter window (1 day in your case)
  • Replace mode wipes the entire destination table on every run, then writes whatever the dataflow just loaded

Result : each refresh → delete everything → write only the last 1 day → either data loss or, depending on timing and retries, duplicates.

Fix : switch to Append mode

In your Data Destination settings, change the write method from Replace to Append. Incremental refresh is designed to work with Append it loads only new records and appends them to what's already in the Lakehouse table.

Make sure your incremental refresh filter (CreatedDate > last max value) is tight enough to avoid re-loading already-written records.

If you occasionally need a full reload, do it manually by temporarily disabling incremental refresh and running a one-off Replace refresh then switch back to Append + incremental.

Reference: https://learn.microsoft.com/en-us/fabric/data-factory/dataflow-gen2-incremental-refresh

 

Hope this unblocks you!

Hi @Tamanchu,

I have tried to Change the Write Method from Replace to Append but Append is not allowing to use Incremental Refresh in Dataflow Gen2. Does you have any Step-by-Step Guide to achive this in Dataflow Gen2 as I Can you other but I am strictly restricted to DataFlow gen2.

Tamanchu
Continued Contributor
Continued Contributor

Hi @Naveen96,

 

You are right when you switch to Append mode, the built-in Incremental Refresh toggle becomes unavailable. This is by design. But you don't need it.

The approach I described earlier (incrementally amass data) replaces the built-in IR entirely. Here is the complete step-by-step for Dataflow Gen2 only:

 

Step 1 : Disable the built-in Incremental Refresh

Right-click your main query → Incremental Refresh → turn it off. You will not need it.

 

Step 2 : Create a watermark query (get the max date already loaded)

Inside the same dataflow, click Get Data → connect to your Lakehouse → select the destination table (the one you are writing to).

Once loaded:

  • Select only the CreatedDate column
  • Go to Transform → Statistics → Maximum
  • This gives you a single value: the last CreatedDate already in your table
  • Rename this query to LastLoadedDate
  • Right-click it → Enable load → OFF (it's just a helper, not a destination)


Step 3 : Filter your source query using the watermark

Go back to your main source query. Add a filter step:

= Table.SelectRows(PreviousStep, each [CreatedDate] > LastLoadedDate)

Or via the UI: Filter Rows → CreatedDate is after → reference LastLoadedDate.

 

Step 4 :  Set destination to Append

In Data Destination settings → change write method to Append.

 

Step 5 :  Run

Each refresh will:

Read the current max date from your Lakehouse
Load only records newer than that date
Append them no duplicates, no data loss
On the very first run after setup, make sure your Lakehouse table already has the historical data (which you said it does).

 

References :


Hope this helps!

Helpful resources

Announcements
April Fabric Update Carousel

Fabric Monthly Update - April 2026

Check out the April 2026 Fabric update to learn about new features.

Fabric SQL PBI Data Days

Data Days 2026 coming soon!

Sign up to receive a private message when registration opens and key events begin.

New to Fabric survey Carousel

New to Fabric Survey

If you have recently started exploring Fabric, we'd love to hear how it's going. Your feedback can help with product improvements.