This time we’re going bigger than ever. Fabric, Power BI, SQL, AI and more. We're covering it all. You won't want to miss it.
Learn moreDid you hear? There's a new SQL AI Developer certification (DP-800). Start preparing now and be one of the first to get certified. Register now
Hi All,
I have a transaction table in a Dataflow Gen 2 on which I'm trying to enable incremental refresh. I first did a full refresh, then I set up the incremental refresh, but when I run it, it creates duplicate records for the date range of the incremental refresh.
My transaction table has Single CreatedDate Columns: Neither of these values ever change after the record is created. Daily basis we get new data but I don't want to reload all the data just I need to load only new data arrived.
I have set up the incremental refresh as follows:
- Column to filter by: CreatedDate
- Extract data from the past: 1 days
- Bucket size: days
- Only extract new data when the max value in this column changes: CreatedDate
I am writing the data to a Lakehouse, and the method is "Replace".
but I can't figure out how to get this to work without duplicating records. I assume that perhaps the issue is with Bucket where our data is getting duplicated for last day data if refresh is run mutliple times.
Below Configuration we are using for Incremental Refresh.
Solved! Go to Solution.
Hi @Naveen96,
I think that you don't actually need the built-in incremental refresh feature at all. Microsoft has an official pattern called "incrementally amass data" that does exactly what you want, entirely within Dataflow Gen2, using Append mode. Here is how to set it up for your scenario
Step 1: Make sure your full load is done
You mentioned you already did a full refresh with Replace. That is your baseline. If your Lakehouse table already has all the historical data, you're good to move to step 2.
Step 2: Disable the built-in incremental refresh
Open your dataflow, right click the query and turn off incremental refresh. You won't need it anymore.
Step 3: Create a helper query to get the max CreatedDate from your Lakehouse
Still inside the same dataflow, click Get Data and connect to your Lakehouse. Select the transaction table that already has your data. Rename this new query to something like MaxCreatedDate. Then right click the CreatedDate column in the data preview and select Drill Down. After that, go to the ribbon and click List Tools, then Statistics, then Maximum. This gives you a single value: the most recent CreatedDate in your Lakehouse. Finally, right click this query and select Disable Staging (this is important because it's a helper query, not a data load).
Step 4: Add a filter to your main source query
Go back to your main query (the one that pulls from your transaction source). Add a filter step on CreatedDate with the operation set to "is greater than" and the value set to the parameter MaxCreatedDate (your helper query). When prompted to allow combining data from the source and the Lakehouse, confirm it.
Step 5: Change the data destination to Append
Edit your data destination settings and switch the update method from Replace to Append. This works perfectly fine here because you are not using the built-in incremental refresh toggle anymore.
Step 6: Publish and run
Each time the dataflow runs, it reads the max CreatedDate from your Lakehouse, pulls only records with a CreatedDate greater than that value from your source, and appends them. No duplicates, no full reloads.
How to do a full reload if you ever need one
Since you're restricted to Dataflow Gen2, just edit the dataflow, remove the filter step you added in Step 4, change the destination back to Replace, publish and run it once. After that, re-edit the dataflow to add the filter back and switch to Append again.
One thing to keep in mind
This pattern works best if you run the dataflow once a day after all your daily data has landed. The filter uses "greater than" so it picks up records with a CreatedDate strictly newer than what's already in the Lakehouse. If your CreatedDate is a date without a time component and you run the dataflow multiple times on the same day, records from that same day that arrive after the first run would be missed. If that is a concern and your table has a unique auto-incrementing ID column, you could use that column instead of CreatedDate for the filter, which is actually what Microsoft uses in their tutorial.
(test on a dev environment first of course!)
The full tutorial from Microsoft is here: Pattern to incrementally amass data with Dataflow Gen2
Hope this helps!
Best regards,
Carlos
Hi @Naveen96 ,
Thanks for the update.
I was able to reproduce the same behavior in my environment. Initially, I performed a full refresh which loaded all the data into the Lakehouse. After that, I enabled incremental refresh with a rolling window (past 1 day).
On teh next run, the same 1 day bucket was processed again, and since that data was already loaded during the full refresh, it resulted in duplicates.
So this is not caused by the bucket itself, but by doing a full refresh first and then enabling incremental refresh with an overlapping window.
To avoid this, incremental refresh needs to be enabled from the beginning so that the data is loaded without overlap.
Also to clarify, when incremental refresh is enabled from the beginning, it does not load full historical data automatically. It only loads data based on the configured range, and the bucket size (for example, Day) just defines how that data is partitioned and refreshed.
So if your requirement is to have both historical data and incremental loads, you can configure it in one go using incremental refresh itself. Set “Extract data from the past” to cover your required history (for example, 1 year or more), keep the appropriate bucket size (for example, Day), and run the dataflow. The first run will load that historical range, and subsequent runs will continue incrementally within that range.
The duplication happens specifically when a full refresh is done first and then incremental refresh is enabled with a rolling window that overlaps with already loaded data.
Hope this helps. Please reach out for further assistance.
Thank you.
Hi @Naveen96,
I think that you don't actually need the built-in incremental refresh feature at all. Microsoft has an official pattern called "incrementally amass data" that does exactly what you want, entirely within Dataflow Gen2, using Append mode. Here is how to set it up for your scenario
Step 1: Make sure your full load is done
You mentioned you already did a full refresh with Replace. That is your baseline. If your Lakehouse table already has all the historical data, you're good to move to step 2.
Step 2: Disable the built-in incremental refresh
Open your dataflow, right click the query and turn off incremental refresh. You won't need it anymore.
Step 3: Create a helper query to get the max CreatedDate from your Lakehouse
Still inside the same dataflow, click Get Data and connect to your Lakehouse. Select the transaction table that already has your data. Rename this new query to something like MaxCreatedDate. Then right click the CreatedDate column in the data preview and select Drill Down. After that, go to the ribbon and click List Tools, then Statistics, then Maximum. This gives you a single value: the most recent CreatedDate in your Lakehouse. Finally, right click this query and select Disable Staging (this is important because it's a helper query, not a data load).
Step 4: Add a filter to your main source query
Go back to your main query (the one that pulls from your transaction source). Add a filter step on CreatedDate with the operation set to "is greater than" and the value set to the parameter MaxCreatedDate (your helper query). When prompted to allow combining data from the source and the Lakehouse, confirm it.
Step 5: Change the data destination to Append
Edit your data destination settings and switch the update method from Replace to Append. This works perfectly fine here because you are not using the built-in incremental refresh toggle anymore.
Step 6: Publish and run
Each time the dataflow runs, it reads the max CreatedDate from your Lakehouse, pulls only records with a CreatedDate greater than that value from your source, and appends them. No duplicates, no full reloads.
How to do a full reload if you ever need one
Since you're restricted to Dataflow Gen2, just edit the dataflow, remove the filter step you added in Step 4, change the destination back to Replace, publish and run it once. After that, re-edit the dataflow to add the filter back and switch to Append again.
One thing to keep in mind
This pattern works best if you run the dataflow once a day after all your daily data has landed. The filter uses "greater than" so it picks up records with a CreatedDate strictly newer than what's already in the Lakehouse. If your CreatedDate is a date without a time component and you run the dataflow multiple times on the same day, records from that same day that arrive after the first run would be missed. If that is a concern and your table has a unique auto-incrementing ID column, you could use that column instead of CreatedDate for the filter, which is actually what Microsoft uses in their tutorial.
(test on a dev environment first of course!)
The full tutorial from Microsoft is here: Pattern to incrementally amass data with Dataflow Gen2
Hope this helps!
Best regards,
Carlos
Hi @Naveen96 ,
Thanks for reaching out to Microsoft Fabric Community.
The duplication is not caused by the bucket itself. With your current setup (past 1 day and daily bucket), each refresh reprocesses the same 1 day window.
Also, since CreatedDate is used for both filtering and change detection, the entire bucket gets refreshed whenever the max value changes. This can lead to duplicated data, especially if the same bucket is evaluated multiple times or data shifts between buckets.
Reference: Incremental refresh in Dataflow Gen2 - Microsoft Fabric | Microsoft Learn
Incremental refresh in Dataflow Gen2 - Microsoft Fabric | Microsoft Learn
To avoid this, use a different column for change detection if available (for example, a LastModifiedDate) and keep CreatedDate only for filtering. If such a column is not available, adjust the incremental range so the same time window is not reprocessed across runs.
Hope this helps. Please reach out for further assistance.
Thank you.
In my Case I don't have other Different Columns like LastModifiedDate. did you have Step by step guide to adjust the incremental range so the same time window is not reprocessed across runs in dataflow gen2.
Hi @Naveen96 ,
Thanks for the update.
I was able to reproduce the same behavior in my environment. Initially, I performed a full refresh which loaded all the data into the Lakehouse. After that, I enabled incremental refresh with a rolling window (past 1 day).
On teh next run, the same 1 day bucket was processed again, and since that data was already loaded during the full refresh, it resulted in duplicates.
So this is not caused by the bucket itself, but by doing a full refresh first and then enabling incremental refresh with an overlapping window.
To avoid this, incremental refresh needs to be enabled from the beginning so that the data is loaded without overlap.
Also to clarify, when incremental refresh is enabled from the beginning, it does not load full historical data automatically. It only loads data based on the configured range, and the bucket size (for example, Day) just defines how that data is partitioned and refreshed.
So if your requirement is to have both historical data and incremental loads, you can configure it in one go using incremental refresh itself. Set “Extract data from the past” to cover your required history (for example, 1 year or more), keep the appropriate bucket size (for example, Day), and run the dataflow. The first run will load that historical range, and subsequent runs will continue incrementally within that range.
The duplication happens specifically when a full refresh is done first and then incremental refresh is enabled with a rolling window that overlaps with already loaded data.
Hope this helps. Please reach out for further assistance.
Thank you.
Hi @Naveen96 ,
Just wanted to check if the response provided was helpful. If further assistance is needed, please reach out.
Thank you.
Hi @Naveen96 ,
Thanks for the update. Could you please consider marking the helpful reply as Accepted solution as that would help others find the solution quickly.
If you need any further assistance, please reach out.
Thank you.
Hi @Naveen96,
The issue is that Replace mode and Incremental Refresh are fundamentally incompatible in Dataflow Gen 2.
Here's what's happening under the hood :
Result : each refresh → delete everything → write only the last 1 day → either data loss or, depending on timing and retries, duplicates.
Fix : switch to Append mode
In your Data Destination settings, change the write method from Replace to Append. Incremental refresh is designed to work with Append it loads only new records and appends them to what's already in the Lakehouse table.
Make sure your incremental refresh filter (CreatedDate > last max value) is tight enough to avoid re-loading already-written records.
If you occasionally need a full reload, do it manually by temporarily disabling incremental refresh and running a one-off Replace refresh then switch back to Append + incremental.
Reference: https://learn.microsoft.com/en-us/fabric/data-factory/dataflow-gen2-incremental-refresh
Hope this unblocks you!
Hi @Tamanchu,
I have tried to Change the Write Method from Replace to Append but Append is not allowing to use Incremental Refresh in Dataflow Gen2. Does you have any Step-by-Step Guide to achive this in Dataflow Gen2 as I Can you other but I am strictly restricted to DataFlow gen2.
Hi @Naveen96,
You are right when you switch to Append mode, the built-in Incremental Refresh toggle becomes unavailable. This is by design. But you don't need it.
The approach I described earlier (incrementally amass data) replaces the built-in IR entirely. Here is the complete step-by-step for Dataflow Gen2 only:
Step 1 : Disable the built-in Incremental Refresh
Right-click your main query → Incremental Refresh → turn it off. You will not need it.
Step 2 : Create a watermark query (get the max date already loaded)
Inside the same dataflow, click Get Data → connect to your Lakehouse → select the destination table (the one you are writing to).
Once loaded:
Step 3 : Filter your source query using the watermark
Go back to your main source query. Add a filter step:
= Table.SelectRows(PreviousStep, each [CreatedDate] > LastLoadedDate)
Or via the UI: Filter Rows → CreatedDate is after → reference LastLoadedDate.
Step 4 : Set destination to Append
In Data Destination settings → change write method to Append.
Step 5 : Run
Each refresh will:
Read the current max date from your Lakehouse
Load only records newer than that date
Append them no duplicates, no data loss
On the very first run after setup, make sure your Lakehouse table already has the historical data (which you said it does).
References :
Hope this helps!
Check out the April 2026 Fabric update to learn about new features.
Sign up to receive a private message when registration opens and key events begin.
| User | Count |
|---|---|
| 11 | |
| 10 | |
| 6 | |
| 6 | |
| 5 |
| User | Count |
|---|---|
| 29 | |
| 16 | |
| 12 | |
| 10 | |
| 7 |