I'm trying to implement an incremental refresh in Power BI, but it's failing because of duplicate values on the one side of a one-to-many table relationship. I need to retain all of the data from the previous two years, and then append incremental updates to that data.
The data source is an API call, which takes date ranges as parameters; the maximum date range the API accepts is six months. To make this work, I've created a table that contains the list of dates for refresh and the corresponding date parameters, shown below (to be clear, the "Today +/- x" values are actually written in YYYY-MM-DD format; a Power Query sketch of one way to build the daily rows follows the table):
Date | dtFrom | dtTo |
---|---|---|
Today - 3 | 2024-07-01T00:00:00 | 2024-12-31T23:59:59 |
Today - 2 | 2025-01-01T00:00:00 | 2025-06-30T23:59:59 |
Today - 1 | 2025-07-01T00:00:00 | 2025-12-31T23:59:59 |
Today | TodayT00:00:00 | TodayT23:59:59 |
Today + 1 | Today - 1T00:00:00 | Today - 1T23:59:59 |
Today + 2 | Today - 2T00:00:00 | Today - 2T23:59:59 |
Today + 3 | Today - 3T00:00:00 | Today - 3T23:59:59 |
Today + 4 | Today - 4T00:00:00 | Today - 4T23:59:59 |
Today + 5 | Today - 5T00:00:00 | Today - 5T23:59:59 |
Today + 6 | Today - 6T00:00:00 | Today - 6T23:59:59 |
Today + 7 | Today - 7T00:00:00 | Today - 7T23:59:59 |
Today + 8 | Today - 8T00:00:00 | Today - 8T23:59:59 |
Today + 9 | Today - 9T00:00:00 | Today - 9T23:59:59 |
Today + 10 | Today - 10T00:00:00 | Today - 10T23:59:59 |
Today + 11 | Today - 11T00:00:00 | Today - 11T23:59:59 |
Today + 12 | Today - 12T00:00:00 | Today - 12T23:59:59 |
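For illustration, here is a minimal sketch of how the daily rows of a table like this could be generated in Power Query; the step and column names are just for the example, and the three hard-coded half-year rows at the top would be appended separately:

```
// Hedged sketch only: builds the "Today + n" partition-key rows and their one-day API ranges.
let
    TodayDate = Date.From(DateTime.LocalNow()),
    // One row per offset n = 0..12: Date holds Today + n (the partition key),
    // while dtFrom/dtTo describe the one-day window Today - n, as in the table above.
    DailyRows = List.Transform(
        {0..12},
        (n) =>
            [
                Date   = DateTime.From(Date.AddDays(TodayDate, n)),
                dtFrom = Date.ToText(Date.AddDays(TodayDate, -n), "yyyy-MM-dd") & "T00:00:00",
                dtTo   = Date.ToText(Date.AddDays(TodayDate, -n), "yyyy-MM-dd") & "T23:59:59"
            ]
    ),
    AsTable = Table.FromRecords(DailyRows, type table [Date = datetime, dtFrom = text, dtTo = text])
in
    AsTable
```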
I transform the Date column to type datetime and the dtFrom and dtTo columns to type text.
I created a function that calls the API and takes Date as its input variable, and then I apply that function to the table above.
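A simplified sketch of what the function and its row-by-row invocation look like (the endpoint URL, resource path, query parameter names, and step names here are placeholders, not the real ones; the real function takes Date as its input, while this sketch passes the precomputed dtFrom/dtTo values):

```
// call_API: hedged sketch only; the real endpoint, authentication, and parameter names differ.
(dtFrom as text, dtTo as text) =>
let
    Source = Json.Document(
        Web.Contents(
            "https://api.example.com",                // hypothetical base URL
            [
                RelativePath = "users",               // hypothetical resource path
                Query = [ from = dtFrom, to = dtTo ]  // hypothetical parameter names
            ]
        )
    )
in
    Source
```

Applying it to each row of the parameter table then looks something like this (previous step name is a placeholder):

```
#"Call API function" = Table.AddColumn(#"Changed Type", "Call API function", each call_API([dtFrom], [dtTo]))
```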
Next, I filter the function table using the RangeStart (Today - 3) and RangeEnd (Today) parameters:
#"Filter dates" = Table.SelectRows(#"previous step", each [Date] >=RangeStart and [Date] < RangeEnd
This leaves me with the following table:
Date | dtFrom | dtTo | Call API function |
---|---|---|---|
Today - 3 | 2024-07-01T00:00:00 | 2024-12-31T23:59:59 | call_API |
Today - 2 | 2025-01-01T00:00:00 | 2025-06-30T23:59:59 | call_API |
Today - 1 | 2025-07-01T00:00:00 | 2025-12-31T23:59:59 | call_API |
After expanding the results of the function call, I remove duplicate records based on the value of the user_id field:
#"Remove duplicate users" = Table.Distinct(#"previous step", {"user_id"})
For the incremental refresh, I set the following values:
When I run a refresh in Power BI Desktop, there is no issue with duplicate IDs, which I expect, since the Desktop refresh is still just using the date ranges in the above table. However, when I publish the report to the service, the refresh fails with the message:
table contains duplicate user_id xxxxxxxx, which is not allowed in the one side of a one-to-many relationship.
My assumption is that this is happening because the incremental data includes user IDs from the previous two years. (If someone updates their user profile, a new record is added to the user table with all of the same fields, including both the unchanged field values and any updated ones.) Duplicate user IDs are removed from the "master" table and from the incremental data as part of the refresh, but the remove-duplicates step is only applied to each of those tables separately; it is not (and cannot be) applied to the fully updated table, i.e., the "master" table with the incremental data appended, so duplicate IDs end up in the full table.
If that assumption is correct, is there a way to remove duplicate IDs after the incremental data has been appended to the "master" table? Or could there be another issue that I should try to resolve?
Hi @AnnOminous ,
Thank you @lbendlin for your input. You're absolutely right: Power BI incremental refresh is designed for append-only scenarios based on an immutable datetime column, such as CreatedDate. It's not intended to handle differential updates or Slowly Changing Dimension Type 2 (SCD2) scenarios, where rows may change over time and history is retained using a LastModifiedDate.
In this case, since duplicate user_ids are being introduced due to updated records appearing in new partitions, the only reliable solution is to reprocess all partitions that may be affected by those changes. This ensures that deduplication logic can be applied across the entire dataset, not just within each partition.
If full reprocessing isn't feasible or efficient, a common workaround is to:
1. Load all user records, including duplicates, into your Users table.
2. Use a DAX calculated table (or a virtual relationship) to filter for only the latest record per user_id, based on a ModifiedDate or EffectiveDate field (a sketch follows below).

This way, you maintain the integrity of incremental refresh while ensuring that only the latest version of each user is used in your model relationships and visuals.
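For example, a minimal sketch of such a calculated table, assuming your Users table has user_id and ModifiedDate columns (the table and column names are illustrative; adjust them to your model):

```
LatestUsers =
// Keep only the most recent row per user_id. Ties on ModifiedDate would still
// produce duplicates, so the modified-date value should be unique per user.
FILTER (
    Users,
    Users[ModifiedDate]
        = CALCULATE (
            MAX ( Users[ModifiedDate] ),
            ALLEXCEPT ( Users, Users[user_id] )
        )
)
```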
Hope this helps.
Best Regards,
Chaithra E.
Hi @AnnOminous ,
Yes, reprocessing every partition is effectively the same as performing a full refresh of the table, ensuring that your deduplication logic is applied across the entire dataset, not just within individual partitions.
If you're using Power BI Premium or Premium Per User (PPU) and have XMLA endpoints enabled, you can manage this through SQL Server Management Studio (SSMS) without needing to republish the dataset. In SSMS, you can connect to your Power BI workspace, navigate to the relevant table, and use the "Process Full" option to reprocess all partitions. This reloads the data for the entire table and allows your Power Query steps, including deduplication, to be applied across all of it.
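For reference, SSMS scripts this operation as a TMSL refresh command, which you can also run directly against the XMLA endpoint; the database and table names below are placeholders:

```json
{
  "refresh": {
    "type": "full",
    "objects": [
      {
        "database": "MySemanticModelName",
        "table": "Users"
      }
    ]
  }
}
```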
So while the logic for deduplication itself must still exist in Power Query or be handled via DAX, SSMS can be used to trigger the full reprocessing when needed.
Thanks,
Chaithra E.
@v-echaithra, thank you for the additional information. When you say that the only reliable solution is to reprocess every partition, that's effectively saying that I need to do a full refresh of that table and apply the deduplication step to the refreshed data, correct? Or is it something I could manage in, say, SSMS?
It's called "Incremental" refresh for a reason. It is neither Differential Refresh nor is it supporting SCD2.
The partitions have to be based on datetime or date-integer (yyyymmdd) values, and those values need to filter an immutable date field. So "ID created" is good; "ID last modified" is bad.
To deduplicate - in your scenario - you would have to reprocess all affected partitions.