john_ach
Frequent Visitor

Incremental Refresh - Duplicates or Missing Updates

Hi,

 

I'm having an issue getting incremental refresh to operate as I'd expect, and am hoping you might be able to help me out.

 

I am building a set of reports on data from our Microsoft Dynamics PSA CRM tool. The desire is to have these reports show data as close to live as possible, meaning a scheduled refresh every half hour. To achieve that I need to have the refresh duration comfortably under 30 minutes.

 

My architecture / data pipeline looks like this: Dynamics -> SQL DB Export -> Dataflow (Extract) -> Dataflow (Transform 1) -> Dataflow (Transform 2) -> Dataset -> Report

 

In order to achieve the short (and efficient) refresh I have incremental refresh set up on the Dataflow (Extract) entities. The tables coming from Dynamics all have two datetime fields, createdon and modifiedon, which reflect when the record was created, and last modified. I have tried a combination of incremental refresh settings with those date fields and have not had the behaviour I would expect.

 

Desired incremental refresh behaviour:

  • All data kept for all time (e.g. not just the last 5 years)
  • Rows reflect the original data source exactly
  • Only rows which have changed since the previous refresh are updated


Tested settings and results:

 

[A] No incremental refresh:

  • Dataflow reflects database exactly - Good
  • Full refresh slow - Bad


[B] Filter on createdon, store for 100 years, refresh from past 1 day, detect data changes on modifiedon:

  • Dataflow has same number of rows as database - Good
  • Dataflow does not include changes made in database - Bad


[C] Filter on modifiedon, store for 100 years, refresh from past 1 day, detect data changes on modifiedon:

  • Dataflow has duplicate entries for rows modified since previous refresh - Bad
  • Subsequent remove duplicates step required in dataflow - Bad
  • All database changes included in dataflow - Good


[D] Filter on createdon, store for 100 years, refresh from past 1 day, detect data changes off:

  • Dataflow has same number of rows as database - Good
  • Dataflow does not include changes made in database - Bad


[E] Filter on modifiedon, store for 100 years, refresh from past 1 day, detect data changes off:

  • Dataflow has duplicate entries for rows modified since previous refresh - Bad
  • Subsequent remove duplicates step required in dataflow - Bad
  • All database changes included in dataflow - Good


This behaviour doesn't seem correct, according to my understanding of the intent of incremental refresh. Could you please help me see my mistake(s) and misunderstanding, and give guidance on which datetime fields to use?

 

Thanks for your help

6 REPLIES
john_ach
Frequent Visitor

Hi - An update on the solution to my issue....

 

My mistake was in the columns I used to configure incremental refresh. This was due to my misunderstanding of how incremental refresh works, something I think could be explained more clearly in the Microsoft documentation and in others' tutorials/guides.

 

The correct combination is:

  1. Filter by record created date
  2. Detect changes on last modified date

This is because incremental refresh creates partitions (groups) of data using the value in the filter (1) column (i.e. when that record was created, grouped into month-long partitions). It is this filter (1) column which is used as the 'key' for incremental refresh (not the actual table primary key). The maximum value of the detect-changes column (2) is calculated for each partition; if that maximum changes, the entire partition is discarded and reloaded (by re-running the query filtered on the filter (1) column). What it does not do is track specific records using their primary key.

 

My understanding of what happens when the data is refreshed (see the code sketch after these steps):

  1. The filter (1) and detect-changes (2) columns are loaded from the data source.
  2. The partition grouping is then applied, and a new maximum value of the detect-changes column (2) is calculated per partition.
  3. For each partition, if the maximum value of the detect-changes column (2) differs from that of the currently loaded data, that partition is flagged as requiring a full refresh.
  4. Each flagged partition is then discarded.
  5. Each flagged partition is reloaded from the data source, using the filter (1) column to load just the subset of data which falls within that partition (e.g. data for July 2022).
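To make these steps concrete, here is a toy Python simulation of my mental model. To be clear, this is only an illustration under my assumptions, not Microsoft's actual implementation: I'm assuming month-long partitions, assuming partitions outside the 'refresh from past N' window are never touched, and all function and column names here are mine.

```python
from collections import defaultdict
from datetime import date

def month_key(d: date) -> str:
    """Partition key: month-long groups, as described above."""
    return f"{d.year}-{d.month:02d}"

def refresh(stored, source, filter_col, window_keys):
    """Simulate one incremental refresh.

    stored / source : lists of row dicts (the loaded dataflow vs. the database)
    filter_col      : the filter (1) column used to assign rows to partitions
    window_keys     : partitions inside the 'refresh from past N' window;
                      partitions outside the window are assumed never touched
    """
    def partitions(rows):
        parts = defaultdict(list)
        for row in rows:
            parts[month_key(row[filter_col])].append(row)
        return parts

    stored_parts, source_parts = partitions(stored), partitions(source)
    result = []
    for key in sorted(set(stored_parts) | set(source_parts)):
        old_rows = stored_parts.get(key, [])
        if key not in window_keys:
            result.extend(old_rows)          # outside the window: keep as-is
            continue
        src_rows = source_parts.get(key, [])
        # 'Detect data changes': compare max(modifiedon) per partition.
        src_max = max((r["modifiedon"] for r in src_rows), default=None)
        old_max = max((r["modifiedon"] for r in old_rows), default=None)
        if src_max != old_max:
            result.extend(src_rows)          # flagged: discard and reload
        else:
            result.extend(old_rows)          # unchanged: keep stored rows
    return result

# The correct combination: filter on createdon, detect changes on modifiedon.
stored = [{"id": 1, "createdon": date(2022, 7, 1), "modifiedon": date(2022, 7, 1)}]
source = [{"id": 1, "createdon": date(2022, 7, 1), "modifiedon": date(2022, 8, 15)}]
print(refresh(stored, source, "createdon", {"2022-07", "2022-08"}))
# -> one row carrying the new modifiedon: updated in place, no duplicate
```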

This means that the value in the filter (1) column should not change for a given record through its lifetime. If it does, that record will fall under more than one partition query through its lifetime, and if/when those partitions are refreshed, the record could be duplicated in your loaded data.
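Reusing refresh() from the sketch above, this reproduces the duplicate behaviour from tests [C]/[E] in my original post: with partitions keyed on modifiedon, a modified row moves into a new partition, while its old copy sits in a partition outside the refresh window and is never cleaned up.

```python
# Filter on modifiedon instead (tests [C]/[E]): the modified row lands in a
# new partition, but its stale copy survives in an untouched old partition.
stored = [{"id": 1, "createdon": date(2022, 7, 1), "modifiedon": date(2022, 7, 1)}]
source = [{"id": 1, "createdon": date(2022, 7, 1), "modifiedon": date(2022, 8, 15)}]
print(refresh(stored, source, "modifiedon", {"2022-08"}))
# -> two rows for id 1: the stale July copy plus the reloaded August copy
```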

 

The detect-changes column (2) must change whenever any field of the record changes through its lifetime. If it does not, then the partition that record falls in will not be flagged as changed, and won't be refreshed. (It follows that you could use a detect-changes column (2) which is only updated when, for example, columns A, B or C change, if those are the only columns you load into your data model.)
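Again reusing the sketch above, here is that failure mode: an edit that does not bump the detect-changes column leaves max(modifiedon) unchanged, so the partition is never flagged and the dataflow keeps the stale row.

```python
# An edit that does not bump modifiedon is invisible to detect data changes.
stored = [{"id": 1, "createdon": date(2022, 7, 1),
           "modifiedon": date(2022, 7, 1), "amount": 100}]
source = [{"id": 1, "createdon": date(2022, 7, 1),
           "modifiedon": date(2022, 7, 1), "amount": 250}]
print(refresh(stored, source, "createdon", {"2022-07"}))
# -> still amount=100: the partition was never flagged for reload
```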

 

Hope that helps!

 

Redayac
Frequent Visitor

Hi John,

Did you find a solution to run incremental refresh without duplicates?

Thank you!

ptacquet
Frequent Visitor

Hi,

 

I have the same problem: using an update date for incremental refresh generates duplicate values.

The problem is that I only refresh the last 10 days, but a row's previous modification date can be much further in the past (more than one year).

Any advice? Why is the initial row not updated during the second refresh?

john_ach
Frequent Visitor

Hi,

 

Any suggestions as to where I'm going wrong, or is this expected behaviour?

 

Thanks for any help you can give

john_ach
Frequent Visitor

Hi,

 

Yes, I would like only data that has changed to be refreshed, while reflecting the contents of the original database exactly (no duplicates, and reflecting all changes).

 

I have configured incremental refresh as per the linked article. It doesn't specify which datetime field to use, or how the behaviour would change by using different datetime fields.

 

The example given in that article filters on the createdon datetime field. That is [D] from my tests (original post) and results in the dataflow not reflecting changes made to the source database.

I have also tried with detect data changes on, via the modifiedon datetime field, which is [B] from my tests. That gives the same (incorrect) results.

 

Is this expected behaviour? And/or what settings should I be using?


Thanks for your help

v-yingjl
Community Support

Hi @john_ach ,

Only data that's changed needs to be refreshed under incremental refresh.

You can follow this document to set up incremental refresh for dataflows:

Configuring incremental refresh for dataflows 

 

Best Regards,
Community Support Team _ Yingjie Li
If this post helps, then please consider accepting it as the solution to help the other members find it more quickly.
