john_ach
Frequent Visitor

Incremental Refresh - Duplicates or Missing Updates

Hi,

 

I'm having an issue getting incremental refresh to operate as I'd expect, and am hoping you might be able to help me out.

 

I am building a set of reports on data from our Microsoft Dynamics PSA CRM tool. The desire is to have these reports show data as close to live as possible, meaning a scheduled refresh every half hour. To achieve that I need to have the refresh duration comfortably under 30 minutes.

 

My architecture / data pipeline looks like this: Dynamics -> SQL DB Export -> Dataflow (Extract) -> Dataflow (Transform 1) -> Dataflow (Transform 2) -> Dataset -> Report

 

In order to achieve the short (and efficient) refresh I have incremental refresh set up on the Dataflow (Extract) entities. The tables coming from Dynamics all have two datetime fields, createdon and modifiedon, which reflect when the record was created, and last modified. I have tried a combination of incremental refresh settings with those date fields and have not had the behaviour I would expect.

 

Desired incremental refresh behaviour:

  • All data kept for all time (e.g. not just last 5 years)
  • Rows reflect the original data source exactly
  • Only rows which have changed since the previous refresh are updated


Tested settings and results:

 

[A] No incremental refresh:

  • Dataflow reflects database exactly - Good
  • Full refresh slow - Bad


[B] Filter on createdon, store for 100 years, refresh from past 1 days, detect data changes on modifiedon:

  • Dataflow has same number of rows as database - Good
  • Dataflow does not include changes made in database - Bad


[C] Filter on modifiedon, store for 100 years, refresh from past 1 days, detect data changes on modifiedon:

  • Dataflow has duplicate entries for rows modified since previous refresh - Bad
  • Subsequent remove duplicates step required in dataflow - Bad
  • All database changes included in dataflow - Good


[D] Filter on createdon, store for 100 years, refresh from past 1 days, detect data changes off:

  • Dataflow has same number of rows as database - Good
  • Dataflow does not include changes made in database - Bad


[E] Filter on modifiedon, store for 100 years, refresh from past 1 days, detect data changes off:

  • Dataflow has duplicate entries for rows modified since previous refresh - Bad
  • Subsequent remove duplicates step required in dataflow - Bad
  • All database changes included in dataflow - Good


This behaviour doesn't seem correct, according to my understanding of the intention of incremental refresh. Could you please help me see my mistake(s) or misunderstanding, and give guidance on which datetime fields to use?

 

Thanks for your help

6 REPLIES
john_ach
Frequent Visitor

Hi - An update on the solution to my issue....

 

My mistake was in the columns I used to configure Incremental Refresh. This was due to my misunderstanding of how incremental refresh works - which I think could be improved across Microsoft documentation and others' tutorials/guides.

 

The correct combination is:

  1. Filter by record created date
  2. Detect changes on last modified date

This is because incremental refresh creates partitions (groups) of data using the value in the Filter (1) column (i.e. when the record was created, grouped into month-long partitions). It is this Filter (1) column which is used as the 'key' for incremental refresh (not the table's actual primary key). The maximum value of the Detect changes column (2) is calculated for each partition; if that maximum changes, the entire partition is discarded and reloaded (by re-running the query filtered on the Filter (1) column). What it does not do is track specific records by their primary key.

 

My understanding is that when the data is refreshed:

  1. The whole Filter (1) column is loaded from the data source.
  2. The grouping is then applied, and a new maximum value of the detect changes column (2) is calculated per partition.
  3. For each partition, if the maximum value of the detect changes column (2) is different to the currently loaded data - then that partition is flagged to require a full refresh.
  4. Each flagged partition is then discarded.
  5. Each flagged partition is loaded from the data source using the Filter (1) column to load just the subset of data which meets that filter (e.g. data for July 2022).
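The steps above can be sketched as a small simulation. This is not Power BI's actual implementation — it is a hypothetical Python model, assuming month-long partitions keyed on createdon (Filter column) with modifiedon as the Detect changes column:

```python
from collections import defaultdict

def month_key(iso_date):
    """Partition key: group records by the month of the Filter (1) column."""
    return iso_date[:7]  # 'YYYY-MM' from an ISO date string

def incremental_refresh(source_rows, stored, stored_max_modified):
    """Simulate one incremental refresh cycle over month partitions.

    source_rows         -- the data source: dicts with 'id', 'createdon', 'modifiedon'
    stored              -- partition key -> currently loaded rows
    stored_max_modified -- partition key -> max modifiedon seen at the last refresh
    """
    # Steps 1-2: scan the Filter (1) column, group rows into partitions,
    # and compute the new max of the Detect changes (2) column per partition.
    by_partition = defaultdict(list)
    new_max = {}
    for row in source_rows:
        key = month_key(row["createdon"])      # Filter (1) defines the partition
        by_partition[key].append(row)
        m = row["modifiedon"]                  # Detect changes (2) column
        if key not in new_max or m > new_max[key]:
            new_max[key] = m

    # Steps 3-5: flag any partition whose max(modifiedon) changed,
    # discard it, and reload just that partition from the source.
    for key, rows in by_partition.items():
        if new_max[key] != stored_max_modified.get(key):
            stored[key] = [dict(r) for r in rows]  # full reload of this partition only
            stored_max_modified[key] = new_max[key]
    return stored, stored_max_modified
```

Note that changes are tracked per partition, not per record: modifying one July 2022 record causes the whole 2022-07 partition (and only that partition) to be reloaded.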

This means that the value in the Filter (1) column should not change for a given record during its lifetime. If it does, that record will fall under more than one partition query over time, and if/when those partitions are refreshed, the record could be duplicated in your loaded data.
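To illustrate why filtering on a changing column produces duplicates (cases [C] and [E] in the original post), here is a hypothetical sketch: with a "refresh last N days" policy, only recent partitions are re-queried, so when a record is re-modified its fresh copy lands in a recent partition while the older partition, outside the refresh window, keeps its stale copy:

```python
def refresh_recent_partitions(source_rows, stored, window_keys):
    """Incremental refresh filtered on modifiedon: only partitions inside the
    refresh window are discarded and reloaded; older partitions are untouched."""
    for key in window_keys:
        # Reload this partition: source rows whose modifiedon falls in that month
        stored[key] = [dict(r) for r in source_rows if r["modifiedon"][:7] == key]
    return stored

# A record originally modified in June sits in the '2022-06' partition.
stored = {"2022-06": [{"id": 1, "modifiedon": "2022-06-01", "value": "old"}]}

# The same record is re-modified in July, so the source holds one row for id 1.
source = [{"id": 1, "modifiedon": "2022-07-15", "value": "new"}]

# Only the recent (July) partition is inside the refresh window.
stored = refresh_recent_partitions(source, stored, ["2022-07"])

# The stale June copy survives alongside the fresh July copy: id 1 appears twice.
all_rows = [r for part in stored.values() for r in part]
```

Filtering on an immutable column like createdon avoids this, because a record can never migrate between partitions.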

 

The Detect changes column (2) must change whenever any field of the record changes during its lifetime. If it does not, then the partition that record falls in will not be flagged as changed, and won't be refreshed. It would also make sense to use a Detect changes column (2) that is only updated when, for example, columns A, B or C change, if those are the only columns you load into your data model.

 

Hope that helps!

 

Redayac
Frequent Visitor

Hi John,

Did you find a solution to operate incremental refresh without duplicates?

Thank you!

ptacquet
Frequent Visitor

Hi,

 

I have the same problem using an update date for incremental refresh, which generates duplicate values.

The problem is that I only refresh the last 10 days, but a row's previous modification date could be much further in the past (more than a year).

 

Any advice? Why is the initial row not updated during the second refresh?

john_ach
Frequent Visitor

Hi,

 

Any suggestions as to where I'm going wrong, or is this expected behaviour?

 

Thanks for any help you can give

john_ach
Frequent Visitor

Hi,

 

Yes, I would like only data that has changed to be refreshed, while reflecting the contents of the original database exactly (no duplicates, and reflecting all changes).

 

I have configured incremental refresh as per the linked article. It doesn't specify which datetime field to use, or how the behaviour would change by using different datetime fields.

 

The example given in that article uses refresh on the createdon datetime field. That is [D] from my tests (original post) and results in the dataflow not reflecting changes made to the source database.

I have also tried with detect data changes on, via the modifiedon datetime field, which is [B] from my tests. That gives the same (incorrect) results.

 

Is this expected behaviour? And/or what settings should I be using?


Thanks for your help

v-yingjl
Community Support

Hi @john_ach ,

Only data that's changed needs to be refreshed under incremental refresh.

You can follow this document to set up incremental refresh for dataflows:

Configuring incremental refresh for dataflows 

 

Best Regards,
Community Support Team _ Yingjie Li
If this post helps, then please consider accepting it as the solution to help other members find it more quickly.
