john_ach
Frequent Visitor

Incremental Refresh - Duplicates or Missing Updates

Hi,

 

I'm having an issue getting incremental refresh to behave as I'd expect, and I'm hoping you might be able to help me out.

 

I am building a set of reports on data from our Microsoft Dynamics PSA CRM tool. The desire is to have these reports show data as close to live as possible, meaning a scheduled refresh every half hour. To achieve that I need to have the refresh duration comfortably under 30 minutes.

 

My architecture / data pipeline looks like this: Dynamics -> SQL DB Export -> Dataflow (Extract) -> Dataflow (Transform 1) -> Dataflow (Transform 2) -> Dataset -> Report

 

In order to achieve the short (and efficient) refresh I have incremental refresh set up on the Dataflow (Extract) entities. The tables coming from Dynamics all have two datetime fields, createdon and modifiedon, which reflect when the record was created, and last modified. I have tried a combination of incremental refresh settings with those date fields and have not had the behaviour I would expect.

 

Desired incremental refresh behaviour:

  • All data kept for all time (e.g. not just last 5 years)
  • Rows reflect the original data source exactly
  • Only rows which have changed since the previous refresh are updated


Tested settings and results:

 

[A] No incremental refresh:

  • Dataflow reflects database exactly - Good
  • Full refresh slow - Bad


[B] Filter on createdon, store for 100 years, refresh from past 1 days, detect data changes on modifiedon:

  • Dataflow has same number of rows as database - Good
  • Dataflow does not include changes made in database - Bad


[C] Filter on modifiedon, store for 100 years, refresh from past 1 days, detect data changes on modifiedon:

  • Dataflow has duplicate entries for rows modified since previous refresh - Bad
  • Subsequent remove duplicates step required in dataflow - Bad
  • All database changes included in dataflow - Good


[D] Filter on createdon, store for 100 years, refresh from past 1 days, detect data changes off:

  • Dataflow has same number of rows as database - Good
  • Dataflow does not include changes made in database - Bad


[E] Filter on modifiedon, store for 100 years, refresh from past 1 days, detect data changes off:

  • Dataflow has duplicate entries for rows modified since previous refresh - Bad
  • Subsequent remove duplicates step required in dataflow - Bad
  • All database changes included in dataflow - Good


This behaviour doesn't match my understanding of what incremental refresh is intended to do. Could you please help me see my mistake(s) or misunderstanding, and give some guidance on the right datetime fields to use?

 

Thanks for your help

6 REPLIES
john_ach
Frequent Visitor

Hi - An update on the solution to my issue....

 

My mistake was in the columns I used to configure Incremental Refresh. This was due to my misunderstanding of how incremental refresh works - which I think could be improved across Microsoft documentation and others' tutorials/guides.

 

The correct combination is:

  1. Filter by record created date
  2. Detect changes on last modified date

This is because incremental refresh creates partitions (groups) of data using the value in the Filter (1) column (i.e. when that record was created, grouped into month-long partitions). It is this Filter (1) column which is used as the 'key' for incremental refresh (not the table's actual primary key). The maximum value of the detect changes column (2) is calculated for each partition; if that maximum changes, the entire partition is discarded and reloaded (by re-running the query filtered on the Filter (1) column). What it does not do is track individual records by their primary key.

 

My understanding of what happens when the data is refreshed:

  1. The whole Filter (1) column is loaded from the data source.
  2. The grouping is then applied, and a new maximum value of the detect changes column (2) is calculated per partition.
  3. For each partition, if the maximum value of the detect changes column (2) is different to the currently loaded data - then that partition is flagged to require a full refresh.
  4. Each flagged partition is then discarded.
  5. Each flagged partition is loaded from the data source using the Filter (1) column to load just the subset of data which meets that filter (e.g. data for July 2022).
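The steps above can be sketched in a few lines of Python. This is a hypothetical simulation of the mechanism, not the actual Power BI service code; the column names (createdon as the Filter (1) column, modifiedon as the detect changes (2) column) and month-long partitions follow the example in this thread:

```python
from collections import defaultdict

def month_key(dt):
    """Partition key: the (year, month) of the filter column (createdon)."""
    return (dt.year, dt.month)

def incremental_refresh(stored, source):
    """Simulate one incremental refresh cycle.

    stored / source: lists of dicts with 'id', 'createdon', 'modifiedon'.
    Returns the new stored data after the refresh.
    """
    # Group both copies into partitions by the filter column (createdon).
    def partition(rows):
        parts = defaultdict(list)
        for r in rows:
            parts[month_key(r["createdon"])].append(r)
        return parts

    stored_parts = partition(stored)
    source_parts = partition(source)

    result = []
    for key, src_rows in source_parts.items():
        old_rows = stored_parts.get(key, [])
        # Step 3: compare max(modifiedon) per partition.
        src_max = max(r["modifiedon"] for r in src_rows)
        old_max = max((r["modifiedon"] for r in old_rows), default=None)
        if old_max != src_max:
            # Steps 4-5: discard the whole partition and reload it from source.
            result.extend(src_rows)
        else:
            # Unchanged partition: keep the stored rows as-is.
            result.extend(old_rows)
    return result
```

Because createdon never changes, each record always lands in the same partition, so reloading a partition replaces the old copy of a modified record rather than adding a second one.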

This means that the value in the Filter (1) column should not change for a given record through its lifetime. If it does, that record will fall under more than one partition query over time, and if/when those partitions are refreshed the record could be duplicated in your loaded data.
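A small hypothetical Python sketch of that failure mode: when the mutable modifiedon column is used as the partitioning (filter) column and only recent partitions fall inside the refresh window, a modified record moves to a new partition while its stale copy survives in the old, unrefreshed one (column names and the `refresh_recent` helper are illustrative only):

```python
from collections import defaultdict
from datetime import datetime

def partition_by(rows, col):
    """Group rows into month-long partitions on the given column."""
    parts = defaultdict(list)
    for r in rows:
        parts[(r[col].year, r[col].month)].append(r)
    return parts

def refresh_recent(stored, source, col, since):
    """Refresh only partitions at or after `since` (e.g. 'past 1 day'),
    partitioning on `col`. Older partitions keep their stored rows."""
    stored_parts = partition_by(stored, col)
    source_parts = partition_by(source, col)
    cutoff = (since.year, since.month)
    result = []
    for key in set(stored_parts) | set(source_parts):
        if key >= cutoff:
            result.extend(source_parts.get(key, []))  # reloaded from source
        else:
            result.extend(stored_parts.get(key, []))  # stale copy kept
    return result
```

A record last modified in July sits in the July partition; once it is modified in September, the source returns it in the September partition, the July partition is outside the refresh window and keeps the stale copy, and the dataflow ends up with two rows for one record. This matches test results [C] and [E] above.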

 

Conversely, the detect changes column (2) must change whenever any field of the record changes through its lifetime. If it does not, then that record (the partition that record falls in) will not be flagged as changed, and won't be refreshed. You could, however, use a detect changes column (2) which is only updated when (for example) columns A, B or C change, provided those are the only columns you load into your data model.

 

Hope that helps!

 

Redayac
Frequent Visitor

Hi John,

Did you find a solution to operate incremental refresh without duplicates?

Thank you!

ptacquet
Frequent Visitor

Hi,

 

I have the same problem: using an update date for incremental refresh generates duplicate values.

The issue is that I only refresh the last 10 days, but a row's previous modification date can be much further in the past (more than one year).

 

Any advice? Why, during the second refresh, is the initial row not updated?

john_ach
Frequent Visitor

Hi,

 

Any suggestions as to where I'm going wrong, or is this expected behaviour?

 

Thanks for any help you can give

john_ach
Frequent Visitor

Hi,

 

Yes, I would like only data that has changed to be refreshed, while reflecting the contents of the original database exactly (no duplicates, and reflecting all changes).

 

I have configured incremental refresh as per the linked article. It doesn't specify which datetime field to use, or how the behaviour would change by using different datetime fields.

 

The example given in that article uses refresh on the createdon datetime field. That is [D] from my tests (original post) and results in the dataflow not reflecting changes made to the source database.

I have also tried with detect data changes on, via the modifiedon datetime field, which is [B] from my tests. That gives the same (incorrect) results.

 

Is this expected behaviour? And/or what settings should I be using?


Thanks for your help

v-yingjl
Community Support

Hi @john_ach ,

Only data that's changed needs to be refreshed under incremental refresh.

You can follow this document to set up incremental refresh for dataflows:

Configuring incremental refresh for dataflows 

 

Best Regards,
Community Support Team _ Yingjie Li
If this post helps, then please consider accepting it as the solution to help other members find it more quickly.
