Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Find everything you need to get certified on Fabric—skills challenges, live sessions, exam prep, role guidance, and more. Get started

Reply
DebbieE
Community Champion
Community Champion

Premium Incremental Refresh Detect data changes (How does it identify the correct row?)

Doing Incremental Refresh in Premium

I have a last update date in my data. When I detect data changes set to on, it checks the row and if this date has changed then the record is updated. 

However how does it know that the record in Power BI and the record coming in is the same record?

 

I have searched the documentation and cant find anything on this. Im assuming its the Primary key of the record or the key ? However this isnt set anywhere tht I can see

 

What about when you are loading in the fact table? The fact table contains foreign keys. Basically how doesthe incremental refresh know that this is the row that replaces the row in Power BI?

https://docs.microsoft.com/en-us/power-bi/service-premium-incremental-refresh#detect-data-changes Doesnt seem to answer this question

28 REPLIES 28
gmelhaff
Advocate I
Advocate I

Fascinating conversation.  Coworker and myself have been beating our heads against the wall trying to figure out how this works - its way more complicated than necessary in my opinion.  But....I think the short answer to the original question  (which never got directly addressed),= is it doesn't need to know what the unique key is to find the same record.  And that's because it just reloads the entire partition involved in a change.  Just deletes by partition, not by key value. 

 

But then lot of this conversation is making assumptions based on documentation rather than actual data tests.  It's unfortunate that Microsoft isn't very revealing of the internals on how this works.  And note...if you use the "Get the latest data in real time with DirectQuery" option then it won't load anything for the current partition - instead it's going against your source for the most recent data.  

 

Our tests did this...our data goes back 50 years and can change at any time point in history.  The reason is we merge customer records and re-arrange households and golden records on a daily basis so all of history can get a new key value in our fact tables.  So our refresh and archive settings are 50 years.  Detect data changes is turned on and Get latest data in real time and only refresh complete years is turned off.  We set the refresh column to modified timestamp column and setup the parm/filter to business date timestamp such as an order date (converted to time via DAX function) which of course becomes the basis for the partition ranges. 

We then updated the modified timestamp to getdate() for specific orders in history.  We then profiled the SQL during refresh after waiting a couple of minutes.  What we observed was first a query to get max(modifiedtime) for every partition.  Then it does a partition range query for every partition where there was a newer modified timestamp.  So how it behaves like what was said here.   

 

The one exception was...when I also modified the date which is the basis for the partition so it should move partitions (such as order date) then nothing happened during refresh. I was hoping it was smart enough to rebuild both partitions involved but instead it did nothing at all.  My assumption is it gets confused if the partitioning date changes (which they say isn't supposed to).  The solution to this will be just knowing on a daily basis the data will leak these kind of changes and say every weekend do complete refresh to get these leaks resolved.

GilbertQ
Super User
Super User

Hi there

From my understanding everything is based off a dateTime column. This includes when detecting the changes.

When the incremental refresh runs, it will look at the refresh policy as well as at the detect changes column and which ever data fits into the refresh policy it will then go and get those new rows based on the dateTime of your data.




Did I answer your question? Mark my post as a solution!

Proud to be a Super User!







Power BI Blog

DebbieE
Community Champion
Community Champion

This is what Im not understanding at all. You cant just base it on LastUpdatedate

 

Say you have the following in Power BI

 

Key    ID     Surname     lastUpdatedate

1       24     Smyth          01/09/2019

2       48      Jonis          01/09/2019

 

And you upload this data

Key    ID     Surname     lastUpdatedate

1       24     Smith          21/09/2019

2       48      Jones          21/09/2019

 

You cant overwrite the correct record simply using the LastUpdatedate. There doesnt appear to be any information on this anywhere but If it was simply down to LastUpdateDate then I dont understand how you can trust Incremental processing in Power BI?

 

Surely you have to supply more details like IDs or something else?

Hi @DebbieE 

 

When you are configuring your Power BI Incremental Refreshing you can set it on the following screen to detect data changes

 

image.png

Here are more details

https://docs.microsoft.com/en-us/power-bi/service-premium-incremental-refresh#detect-data-changes





Did I answer your question? Mark my post as a solution!

Proud to be a Super User!







Power BI Blog

nickyvv
Community Champion
Community Champion

Hi @DebbieE, I understand you concerns, but let me clarify:

 

Because you speak about a fact table I assume you have basic knowledge of a DWH scenario. Otherwise please let me know and I can explain this further.

In a DWH you would have a Slowly Changing Dimension (type 2 for example) on the fact table usually with valid_from and valid_to columns (or similar). When processing changes these columns get updated with new values and new rows are inserted when necessary.

In Power BI we have the RangeStart and RangeEnd parameters, which you define on a column in your (fact) table. An important difference however: this parameter is NOT defined on the SCD-2 columns in your fact table, but rather on a functional date column like OrderDate or TransactionDate. Incremental refresh isn't designed to support cases where the filtered date column is updated in the... 

After defining your parameters and filtering on them in your model, the ranges you set in the refresh policy are working as follows:

  • You define a range to store the amount of rows (e.g. 5 years of data)
  • you define a refresh range to refresh to amount of days/months/years (e.g. refresh rows in the last 10 days)

The important clarification is this: the refresh will do a delete-insert rather than an update of the last 10 days of data (based on your parameters defined on the functional date column. So this means: the whole set of 10 days is deleted from your dataset, and refreshed with the new rows of 10 days worth of data.

From the documentation:

An update is interpreted as an insertion and a deletion, not an actual update. If the deletion occurs in the historical range and not the incremental range, it won’t get picked up. This can cause data refresh failures due to partition-key conflicts. 

 

I hope this clarifies your concern a bit. If you have any questions don't hesitate to ask them!

 

Did this help you or did I answer your question?
Then please give kudos or mark my post as a solution!
My blog: nickyvv.com
Twitter: @NickyvV



Did I answer your question? Mark my post as a solution!

Blog: nickyvv.com | @NickyvV


DebbieE
Community Champion
Community Champion

Im so sorry Im really really struggling to understand this concept.

Going from the beginning

Switch Incremental Refresh policy on in Options

Power Query Editor. Add RangeStart and RangeEnd with dates for that will be overwritten so it doesnt matter what you put here

Q What constitutes a range here? Im not sure what this means

Filter your data in desktop using RangeStart and RangeEnd so you dont have a massive model within your desktop

Set your Incremental Refresh policy

Storing 60 months of data (LastUpdateDate is in the last 60 months)

Refreshing 10 days of data where LastUpdateDate is in that 10 days

Todays date is the 28th January

All the data was loaded Q Do you have to do this before you set up the inremental refresh?

 

In Power BI

Key    ID     Surname     lastUpdatedate

1       24     Smyth          15/01/2020

2       48      Jonis          15/01/2029

 

And you upload this data

Key    ID     Surname     lastUpdatedate

1       24     Smith          27/09/2019

2       48      Jones          27/09/2019

 

What would happen in the above situation?

 

It says that anything in the last 10 days gets deleted. And anything in the last 10 days coming in gets loaded. But because the last updateDate in Power BI is over 10 days old, surely these will remain and the new 2 records will be loaded meaning that you now have duplicates?

 

How does the incremental refresh know to delete Where Key = 1 and then Add Where Key = 1 if its just based on the date?

 

I dont really understand how RangeStart and RangeEnd fits with what Im doing either at the moment. 

And because I dont understand the basics Detect data changes has really confused me

 

 

nickyvv
Community Champion
Community Champion

@DebbieE no problem, let me try to clarify.

I'll start at your last question, because I think that holds the most important point/answer.

In the Filter date column updates it says:

Incremental refresh isn't designed to support cases where the filtered date column is updated in the source system.

So the scenario you sketch is not possible with Incremental Refresh, because your LastUpdateDate changes. In the example an OrderDate is chosen, which typically never changes for a given order.

 

Regarding your other question:

The first refresh operation in the service loads historical data (up to the 60 months of data in your example). Subsequent refreshes are incremental and loads 10 days of data where LastUpdateDate is in that 10 days.

 

So you'll have to choose a column in your fact table that holds a (functional) date that does not change for a given fact.

Does this clear things up?



Did I answer your question? Mark my post as a solution!

Blog: nickyvv.com | @NickyvV


DebbieE
Community Champion
Community Champion

I’m so sorry I’m still asking question on this but Im still very confused

 

With this example

 

Date of Upload 28/01/2020

In Power BI already

Key    ID     Surname          OrderDate         lastUpdatedate

1       24     Smyth              15/01/2020         15/01/2020

2       48      Jonis               15/01/2020         15/01/2020

3       67      Ellis                15/01/2020          15/01/2020

4       81      Truman          15/01/2020          15/01/2020

 

And you upload this data

Key    ID     Surname     OrderDate              lastUpdatedate

1       24     Smith          15/01/2020              27/09/2019

2       48      Jones          15/01/2020             27/09/2019

3       67      Ellis                15/01/2020          15/01/2020

4       81      Truman          15/01/2020          15/01/2020

 

RangeStart and RangeEnd has already been added and the filter set in desktop because we have a lot of data in this table. It was set up on LastUpdateDate because this makes sense

 

However

 

Set the Incremental Refresh policy now has "LastUpdateDate in the StoreRowsWhereColumn LastUpdateDate is in the last …….."

 

So it must be taking this column information from the filter I set? Is this correct?

 

So your saying I need to replace the filter and swap it to Order date because it doesn’t change?

Then incremental refresh wont work for this standard use case because, If I set Order date to last 10 days. It now wont delete Key 1 and 2 or readd because they are over 10 days old.

 

This means I have to set the Refresh Rows for a much longer time period. OK that bit makes sense.

 

But what is the point of the LastUpdateDate then. Surely these fields always change on rows that can be updated?

 

Also this still makes no sense of how detect data changes works.

How does it know that Key 1 has changed and does need deleting just based on Date

DebbieE
Community Champion
Community Champion

Ahhhhhhhh I think it just makes more sense apart from a couple of things

 

You use Order date because it doesnt change to delete or readd records (So make sure you understand how far back your records can change)

 

And LastUpdateDate is used to detect data changes in the record (It just checks that field has changed. However again Im still unsure how it understands that the row with Key1 is the same as the new row with Key 1 unless it uses key. So this Im still unsure of

DebbieE
Community Champion
Community Champion

Still desperately looking for the last piece to the puzzle with this....

 

And LastUpdateDate is used to detect data changes in the record (It just checks that field has changed). However again Im still unsure how it understands that the row with Key1 is the same as the new row with Key 1 unless it uses key. So this Im still unsure of

Hi there

It is all based on the DateTime on the column that is used to Detect the data changes.

It has got nothing to do with the Key




Did I answer your question? Mark my post as a solution!

Proud to be a Super User!







Power BI Blog

DebbieE
Community Champion
Community Champion

I keep being told  the same thing about the date but I still don’t understand.  it needs more information to be properly understood.

 

I’m just not in a position to explain how this works or say that it does work as we want it too at the moment.

Date of Upload 28/01/2020

In Power BI already

Key    ID     Surname          OrderDate         lastUpdatedate

1       24     Smyth              15/01/2020         15/01/2020

2       48      Jonis               15/01/2020         15/01/2020

3       67      Ellis                15/01/2020          15/01/2020

4       81      Truman          15/01/2020          15/01/2020

 

And you upload this data

Key    ID     Surname     OrderDate              lastUpdatedate

1       24     Smith          15/01/2020              27/09/2019

2       48      Jones          15/01/2020             27/09/2019

3       67      Ellis                15/01/2020          15/01/2020

4       81      Truman          15/01/2020          15/01/2020

 

Without detect data changes ticked all these 4 records would be deleted and then reloaded because the date is within our range

With detect data changes ticked

1       24     Smith          15/01/2020              27/09/2019

 

How does it know that this specific row has changed? Clearly last time LastUpdateDate was 15/01/2020 and now it’s the 27/01/2020 but lots of rows have also been updated. How does it know that that specific row has changed without using the ID? 

 

Usually the logic for updates, new records etc is quite complex and I cant understand how this just looks at date for detecting data changes

Hi there

All that it does is it knows that your last updatedDate, it then looks for rows where this is newer than the last date and updates those rows?




Did I answer your question? Mark my post as a solution!

Proud to be a Super User!







Power BI Blog

DebbieE
Community Champion
Community Champion

Im clearly never going to understand this. It just doesnt make sense to me

 

In Power BI

lastUpdatedate

15/01/2020

15/01/2020

15/01/2020

15/01/2020

 

And you upload this data

lastUpdatedate

27/09/2019

27/09/2019

15/01/2020

15/01/2020

27/09/2020

27/09/2020

 

Just using this date there is simply no way of knowing which record has changed unless you take the ID into account

 

 

 

nickyvv
Community Champion
Community Champion

@DebbieE no problem, let me try to explain it.

First, your example is a bit strange because the LastUpdateDate changes from 15/01/2020 to 27/09/2019?

 

I think it works this way, and the crux is I think in the following: The maximum value of this column [LastUpdateDate] is evaluated for each of the periods in the incremental range (from the docs).

So for every X days (where X is defined by the option: Refresh rows in the last ... days), the maximum value of LastUpdateDate is compared to the current value. If it is higher, this whole day of data is deleted and updated in the dataset. So there is no notion of keys or rows whatsoever in the dataset. It only knows which day it is (by the OrderDate column) and knows if the refresh it by the LastUpdateDate column.

Does that make sense? 🙂

 

Did this help you or did I answer your question?
Then please give kudos or mark my post as a solution!
My blog: nickyvv.com
Twitter: @NickyvV



Did I answer your question? Mark my post as a solution!

Blog: nickyvv.com | @NickyvV


DebbieE
Community Champion
Community Champion

Honestly no it doesnt make sense and at present I wouldnt be happy to use this option

 

You have thousands and thousands of rows. Without any other Identifier you basically use order date to bring together the data set that we are going to process. Thats fine.

 

But then using LastUpdateDate I cant understand, with thousands of records how you know which row is which so you can delete and amend them

 

You cant identify a record with just a date when lots of other records may have that date. Its not unique. Im completely at a loss with this one 

 

nickyvv
Community Champion
Community Champion

@DebbieE I'll try again 🙂

 

I'll start with this statement from the docsThe maximum value of the column [LastUpdateDate] is evaluated for each of the periods in the incremental range.

(I'll repeat my last answer and explain a bit further): So for every day in the last 10 days (if you set the option Refresh rows in the last 10 days), the maximum value of LastUpdateDate is compared to the current value in Power BI. If it is higher, this whole day of data is deleted and updated in the dataset. So there is no notion of keys or rows whatsoever in the dataset. It only knows which day it is (by the OrderDate column) and knows if the refresh it by the LastUpdateDate column.

So if for day 1 the LastUpdateDate is higher in the datasource than the value of LastUpdateDate in the Power BI dataset, this day is deleted and fully refreshed.

Then it does the same for day 2, day 3, ... day 10. If any of the LastUpdateDate values is higher in the datasource then in the Power BI dataset, it refreshes this day completely, without looking at individual rows or keys.

 

Let me know if this enlightens you 😀

 

Did this help you or did I answer your question?
Then please give kudos or mark my post as a solution!
My blog: nickyvv.com
Twitter: @NickyvV



Did I answer your question? Mark my post as a solution!

Blog: nickyvv.com | @NickyvV


DebbieE
Community Champion
Community Champion

 

"So for every day in the last 10 days (if you set the option Refresh rows in the last 10 days), the maximum value of LastUpdateDate is compared to the current value in Power BI"

 

Are you saying that if for instance I have this data in Power BI all for the last 10 days 

 

Orderdate      LastUpdateDate

01/01/2020    01/01/2020

02/01/2020    02/01/2020

03/01/2020    03/01/2020

04/01/2020    04/01/2020

05/01/2020    05/01/2020

06/01/2020    06/01/2020

07/01/2020    07/01/2020

08/01/2020    08/01/2020

09/01/2020    09/01/2020

10/01/2020    10/01/2020

 

And the 11th we import this

 

Orderdate      LastUpdateDate

01/01/2020    01/01/2020

02/01/2020    11/01/2020

03/01/2020    11/01/2020

04/01/2020    04/01/2020

05/01/2020    05/01/2020

06/01/2020    06/01/2020

07/01/2020    07/01/2020

08/01/2020    08/01/2020

09/01/2020    09/01/2020

10/01/2020    10/01/2020

11/01/2020    11/01/2020

11/01/2020    11/01/2020

11/01/2020    11/01/2020

 

Without it ticked,  Everything for the last 10 days would be deleted and thenre added. 01/02/2020 would be left as is because its not in the 10 days

 

With Detect Data Change it actually looks at OrderDate, sees that LastUpdateDate is different to OrderDate and only deletes the ones where LastUpdateDate is not equal to OrderDate?

 

This is the only other way I can see if happening without including row identifiers

Otherwise you cant tell 11/01/2020 is changed by itsself, it could simply be a new record. You cant compare LastUpdateDate with no other information

 

And then the more I think about it this still doesnt make sence. Because when you import on the 12th 

03/01/2020    11/01/2020

We already did this but surely because the date is different it will be deleted and readding again. I think a big issue is that Microsoft are trying to do this as a black box. It just works, you dont need to know why but you do need to know why...

 

Still lost

nickyvv
Community Champion
Community Champion

Hi @DebbieE, we are getting there 😁

 


@DebbieE wrote:

With Detect Data Change it actually looks at OrderDate, sees that LastUpdateDate is different to OrderDate and only deletes the ones where LastUpdateDate is not equal to OrderDate?

You are almost right, except it compares LastUpdateDate with the Max LastUpdateDate for that OrderDate (in your example OrderDate=LastUpdateDate, but it is (or could be) a small difference).

 

And in your next question, the thing is: 

@DebbieE wrote:

And then the more I think about it this still doesnt make sence. Because when you import on the 12th 

03/01/2020    11/01/2020

We already did this but surely because the date is different it will be deleted and readding again.

03/01/2020    11/01/2020

is already in the Power BI dataset because this was updated the 11th, right?

So from the 11th to the 12th there is no change (the dates are the same), so it won't update OrderDate 03/01/2020.

 

Hope this helps.

 

Did this help you or did I answer your question?
Then please give kudos or mark my post as a solution!
My blog: nickyvv.com
Twitter: @NickyvV



Did I answer your question? Mark my post as a solution!

Blog: nickyvv.com | @NickyvV


DebbieE
Community Champion
Community Champion

Nope, This is getting away from me even further now

 

@DebbieE wrote:

With Detect Data Change it actually looks at OrderDate, sees that LastUpdateDate is different to OrderDate and only deletes the ones where LastUpdateDate is not equal to OrderDate?

You are almost right, except it compares LastUpdateDate with the Max LastUpdateDate for that OrderDate (in your example OrderDate=LastUpdateDate, but it is (or could be) a small difference).

 

"That Order Date" You might have thousands of records with the same order date. You can say That ID because thats unique but OrderDate isnt.  So you cant compare LastUpdatedate with the Max Last UpdateDate without having an ID to say that its the same

row?

 

This has completely thrown me 

 

And in your next question, the thing is: 

@DebbieE wrote:

And then the more I think about it this still doesnt make sence. Because when you import on the 12th 

03/01/2020    11/01/2020

We already did this but surely because the date is different it will be deleted and readding again.

03/01/2020    11/01/2020

is already in the Power BI dataset because this was updated the 11th, right?

So from the 11th to the 12th there is no change (the dates are the same), so it won't update OrderDate 03/01/2020.

 

Nope, completely lost. Unless it understands that that row is the same row because there may be thousands of OrderDates of 03/01/2020 and there may be another record of 03/01/2020, that did get updated on the 11th but also gets updated on the the 12th

 

I should also say that its not just me that is struggling to understand this. Ive been speaking to alot of people who cant grasp this concept just like myself. They dont feel they can use this until they fully understand whats going on. And unfortunately it looks like I cant help because it doesnt seem  right to me. 

Helpful resources

Announcements
Europe Fabric Conference

Europe’s largest Microsoft Fabric Community Conference

Join the community in Stockholm for expert Microsoft Fabric learning including a very exciting keynote from Arun Ulag, Corporate Vice President, Azure Data.

AugPowerBI_Carousel

Power BI Monthly Update - August 2024

Check out the August 2024 Power BI update to learn about new features.

August Carousel

Fabric Community Update - August 2024

Find out what's new and trending in the Fabric Community.