The ultimate Microsoft Fabric, Power BI, Azure AI, and SQL learning event: Join us in Stockholm, September 24-27, 2024.
Save €200 with code MSCUST on top of early bird pricing!
Find everything you need to get certified on Fabric—skills challenges, live sessions, exam prep, role guidance, and more. Get started
Doing Incremental Refresh in Premium
I have a last update date in my data. When I detect data changes set to on, it checks the row and if this date has changed then the record is updated.
However how does it know that the record in Power BI and the record coming in is the same record?
I have searched the documentation and cant find anything on this. Im assuming its the Primary key of the record or the key ? However this isnt set anywhere tht I can see
What about when you are loading in the fact table? The fact table contains foreign keys. Basically how doesthe incremental refresh know that this is the row that replaces the row in Power BI?
https://docs.microsoft.com/en-us/power-bi/service-premium-incremental-refresh#detect-data-changes Doesnt seem to answer this question
Fascinating conversation. Coworker and myself have been beating our heads against the wall trying to figure out how this works - its way more complicated than necessary in my opinion. But....I think the short answer to the original question (which never got directly addressed),= is it doesn't need to know what the unique key is to find the same record. And that's because it just reloads the entire partition involved in a change. Just deletes by partition, not by key value.
But then lot of this conversation is making assumptions based on documentation rather than actual data tests. It's unfortunate that Microsoft isn't very revealing of the internals on how this works. And note...if you use the "Get the latest data in real time with DirectQuery" option then it won't load anything for the current partition - instead it's going against your source for the most recent data.
Our tests did this...our data goes back 50 years and can change at any time point in history. The reason is we merge customer records and re-arrange households and golden records on a daily basis so all of history can get a new key value in our fact tables. So our refresh and archive settings are 50 years. Detect data changes is turned on and Get latest data in real time and only refresh complete years is turned off. We set the refresh column to modified timestamp column and setup the parm/filter to business date timestamp such as an order date (converted to time via DAX function) which of course becomes the basis for the partition ranges.
We then updated the modified timestamp to getdate() for specific orders in history. We then profiled the SQL during refresh after waiting a couple of minutes. What we observed was first a query to get max(modifiedtime) for every partition. Then it does a partition range query for every partition where there was a newer modified timestamp. So how it behaves like what was said here.
The one exception was...when I also modified the date which is the basis for the partition so it should move partitions (such as order date) then nothing happened during refresh. I was hoping it was smart enough to rebuild both partitions involved but instead it did nothing at all. My assumption is it gets confused if the partitioning date changes (which they say isn't supposed to). The solution to this will be just knowing on a daily basis the data will leak these kind of changes and say every weekend do complete refresh to get these leaks resolved.
This is what Im not understanding at all. You cant just base it on LastUpdatedate
Say you have the following in Power BI
Key ID Surname lastUpdatedate
1 24 Smyth 01/09/2019
2 48 Jonis 01/09/2019
And you upload this data
Key ID Surname lastUpdatedate
1 24 Smith 21/09/2019
2 48 Jones 21/09/2019
You cant overwrite the correct record simply using the LastUpdatedate. There doesnt appear to be any information on this anywhere but If it was simply down to LastUpdateDate then I dont understand how you can trust Incremental processing in Power BI?
Surely you have to supply more details like IDs or something else?
Hi @DebbieE
When you are configuring your Power BI Incremental Refreshing you can set it on the following screen to detect data changes
Here are more details
https://docs.microsoft.com/en-us/power-bi/service-premium-incremental-refresh#detect-data-changes
Hi @DebbieE, I understand you concerns, but let me clarify:
Because you speak about a fact table I assume you have basic knowledge of a DWH scenario. Otherwise please let me know and I can explain this further.
In a DWH you would have a Slowly Changing Dimension (type 2 for example) on the fact table usually with valid_from and valid_to columns (or similar). When processing changes these columns get updated with new values and new rows are inserted when necessary.
In Power BI we have the RangeStart and RangeEnd parameters, which you define on a column in your (fact) table. An important difference however: this parameter is NOT defined on the SCD-2 columns in your fact table, but rather on a functional date column like OrderDate or TransactionDate. Incremental refresh isn't designed to support cases where the filtered date column is updated in the...
After defining your parameters and filtering on them in your model, the ranges you set in the refresh policy are working as follows:
The important clarification is this: the refresh will do a delete-insert rather than an update of the last 10 days of data (based on your parameters defined on the functional date column. So this means: the whole set of 10 days is deleted from your dataset, and refreshed with the new rows of 10 days worth of data.
From the documentation:
An update is interpreted as an insertion and a deletion, not an actual update. If the deletion occurs in the historical range and not the incremental range, it won’t get picked up. This can cause data refresh failures due to partition-key conflicts.
I hope this clarifies your concern a bit. If you have any questions don't hesitate to ask them!
Did this help you or did I answer your question?
Then please give kudos or mark my post as a solution!
My blog: nickyvv.com
Twitter: @NickyvV
Im so sorry Im really really struggling to understand this concept.
Going from the beginning
Switch Incremental Refresh policy on in Options
Power Query Editor. Add RangeStart and RangeEnd with dates for that will be overwritten so it doesnt matter what you put here
Q What constitutes a range here? Im not sure what this means
Filter your data in desktop using RangeStart and RangeEnd so you dont have a massive model within your desktop
Set your Incremental Refresh policy
Storing 60 months of data (LastUpdateDate is in the last 60 months)
Refreshing 10 days of data where LastUpdateDate is in that 10 days
Todays date is the 28th January
All the data was loaded Q Do you have to do this before you set up the inremental refresh?
In Power BI
Key ID Surname lastUpdatedate
1 24 Smyth 15/01/2020
2 48 Jonis 15/01/2029
And you upload this data
Key ID Surname lastUpdatedate
1 24 Smith 27/09/2019
2 48 Jones 27/09/2019
What would happen in the above situation?
It says that anything in the last 10 days gets deleted. And anything in the last 10 days coming in gets loaded. But because the last updateDate in Power BI is over 10 days old, surely these will remain and the new 2 records will be loaded meaning that you now have duplicates?
How does the incremental refresh know to delete Where Key = 1 and then Add Where Key = 1 if its just based on the date?
I dont really understand how RangeStart and RangeEnd fits with what Im doing either at the moment.
And because I dont understand the basics Detect data changes has really confused me
@DebbieE no problem, let me try to clarify.
I'll start at your last question, because I think that holds the most important point/answer.
In the Filter date column updates it says:
Incremental refresh isn't designed to support cases where the filtered date column is updated in the source system.
So the scenario you sketch is not possible with Incremental Refresh, because your LastUpdateDate changes. In the example an OrderDate is chosen, which typically never changes for a given order.
Regarding your other question:
The first refresh operation in the service loads historical data (up to the 60 months of data in your example). Subsequent refreshes are incremental and loads 10 days of data where LastUpdateDate is in that 10 days.
So you'll have to choose a column in your fact table that holds a (functional) date that does not change for a given fact.
Does this clear things up?
I’m so sorry I’m still asking question on this but Im still very confused
With this example
Date of Upload 28/01/2020
In Power BI already
Key ID Surname OrderDate lastUpdatedate
1 24 Smyth 15/01/2020 15/01/2020
2 48 Jonis 15/01/2020 15/01/2020
3 67 Ellis 15/01/2020 15/01/2020
4 81 Truman 15/01/2020 15/01/2020
And you upload this data
Key ID Surname OrderDate lastUpdatedate
1 24 Smith 15/01/2020 27/09/2019
2 48 Jones 15/01/2020 27/09/2019
3 67 Ellis 15/01/2020 15/01/2020
4 81 Truman 15/01/2020 15/01/2020
RangeStart and RangeEnd has already been added and the filter set in desktop because we have a lot of data in this table. It was set up on LastUpdateDate because this makes sense
However
Set the Incremental Refresh policy now has "LastUpdateDate in the StoreRowsWhereColumn LastUpdateDate is in the last …….."
So it must be taking this column information from the filter I set? Is this correct?
So your saying I need to replace the filter and swap it to Order date because it doesn’t change?
Then incremental refresh wont work for this standard use case because, If I set Order date to last 10 days. It now wont delete Key 1 and 2 or readd because they are over 10 days old.
This means I have to set the Refresh Rows for a much longer time period. OK that bit makes sense.
But what is the point of the LastUpdateDate then. Surely these fields always change on rows that can be updated?
Also this still makes no sense of how detect data changes works.
How does it know that Key 1 has changed and does need deleting just based on Date
Ahhhhhhhh I think it just makes more sense apart from a couple of things
You use Order date because it doesnt change to delete or readd records (So make sure you understand how far back your records can change)
And LastUpdateDate is used to detect data changes in the record (It just checks that field has changed. However again Im still unsure how it understands that the row with Key1 is the same as the new row with Key 1 unless it uses key. So this Im still unsure of
Still desperately looking for the last piece to the puzzle with this....
And LastUpdateDate is used to detect data changes in the record (It just checks that field has changed). However again Im still unsure how it understands that the row with Key1 is the same as the new row with Key 1 unless it uses key. So this Im still unsure of
I keep being told the same thing about the date but I still don’t understand. it needs more information to be properly understood.
I’m just not in a position to explain how this works or say that it does work as we want it too at the moment.
Date of Upload 28/01/2020
In Power BI already
Key ID Surname OrderDate lastUpdatedate
1 24 Smyth 15/01/2020 15/01/2020
2 48 Jonis 15/01/2020 15/01/2020
3 67 Ellis 15/01/2020 15/01/2020
4 81 Truman 15/01/2020 15/01/2020
And you upload this data
Key ID Surname OrderDate lastUpdatedate
1 24 Smith 15/01/2020 27/09/2019
2 48 Jones 15/01/2020 27/09/2019
3 67 Ellis 15/01/2020 15/01/2020
4 81 Truman 15/01/2020 15/01/2020
Without detect data changes ticked all these 4 records would be deleted and then reloaded because the date is within our range
With detect data changes ticked
1 24 Smith 15/01/2020 27/09/2019
How does it know that this specific row has changed? Clearly last time LastUpdateDate was 15/01/2020 and now it’s the 27/01/2020 but lots of rows have also been updated. How does it know that that specific row has changed without using the ID?
Usually the logic for updates, new records etc is quite complex and I cant understand how this just looks at date for detecting data changes
Im clearly never going to understand this. It just doesnt make sense to me
In Power BI
lastUpdatedate
15/01/2020
15/01/2020
15/01/2020
15/01/2020
And you upload this data
lastUpdatedate
27/09/2019
27/09/2019
15/01/2020
15/01/2020
27/09/2020
27/09/2020
Just using this date there is simply no way of knowing which record has changed unless you take the ID into account
@DebbieE no problem, let me try to explain it.
First, your example is a bit strange because the LastUpdateDate changes from 15/01/2020 to 27/09/2019?
I think it works this way, and the crux is I think in the following: The maximum value of this column [LastUpdateDate] is evaluated for each of the periods in the incremental range (from the docs).
So for every X days (where X is defined by the option: Refresh rows in the last ... days), the maximum value of LastUpdateDate is compared to the current value. If it is higher, this whole day of data is deleted and updated in the dataset. So there is no notion of keys or rows whatsoever in the dataset. It only knows which day it is (by the OrderDate column) and knows if the refresh it by the LastUpdateDate column.
Does that make sense? 🙂
Did this help you or did I answer your question?
Then please give kudos or mark my post as a solution!
My blog: nickyvv.com
Twitter: @NickyvV
Honestly no it doesnt make sense and at present I wouldnt be happy to use this option
You have thousands and thousands of rows. Without any other Identifier you basically use order date to bring together the data set that we are going to process. Thats fine.
But then using LastUpdateDate I cant understand, with thousands of records how you know which row is which so you can delete and amend them
You cant identify a record with just a date when lots of other records may have that date. Its not unique. Im completely at a loss with this one
@DebbieE I'll try again 🙂
I'll start with this statement from the docs: The maximum value of the column [LastUpdateDate] is evaluated for each of the periods in the incremental range.
(I'll repeat my last answer and explain a bit further): So for every day in the last 10 days (if you set the option Refresh rows in the last 10 days), the maximum value of LastUpdateDate is compared to the current value in Power BI. If it is higher, this whole day of data is deleted and updated in the dataset. So there is no notion of keys or rows whatsoever in the dataset. It only knows which day it is (by the OrderDate column) and knows if the refresh it by the LastUpdateDate column.
So if for day 1 the LastUpdateDate is higher in the datasource than the value of LastUpdateDate in the Power BI dataset, this day is deleted and fully refreshed.
Then it does the same for day 2, day 3, ... day 10. If any of the LastUpdateDate values is higher in the datasource then in the Power BI dataset, it refreshes this day completely, without looking at individual rows or keys.
Let me know if this enlightens you 😀
Did this help you or did I answer your question?
Then please give kudos or mark my post as a solution!
My blog: nickyvv.com
Twitter: @NickyvV
"So for every day in the last 10 days (if you set the option Refresh rows in the last 10 days), the maximum value of LastUpdateDate is compared to the current value in Power BI"
Are you saying that if for instance I have this data in Power BI all for the last 10 days
Orderdate LastUpdateDate
01/01/2020 01/01/2020
02/01/2020 02/01/2020
03/01/2020 03/01/2020
04/01/2020 04/01/2020
05/01/2020 05/01/2020
06/01/2020 06/01/2020
07/01/2020 07/01/2020
08/01/2020 08/01/2020
09/01/2020 09/01/2020
10/01/2020 10/01/2020
And the 11th we import this
Orderdate LastUpdateDate
01/01/2020 01/01/2020
02/01/2020 11/01/2020
03/01/2020 11/01/2020
04/01/2020 04/01/2020
05/01/2020 05/01/2020
06/01/2020 06/01/2020
07/01/2020 07/01/2020
08/01/2020 08/01/2020
09/01/2020 09/01/2020
10/01/2020 10/01/2020
11/01/2020 11/01/2020
11/01/2020 11/01/2020
11/01/2020 11/01/2020
Without it ticked, Everything for the last 10 days would be deleted and thenre added. 01/02/2020 would be left as is because its not in the 10 days
With Detect Data Change it actually looks at OrderDate, sees that LastUpdateDate is different to OrderDate and only deletes the ones where LastUpdateDate is not equal to OrderDate?
This is the only other way I can see if happening without including row identifiers
Otherwise you cant tell 11/01/2020 is changed by itsself, it could simply be a new record. You cant compare LastUpdateDate with no other information
And then the more I think about it this still doesnt make sence. Because when you import on the 12th
03/01/2020 11/01/2020
We already did this but surely because the date is different it will be deleted and readding again. I think a big issue is that Microsoft are trying to do this as a black box. It just works, you dont need to know why but you do need to know why...
Still lost
Hi @DebbieE, we are getting there 😁
@DebbieE wrote:With Detect Data Change it actually looks at OrderDate, sees that LastUpdateDate is different to OrderDate and only deletes the ones where LastUpdateDate is not equal to OrderDate?
You are almost right, except it compares LastUpdateDate with the Max LastUpdateDate for that OrderDate (in your example OrderDate=LastUpdateDate, but it is (or could be) a small difference).
And in your next question, the thing is:
@DebbieE wrote:And then the more I think about it this still doesnt make sence. Because when you import on the 12th
03/01/2020 11/01/2020
We already did this but surely because the date is different it will be deleted and readding again.
03/01/2020 11/01/2020
is already in the Power BI dataset because this was updated the 11th, right?
So from the 11th to the 12th there is no change (the dates are the same), so it won't update OrderDate 03/01/2020.
Hope this helps.
Did this help you or did I answer your question?
Then please give kudos or mark my post as a solution!
My blog: nickyvv.com
Twitter: @NickyvV
Nope, This is getting away from me even further now
@DebbieE wrote:With Detect Data Change it actually looks at OrderDate, sees that LastUpdateDate is different to OrderDate and only deletes the ones where LastUpdateDate is not equal to OrderDate?
You are almost right, except it compares LastUpdateDate with the Max LastUpdateDate for that OrderDate (in your example OrderDate=LastUpdateDate, but it is (or could be) a small difference).
"That Order Date" You might have thousands of records with the same order date. You can say That ID because thats unique but OrderDate isnt. So you cant compare LastUpdatedate with the Max Last UpdateDate without having an ID to say that its the same
row?
This has completely thrown me
And in your next question, the thing is:
@DebbieE wrote:And then the more I think about it this still doesnt make sence. Because when you import on the 12th
03/01/2020 11/01/2020
We already did this but surely because the date is different it will be deleted and readding again.
03/01/2020 11/01/2020
is already in the Power BI dataset because this was updated the 11th, right?
So from the 11th to the 12th there is no change (the dates are the same), so it won't update OrderDate 03/01/2020.
Nope, completely lost. Unless it understands that that row is the same row because there may be thousands of OrderDates of 03/01/2020 and there may be another record of 03/01/2020, that did get updated on the 11th but also gets updated on the the 12th
I should also say that its not just me that is struggling to understand this. Ive been speaking to alot of people who cant grasp this concept just like myself. They dont feel they can use this until they fully understand whats going on. And unfortunately it looks like I cant help because it doesnt seem right to me.
Join the community in Stockholm for expert Microsoft Fabric learning including a very exciting keynote from Arun Ulag, Corporate Vice President, Azure Data.
Check out the August 2024 Power BI update to learn about new features.
User | Count |
---|---|
51 | |
26 | |
14 | |
14 | |
12 |
User | Count |
---|---|
107 | |
38 | |
25 | |
23 | |
19 |