After some months of struggling to set up a working incremental refresh solution on my SQL DB source, I came to the sad conclusion that it can't be done. Power BI does not seem to be able to handle an extremely common scenario: record updates. I'm writing this post to check whether my understanding is correct and, hopefully, to find a solution.
So my supposedly typical scenario is the following:
All this data (around 30M records, so not a trivial amount) is pulled into Power BI every night. Having heard about the incremental refresh feature, I proceeded to configure the following:
I assumed that Power BI would query the DB for all records modified in the last 7 days and then update the records it stores in its cache. The source is a SQL database, there's a primary key, it all seemed natural. But I learned that this is not how Power BI actually works.
Instead, it removes 7 days' worth of data from the local cache and replaces it with the 7 days' worth of data just queried from SQL. The primary key is never considered, so it effectively creates duplicates: the original record stays in the cache, and the updated copy of the same record is added during the incremental refresh. As a result I'm getting key duplication errors and so on.
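To make the duplicate problem concrete, here is a minimal Python sketch (my own simplification with made-up data, not Power BI internals) contrasting the partition-replace behaviour described above with the keyed upsert I expected. The row was loaded long ago, then updated recently, so the source query for the refresh window returns it again with a new ModifiedAt:

```python
from datetime import date

# The cached copy of the row sits in an old partition of the local cache.
cache = [{"id": 1, "modified_at": date(2022, 1, 10), "value": "original"}]

# The same row was updated yesterday, so the source query for the 7-day
# window returns it again with a new ModifiedAt.
incoming = [{"id": 1, "modified_at": date(2022, 6, 5), "value": "updated"}]

def partition_replace(cache, incoming, window_start):
    """What incremental refresh actually does: drop cached rows whose
    filter-column value falls inside the refresh window, then append
    whatever the source query returned -- no primary-key check."""
    kept = [r for r in cache if r["modified_at"] < window_start]
    return kept + incoming  # old copy kept AND new copy appended

def keyed_upsert(cache, incoming):
    """What an update-aware refresh would do: merge on the primary key."""
    merged = {r["id"]: r for r in cache}
    for r in incoming:
        merged[r["id"]] = r
    return list(merged.values())

window_start = date(2022, 6, 1)  # 7-day window starts after the old copy

print(len(partition_replace(cache, incoming, window_start)))  # 2 -> duplicate key
print(len(keyed_upsert(cache, incoming)))                     # 1 -> update applied
```

The partition swap leaves two rows with primary key 1 in the cache, which is exactly the duplication I'm seeing, while a key-based merge would leave one.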
Am I correct so far in my understanding of how Power BI works?
If I am correct, is there a way to handle updates? I'm guessing that the update scenario must be super common so I struggle to imagine that this is simply not possible to configure/fix somehow.
For people who would suggest the "Detect Data Changes" functionality: I tried it out. I divided my 5 years of data into 60 monthly partitions and configured Power BI to load 60 months' worth of data based on the "CreatedAt" field, but to detect data changes (per partition) based on the "ModifiedAt" value. Unfortunately, the nature of the data is such that within any 24-hour period at least 1 record from each partition gets updated. So I'm effectively doing a full refresh this way.
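To illustrate why that degenerates into a full refresh, here's a rough Python sketch (my own simplification, not Microsoft's implementation) of the detect-data-changes decision: a partition is refreshed whenever its current Max(ModifiedAt) has advanced past the value stored at the last refresh, so a single touched row per partition per day drags every partition back in:

```python
from datetime import date

# Max(ModifiedAt) stored for each of the 60 monthly partitions at last refresh.
stored_max = {month: date(2022, 5, 31) for month in range(60)}

# Current Max(ModifiedAt) per partition in the source. If even one row in a
# partition was updated since the last refresh, its max moves forward.
current_max = {month: date(2022, 6, 1) for month in range(60)}

def partitions_to_refresh(stored, current):
    """Select every partition whose Max(ModifiedAt) has advanced."""
    return [p for p in stored if current[p] > stored[p]]

to_refresh = partitions_to_refresh(stored_max, current_max)
print(len(to_refresh))  # 60 -> all partitions, i.e. effectively a full refresh
```

With at least one update landing in every monthly partition each day, all 60 partitions qualify, every night.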
Any ideas, People? 🙂
Thanks in advance!
@PeKaDe wow, I arrived a bit late at this sad party 🙂
Here I was, setting up incremental loading for big fact tables where a single fact could be updated after a period of time that exceeds the rolling period of the daily refresh. I ended up with duplicates, just as you reported...
This indeed seems like something basic that should just work, given the standard upsert/merge approach we are used to in data warehousing scenarios, but that would probably diminish the performance gain you get from the partition setup.
The only thing I could think of is to:
- run the incremental refresh during the week, accepting some data quality issues but keeping the duration acceptable;
- schedule, for this specific type of table, a full process of all partitions during the weekend. This would require, I think, a Premium Per User setup so that you could run XMLA refresh commands that ignore the refresh policy.
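For reference, the kind of XMLA command I had in mind is a TMSL full `refresh` that bypasses the incremental refresh policy via `applyRefreshPolicy` (this follows Microsoft's advanced incremental refresh documentation; the dataset and table names below are placeholders):

```json
{
  "refresh": {
    "type": "full",
    "applyRefreshPolicy": false,
    "objects": [
      {
        "database": "MyDataset",
        "table": "MyBigFactTable"
      }
    ]
  }
}
```

Sent through the XMLA endpoint (e.g. from SSMS against a Premium/PPU workspace), this should fully reprocess all partitions of the table instead of letting the refresh policy trim the operation to the rolling window.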
Indeed, the detect data changes option can be interesting when only a portion of the partitions gets touched by updates, but I wonder where the tipping point is, performance-wise, compared with just doing a full reprocess.
Did you find another way?
Hi @PaginatedDino , unfortunately not. I ended up optimizing my report, which currently allows me to run full refreshes on a daily basis. The main change was to denormalize my model so that 2 tables (each 25M rows) were combined into one. This made all the difference in the world, as previously I couldn't even refresh effectively (see my other thread here: https://community.powerbi.com/t5/Service/Never-ending-incremental-refresh-in-the-Service/m-p/2851829...).
This has all been quite depressing, so I don't want to hear about incremental refresh any more, at least until they develop an update mode 🙂 I get your point about diminishing the performance gain, but on the other hand, if you use a primary key and filter on an indexed date column, it should run pretty fast and achieve what is needed. But I guess we will never know... no one from MS ever answered any of my questions here on the forum, and no one ever responded to the feedback I sent through the Power BI survey. Moreover, I worked on this with an MS consultant who couldn't identify the problem for 3 months, until I found it myself by guessing how this whole thing works. Sorry about this depressing answer 😞
Good luck with your refresh! 🙂
Hey @PeKaDe, thanks for the reply, sending a virtual cheer-up fist bump 😛 From time to time we bump into these kinds of "puddings". The main thing I take away from this is a scale of update likelihood that determines how useful incremental loading can be:
1 - data (typically events) that never gets updated, hence perfect for incremental loading (partitioning);
50 - data that only partially gets updated beyond the configured rolling refresh period, hence incremental loading + detect data changes could work out;
100 - highly volatile data with no predictable scattering of when a record gets updated after the original insert, which would cause detect data changes to refresh most of the historical partitions anyway.
Anyway, good to hear it works now. Have a nice weekend!
Hi again, I'd love to hear from someone who had this issue and somehow managed to solve it. I can deliver some additional insights from comparison testing.
I used a test environment for these tests, which means there are 20M records and the data is not updated as often. In fact, out of 5 years' worth of data (= 60 months), only 1 monthly partition is modified each day and needs updating.
The testing results shocked me, and I'm on the verge of concluding that incremental refresh (with detect data changes) is close to unusable. I'll be very happy to be corrected - that's the whole point of my posting here: to find a flaw in what I'm doing, and of course a solution.
Case 1 - I configured full refresh and started it from a schedule. It took 9 minutes to refresh 20M records.
Case 2 - I configured incremental refresh as discussed before. I marked the "detect data changes" option and set Power BI to potentially refresh (if there were changes) 60 months into the past.
Surprise no. 1 - the initial full refresh took 53 minutes. That's the same full refresh that takes 9 minutes without this setting. I read the documentation at least twice, and I understand that the only other thing happening there is storing Max(ModifiedAt) for each of the 60 partitions.
Surprise no. 2 - the incremental refresh took 39 minutes, even though it needed to refresh just 1 monthly partition - more or less 2% of the data. We kept watching the DB and couldn't see anything that would give us a clue about what was happening.
Needless to say, I have indexes on both CreatedAt and ModifiedAt, and they were rebuilt before the testing took place. The same queries that Power BI runs execute on the DB within a couple of seconds.
Please help anyone 🙂
@PeKaDe So your understanding of how Power BI incremental refresh works is essentially correct. A partition is created based upon the range parameters. In your case, I would think you would want to set the refresh range based upon the CreatedAt column to 1 day. This would create a partition for each day in the dataset. That way, going forward, Detect Data Changes based upon the ModifiedAt value would only refresh the daily partitions for the particular days where data changed.
This also might be interesting reading for you: Advanced incremental refresh and real-time data with the XMLA endpoint in Power BI - Power BI | Micr...
Hi @Greg_Deckler , thank you for the quick answer. I'm glad I understood correctly how incremental refresh works. I find it really surprising, as I don't think there are many cases that Power BI's incremental refresh can cover without doing proper key-based updates. Do you know whether Microsoft is perhaps working on solving this?
If not, I'd say this should be stated very prominently in the documentation as an important limitation of the feature.
Anyway, your suggested workaround won't work - Microsoft supports values from 1 to 120 for that setting. 5 years = 60 months = 1800 days, so I can't really go with that.
Any other suggestions? Perhaps someone had a similar problem and solved it?