cancel
Showing results for 
Search instead for 
Did you mean: 
Reply
najmead
Frequent Visitor

De-duplicating incrementally refreshed data

I have an issue with slow dataset loads, and have been exploring the use of incremental refreshes to improve performance.  However, my data is quickly getting out of sync with source.  My understanding of the incremental refresh process is that assuming I set a refresh window of 1 month, if a record that is created two months ago gets updated in source, it will essentially get loaded a second time -- since the original record has now fallen outside the refresh window.  This would certainly explain why my data is getting out of sync.

 

Obviously, I can extend the window, but I'm finding this doesn't really solve the problem -- in order to extend it sufficiently to cover my needs, performance starts to degrade again.

 

Consequently, is there any reason I can't de-duplicate the data with a step in Power Query?  eg, if I add steps after the filter for RangeStart and RangeEnd to sort by last_mod_date and then deduplicate by the primary key... will that work?  Or does the Power Query only apply on the partition that is being refreshed?

2 REPLIES 2
lbendlin
Super User
Super User

Obviously, I can extend the window, but I'm finding this doesn't really solve the problem 

You may be able to find a balance there that works good enough.

 

In any case, you are hitting on the true core of the issue here - incremental refresh is exactly that - incremental.  It expects data to be immutable once it has been added.  What you expect to get (differential refresh) is not what incremental refresh is about, and is not what it can deliver.

 

There is a small remedy in using the "detect data changes"  feature but it comes with a gigantic price tag. I'll let you find out what that is.

 

What you want to do is continue with your incremental refresh, and  occasionally (once a month for example) force a reprocess of a couple of the "older" partitions as well.  Doesn't even have to be a complete flush and fill of the dataset , you can use the XMLA endpoint to target individual partitions.


@lbendlin wrote:

There is a small remedy in using the "detect data changes"  feature but it comes with a gigantic price tag. I'll let you find out what that is.

I'm sorry, but I have no idea what this is referring to.

 

Helpful resources

Announcements
PBI Sept Update Carousel

Power BI September 2023 Update

Take a look at the September 2023 Power BI update to learn more.

Learn Live

Learn Live: Event Series

Join Microsoft Reactor and learn from developers.

Dashboard in a day with date

Exclusive opportunity for Women!

Join us for a free, hands-on Microsoft workshop led by women trainers for women where you will learn how to build a Dashboard in a Day!

Top Solution Authors
Top Kudoed Authors