De-duplicating incrementally refreshed data

najmead · ‎08-10-2022

I have an issue with slow dataset loads, and have been exploring the use of incremental refreshes to improve performance. However, my data is quickly getting out of sync with source. My understanding of the incremental refresh process is that assuming I set a refresh window of 1 month, if a record that is created two months ago gets updated in source, it will essentially get loaded a second time -- since the original record has now fallen outside the refresh window. This would certainly explain why my data is getting out of sync.

Obviously, I can extend the window, but I'm finding this doesn't really solve the problem -- in order to extend it sufficiently to cover my needs, performance starts to degrade again.

Consequently, is there any reason I can't de-duplicate the data with a step in Power Query? eg, if I add steps after the filter for RangeStart and RangeEnd to sort by last_mod_date and then deduplicate by the primary key... will that work? Or does the Power Query only apply on the partition that is being refreshed?

lbendlin · ‎08-11-2022

Obviously, I can extend the window, but I'm finding this doesn't really solve the problem

You may be able to find a balance there that works good enough.

In any case, you are hitting on the true core of the issue here - incremental refresh is exactly that - incremental. It expects data to be immutable once it has been added. What you expect to get (differential refresh) is not what incremental refresh is about, and is not what it can deliver.

There is a small remedy in using the "detect data changes" feature but it comes with a gigantic price tag. I'll let you find out what that is.

What you want to do is continue with your incremental refresh, and occasionally (once a month for example) force a reprocess of a couple of the "older" partitions as well. Doesn't even have to be a complete flush and fill of the dataset , you can use the XMLA endpoint to target individual partitions.

najmead · ‎08-12-2022

@lbendlin wrote:
There is a small remedy in using the "detect data changes" feature but it comes with a gigantic price tag. I'll let you find out what that is.

I'm sorry, but I have no idea what this is referring to.