lbendlin

Incremental Refresh with Slow-Changing Source Data

Incremental refresh is based on the cold/warm/hot partition principle.

 

[Image: Annotation 2020-07-16 203110.png]

Based on your settings, older data gets pushed out into ever-larger partitions while newer data is kept in many small partitions. The example below also shows some empty partitions because the source data arrives in weekly installments, yet the Power BI Service creates daily partitions for the newer data as instructed (there is no "weeks" option in the settings).

 

[Image: Annotation 2020-07-16 203413.png]

 

For business data, there are many instances where data in the "old" partitions changes too. In the worst case this would require a full dataset refresh, also affectionately known as "flush and fill" (which will also happen when you make structural changes).

 

To avoid that, we need to think about how to identify the impact of the data changes. The obvious candidate is the "Last Modified Date" of a row of data.

 

What if we had the ability to identify whether anything has changed in an existing section of the source data? For example, an order that was placed in May has now been marked as canceled. Our query against the data source can pull all the items with a Last Modified Date in the last 14 days, which would automatically include the newly added items as well as the modified items.
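
As an illustration, such a query could look like the sketch below (the server, database, table, and column names are hypothetical):

let
    // hypothetical Orders table with a row-level last-modified column
    Source = Sql.Database("myserver", "mydatabase"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
    // pull everything touched in the last 14 days - newly added rows and modified rows alike
    #"Recent Changes" = Table.SelectRows(Orders, each
        [LastModifiedDate] >= Date.AddDays(DateTime.LocalNow(), -14))
in
    #"Recent Changes"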

 

Can you see the pitfall yet?  The query would not include items that were removed in that period!

 

So, surprisingly, the row-level Last Modified Date is unsuitable for incremental refresh in this scenario! We need to take the concept a level higher - the last modified date of a chunk of data will have to act as a substitute.

 

Another example: transactions are pulled from the source system into CSV files by fiscal quarter. By comparing the size of the CSV file with the previous extract for the same fiscal quarter, or better yet comparing the contents of the two files, we can identify whether there were any changes to the data in that quarter. Some ways to implement that check would be through Power Automate or through PowerShell. In a pinch, even a good old DOS batch file will do.

 

Now, if we set up Incremental Refresh with partitions per quarter, we can use the Last Modified Date of the CSV file (rather than the data inside it) as the trigger for the refresh. By keeping the CSV files reasonably sized (for example one file per quarter, per month, or - as in this example - per week), we can keep the actual partial flush and fill to a reasonable size too. Think of it as a selective flush and fill rather than an incremental refresh.

 

The documentation on Incremental Refresh says that it requires a data source that supports query folding. This is supposedly required to properly include the RangeStart and RangeEnd parameters in the query to the data source. If you set up incremental refresh for a data source that does not support query folding, you will get a warning.

 

[Image: Annotation 2020-07-16 204247.png]

 "Not recommended"  doesn't mean impossible. So, let's plough on!

 

Key to incremental refresh is that your data source query has to include the RangeStart and RangeEnd filters. The parameter values don't have to cover the whole range though - you can use that to your advantage by specifying a narrow range in the .PBIX file for testing on a subset of the data.

 

[Images: Annotation 2020-07-16 204827.png, Annotation 2020-07-16 204806.png]
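
For reference, RangeStart and RangeEnd are ordinary DateTime parameters in Power Query. Behind the parameter dialog they could look like the sketch below (the dates are placeholders for a narrow test range):

// RangeStart - a narrow test range keeps the .PBIX small
#datetime(2020, 6, 1, 0, 0, 0) meta [IsParameterQuery = true, Type = "DateTime", IsParameterQueryRequired = true]

// RangeEnd
#datetime(2020, 7, 1, 0, 0, 0) meta [IsParameterQuery = true, Type = "DateTime", IsParameterQueryRequired = true]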

 

Now for the actual Power Query code.  Note the filter for RangeStart and RangeEnd:

let
    // List all files on the SharePoint site
    Source = SharePoint.Files("https://xxx.sharepoint.com/sites/yyy", [ApiVersion = 15]),
    // Keep only the relevant CSV files, filtered on the file's Date modified
    // so that RangeStart/RangeEnd drive which partitions get refreshed
    #"Filtered Rows" = Table.SelectRows(Source, each (
        [Folder Path] = "https://xxx.sharepoint.com/sites/yyy/Shared Documents/"
        and Text.StartsWith([Name], "zzz")
        and Text.EndsWith([Name], ".csv")
        and [Date modified] > RangeStart
        and [Date modified] <= RangeEnd)
    ),
    #"Renamed Columns" = Table.RenameColumns(#"Filtered Rows", {{"Name", "Source.Name"}}),
    // GetFile is a separate function query that reads one CSV file's contents
    #"Get File Contents" = Table.AddColumn(#"Renamed Columns", "CSV", each GetFile([Content])),
    #"Removed Other Columns" = Table.SelectColumns(#"Get File Contents", {"Source.Name", "CSV"}),
    // Columns is a separate list query holding the column names to expand
    #"Expanded CSV" = Table.ExpandTableColumn(#"Removed Other Columns", "CSV", Columns, Columns)
in
    #"Expanded CSV"

GetFile is a subfunction that pulls the targeted CSV file's contents, throws away unneeded columns, and assigns column types. Columns is a predefined list of the columns that we are interested in. This avoids having to read a "sample" file to get the header data.
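
For completeness, here is a minimal sketch of what those two queries could look like - the column names, delimiter, and types are hypothetical stand-ins, not the actual schema:

// Columns - a separate list query with the columns we are interested in (placeholder names)
{"OrderID", "OrderDate", "Status", "Amount"}

// GetFile - a separate function query that reads one CSV file's content,
// keeps only the wanted columns, and assigns column types
(content as binary) as table =>
let
    Csv = Csv.Document(content, [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]),
    Promoted = Table.PromoteHeaders(Csv, [PromoteAllScalars = true]),
    Selected = Table.SelectColumns(Promoted, Columns, MissingField.UseNull),
    Typed = Table.TransformColumnTypes(Selected, {
        {"OrderID", Int64.Type}, {"OrderDate", type date},
        {"Status", type text}, {"Amount", type number}})
in
    Typed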

 

We can see the finish line, but here's where things get a little tricky. Some transforms work better with incremental refresh on the service than others. Notably, Table.AddColumn seems to be much more benign than Table.NestedJoin. If the incremental refresh doesn't work, look at your Power Query code and try different variations of the commands to see if they make a difference. The code version above has been working reliably for a couple of months now.
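
To illustrate what "different variations" means, the sketch below contrasts the two patterns on a tiny stand-in file list (GetFile and the file names here are placeholders, not the real query):

let
    // stand-ins for the real SharePoint file list and the GetFile subfunction
    GetFile = (content) => #table({"OrderID", "Status"}, {{1, "Open"}}),
    Files = #table({"Source.Name", "Content"}, {{"zzz-2020-W01.csv", null}, {"zzz-2020-W02.csv", null}}),

    // Variation 1: attach the file contents with Table.AddColumn - the benign option in this scenario
    ByAddColumn = Table.AddColumn(Files, "CSV", each GetFile([Content])),

    // Variation 2: attach the same contents via Table.NestedJoin against a second query - this proved more troublesome with incremental refresh
    Transformed = Table.AddColumn(Files, "CSV", each GetFile([Content])),
    ByNestedJoin = Table.NestedJoin(
        Files, {"Source.Name"},
        Table.SelectColumns(Transformed, {"Source.Name", "CSV"}), {"Source.Name"},
        "CSV Tables", JoinKind.LeftOuter)
in
    ByAddColumn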

[Image: Annotation 2020-07-16 210323.png]

 

The time savings are astonishing. Instead of having to refresh the full dataset (currently 11 GB), the incremental refresh has to work on at most 1 GB of CSV files and finishes in a couple of minutes.

 

I hope this article helps those of you considering incremental refresh in your decision-making process. Incremental refresh is not a universal cure for dataset refresh slowness, but it is an important tool that you want to have in your toolkit.

 

 

 

 

Comments

Thank you for this. I just wanted to confirm that new data has to be stored in the source system. Even if it's on a shared folder, that partitioned data needs to be there. What I mean is: if the data gets overwritten on a weekly basis, a copy has to be added to the folder. Even if I appended the new data to the old before the new data gets overwritten, Power BI wouldn't store the old?

Thanks for the article @lbendlin !  My files are stored in Azure Data Lake Storage Gen2 with a folder structure like /1209/appData/tasks_scheduled/YYYY/MM/DD/[filename.csv] (where YYYY stands for year, MM for month, and DD for day) and my M query looks like this:

let
    Source = AzureStorage.DataLake("https://xxx.dfs.core.windows.net/1209/appData/tasks_scheduled/"),
    #"Filtered Hidden Files1" = Table.SelectRows(Source, each [Attributes]?[Hidden]? <> true),
    #"Invoke Custom Function1" = Table.AddColumn(#"Filtered Hidden Files1", "Transform File (80)", each #"Transform File (80)"([Content])),
    #"Renamed Columns1" = Table.RenameColumns(#"Invoke Custom Function1", {"Name", "Source.Name"})
in
    #"Renamed Columns1"

 

How can I modify your M query to take advantage of incremental refresh and the folder structure?

 

Thank you!

 

You need to grab your folder path information and convert it into a datetime value that you can compare to RangeStart and RangeEnd.
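
For example (a rough sketch, untested, assuming the folder path always ends in .../YYYY/MM/DD/ and reusing your #"Transform File (80)" helper):

let
    Source = AzureStorage.DataLake("https://xxx.dfs.core.windows.net/1209/appData/tasks_scheduled/"),
    #"Filtered Hidden Files1" = Table.SelectRows(Source, each [Attributes]?[Hidden]? <> true),
    // Derive a datetime from the trailing YYYY/MM/DD part of the folder path
    #"Added FolderDate" = Table.AddColumn(#"Filtered Hidden Files1", "FolderDate", each
        let
            parts = List.RemoveItems(Text.Split([Folder Path], "/"), {""}),
            n = List.Count(parts)
        in
            #datetime(
                Number.FromText(parts{n - 3}),   // YYYY
                Number.FromText(parts{n - 2}),   // MM
                Number.FromText(parts{n - 1}),   // DD
                0, 0, 0),
        type datetime),
    // Let the RangeStart/RangeEnd parameters decide which daily folders are read
    #"Filtered by Range" = Table.SelectRows(#"Added FolderDate", each
        [FolderDate] > RangeStart and [FolderDate] <= RangeEnd),
    #"Invoke Custom Function1" = Table.AddColumn(#"Filtered by Range", "Transform File (80)", each #"Transform File (80)"([Content])),
    #"Renamed Columns1" = Table.RenameColumns(#"Invoke Custom Function1", {"Name", "Source.Name"})
in
    #"Renamed Columns1"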