skeets20

Partitioning of Aggregate tables into partial Direct Query or Import by age of data

Problem statement

Even narrow aggregate tables can exceed capacity limits for in-memory storage. In scenarios where multiple aggregated fact tables are required, the constraints become even more problematic. Additionally, developers have to wait for the large import tables to refresh locally before they can continue their work. In many cases it isn't necessary for all data to be included in the in-memory aggregation. Older data that is retrieved infrequently could be "archived," however PBI doesn't have this functionality built in.

 

Assumptions

This article assumes the reader is already familiar with all PowerBI storage modes (Direct Query, Import and Dual) as well as Aggregation tables, Calculation Groups and Tabular Editor. All of these tools and functionalities augment each other to make the solution work.

 

Introduction and Setup

First off, this is a long and in-depth article. I didn't want to just throw the solution out there without explaining how I got to it and some of the gotchas I encountered along the way. Overall, the steps are actually pretty basic and simple. For this demonstration I’m using AdventureWorksDW2016_EXT, PowerBI Desktop 2.96 and Tabular Editor 2.13.2. My PowerBI file is very trim since I’m only pulling in enough data to demonstrate the solution. The AdventureWorks database is relatively small, so this solution might seem pointless; however, the datasets many of us typically work with are huge. Even when our fact tables (in the tens to hundreds of millions of rows) are reduced to a handful of dimensions in aggregate, the PBIX files are still too large to publish to the PowerBI service. There are many ways one can partition their data, but the common scenario demonstrated here is partitioning by a date of some sort. In other words, we aggregate the most current data in memory and defer to Direct Query for historical data. For this demonstration I'm doing everything in the PowerBI file so that it is a working demonstration in itself and only needs to be connected to an instance of AdventureWorksDW2016_EXT to function.

 

Why Bother?

Some of you might be asking “Why do this at all? Can’t we just limit the data in our aggregate table and let PBI defer to Direct Query if needed?” Aggregation tables don’t work that way: they rely on the combination of relationships defined and in scope, not on the presence or absence of data within the aggregated table. I’ll demonstrate this in more detail later to clarify.

 

The Data Model

 

This is the data model for the demonstration. It’s pretty basic, and I’ve added a suffix to each table to indicate its storage mode as Direct, Import or Dual. Please take note that “DimCurrency_Dual” and “DimCustomer_Dual” are not related to the aggregated table “aggFactInternetSales_Import”.

“Measure Values” is an empty placeholder table that holds the measures.

The “ArchiveAfterDateKey” table is actually our parameter with the “Enable load” option set.

 

[Screenshot: skeets20_0-1636389708820.png]

 

 

The model has an aggregate table defined as “aggFactInternetSales_Import”, which is a summary of “Freight”, “OrderQuantity”, “SalesAmount” and “TaxAmount”, pre-aggregated in SQL by the dimensions “Product”, “OrderDate”, “Sales Territory” and “Promotion.”

 

[Screenshot: skeets20_1-1636390876484.png]
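The article builds this aggregate in SQL. If you'd rather see the equivalent expressed as a Power Query step, a minimal sketch could look like the following — "YourServerName" is a placeholder, and the source column names (e.g. TaxAmt) are assumed from the stock AdventureWorksDW schema:

let
    // Placeholder server name; point this at your AdventureWorksDW2016_EXT instance
    Source = Sql.Database("YourServerName", "AdventureWorksDW2016_EXT"),
    Fact = Source{[Schema = "dbo", Item = "FactInternetSales"]}[Data],
    // Group by the four dimension keys and sum the four measures
    Grouped = Table.Group(
        Fact,
        {"ProductKey", "OrderDateKey", "SalesTerritoryKey", "PromotionKey"},
        {
            {"Freight", each List.Sum([Freight]), type number},
            {"OrderQuantity", each List.Sum([OrderQuantity]), Int64.Type},
            {"SalesAmount", each List.Sum([SalesAmount]), type number},
            {"TaxAmount", each List.Sum([TaxAmt]), type number}
        }
    )
in
    Grouped

Against a SQL source a grouping step like this should fold to a GROUP BY, so it behaves much like the SQL pre-aggregation used here.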

 

 

Parameter Setup

To define the point in time where we’d like to switch from in-memory data to Direct Query, we need a way to pass a scalar value to all of the tables that need to be filtered. As mentioned before, this can be done completely on the database side using tables and views, but for this demo I’m using a PowerBI parameter to keep it all self-contained in the PBIX.

Create the parameter

In Power Query, create a new parameter. For this demo I created "ArchiveAfterDateKey", and I'm using the integer natural key for filtering.

[Screenshot: skeets20_2-1636391431856.png]
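For reference, the M that Power Query generates behind a parameter looks something like this if you open it in the Advanced Editor (the value 20130101 is only an illustrative DateKey, not taken from the article):

// "ArchiveAfterDateKey" roughly as Power Query generates it; 20130101 is an example value
20130101 meta [IsParameterQuery = true, Type = "Number", IsParameterQueryRequired = true]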

 

Load the parameter as a table

This step makes our parameter available to the model, which we'll need for a calculation group.

[Screenshot: skeets20_3-1636391553271.png]
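With "Enable load" checked, the parameter effectively lands in the model as a one-row, one-column table. If you ever want to build that table explicitly instead, a hand-written query along these lines (a sketch, assuming the parameter is named ArchiveAfterDateKey) produces the same 'ArchiveAfterDateKey'[ArchiveAfterDateKey] column the calculation group reads later:

let
    // Wrap the current parameter value in a one-row table so the model
    // sees a table named ArchiveAfterDateKey with a column of the same name
    Source = #table(
        type table [ArchiveAfterDateKey = Int64.Type],
        { { ArchiveAfterDateKey } }
    )
in
    Source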

 

 

Filter the Data

In the "Why Bother?" section I posed the question of why we can’t just limit the data in the aggregation table and be done with it. Let’s explore that now. In the next several steps we are going to play with various filtering to show the different behaviors in action.

To begin the partition process, we need to filter our imported aggregation table so that it only includes recent data as defined by our parameter. The animation below will guide you if needed.

[Animation: filterAggregateTable.gif]
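In Power Query terms, the added step is just a row filter against the parameter. A minimal sketch of what that query could look like, assuming a SQL source — "YourServerName" and the "aggFactInternetSales" object are placeholders, not from the original PBIX:

let
    // Placeholders: point this at your own server and pre-aggregated object
    Source = Sql.Database("YourServerName", "AdventureWorksDW2016_EXT"),
    Agg = Source{[Schema = "dbo", Item = "aggFactInternetSales"]}[Data],
    // Keep only the "recent" rows; older rows are left to Direct Query
    #"Filtered Rows" = Table.SelectRows(Agg, each [OrderDateKey] >= ArchiveAfterDateKey)
in
    #"Filtered Rows"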

 

No matter what we do, we are only getting recent data. This demonstrates the behavior mentioned in "Why Bother?": because "Order Date" is the only dimension in scope, and it's part of the aggregate, PowerBI is perfectly happy using the aggregate table even though a lot of periods are missing.

 

[Animation: FilteredTableResults.gif]

 

 

Let's convince PowerBI to ignore the aggregate by using one of our non-aggregated dimensions.

Now that we are using a non-aggregated dimension, PowerBI ignores the aggregated table altogether and chooses Direct Query instead. All of our data is there now, including 2014, because we only filtered the data loaded into the aggregate table. The Direct Query table still has the full range of data.

 

[Animation: FilterOnCustomerResults.gif]

 

I mentioned in the teaser that there are caveats to this solution. Since we filtered the aggregate table to only import recent periods, it seems logical to filter the Direct Query table to exclude those periods. We can explore that by applying the opposite filter for our parameter to the Direct Query table. Given the warning below, keep this step in mind; we will circle back to it and undo it towards the end.

WARNING! This will introduce a data quality issue. Don’t do this in your production or mission critical app!
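The step itself is the same row filter used on the aggregate table, just with the comparison reversed. A sketch under the same assumptions (placeholder server and object names), and remember this is the demo-only step that gets removed again at the end:

let
    Source = Sql.Database("YourServerName", "AdventureWorksDW2016_EXT"),
    Fact = Source{[Schema = "dbo", Item = "FactInternetSales"]}[Data],
    // Demo only: keep just the "archived" rows in the Direct Query table
    #"Filtered Rows" = Table.SelectRows(Fact, each [OrderDateKey] < ArchiveAfterDateKey)
in
    #"Filtered Rows"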

 

[Animation: FilterDetailTable.gif]

 

Let's add two matrix visuals to our page with MonthEndDT (instead of DateDT) on rows and some measures on columns. We also need a slicer on MonthEndDT to play with. I've also updated my parameter to include year 2013 in the aggregate for more data. You'll notice that both matrices are displaying the same "recent" data. This is expected at this point.

 

[Screenshot: skeets20_0-1636416145872.png]

 

 

The Calculation Group

Using Tabular Editor, create the calculation group.

I've called mine "PeriodPartition" and created two Calculation Items, "Archive" and "Current".

Remember how we set "Enable Load" on the Power Query parameter? This is where it comes in. We reference it as a variable in our DAX:

 

[Screenshot: skeets20_1-1636416679271.png]

VAR ArchiveDT =
    MAX ( 'ArchiveAfterDateKey'[ArchiveAfterDateKey] )

The first calculation item, "Archive", is where the magic really happens, and all it takes is a simple filter on the Direct Query table that resolves to "True" but doesn't actually eliminate rows. This convinces PowerBI that it can't use the aggregate table and needs to Direct Query instead.

VAR ArchiveDT =
    MAX ( 'ArchiveAfterDateKey'[ArchiveAfterDateKey] )
RETURN
    CALCULATE (
        SELECTEDMEASURE (),
        FILTER (
            'FactInternetSales_Direct',
            'FactInternetSales_Direct'[OrderDateKey] > 0 //force an always true filter on the DQ table. This tricks PBI to direct query
        ),
        FILTER ( 'DimOrderDate_Dual', 'DimOrderDate_Dual'[DateKey] < ArchiveDT )
    )

The "Current" calculation item is pretty basic and doesn't do anything but filter the aggregated table, which probably isn't neccesary since the table is pre-filterd. This is another caveot we will come back to as well.

VAR ArchiveDT =
    MAX ( 'ArchiveAfterDateKey'[ArchiveAfterDateKey] )
RETURN
    CALCULATE (
        SELECTEDMEASURE (),
        FILTER ( 'DimOrderDate_Dual', 'DimOrderDate_Dual'[DateKey] >= ArchiveDT )
    )

To apply this, we need to filter our matrices using the new calculation items as shown here.

With the calculation items applied as filters, you can see in Performance Analyzer that the Archive and Current matrices perform the way we expect, and each matrix contains data in respect to its filter.

 

[Animation: CalcGroupFilterInaction.gif]

 

What we really want though is a single set, with both archived and current data combined as if it were sourced from one query while being transparent to the user.

To walk through this we will add a "Combined Data" matrix to the page. Now we start to see some of the caveats I mentioned. In the screenshot below, I've filtered "Combined Data" to use both values. I'd expected to see the union of both sets between 3/21/2011 and 12/31/2014. What's actually happening is that PowerBI is preferring the aggregate table because we have no non-aggregated dimensions in scope.

 

[Screenshot: skeets20_3-1636418859406.png]

 

 

Instead of trying to combine the two calculation items for now, we'll create a third one.

I've created a calculation item called "Either" with the following DAX:

 

[Screenshot: skeets20_4-1636419329268.png]

VAR ArchiveDT =
    MAX ( 'ArchiveAfterDateKey'[ArchiveAfterDateKey] )
RETURN
    IF (
        CALCULATE (
            SELECTEDMEASURE (),
            FILTER (
                'FactInternetSales_Direct',
                'FactInternetSales_Direct'[OrderDateKey] > 0 //force an always true filter on the DQ table.
            ),
            FILTER ( 'DimOrderDate_Dual', 'DimOrderDate_Dual'[DateKey] < ArchiveDT )
        ),
        CALCULATE (
            SELECTEDMEASURE (),
            FILTER ( 'DimOrderDate_Dual', 'DimOrderDate_Dual'[DateKey] >= ArchiveDT )
        )
    )

Here are the results after saving from Tabular Editor and updating our "Combined Data" filter to use the "Either" calculation item instead of the other two.

Notice we are only seeing "Archive" data now. I've tried various DAX in the "Either" calculation item with no success, so the next steps are my workaround.

 

[Screenshot: skeets20_2-1637179349408.png]

 

We need an indicator on the period to tell us if the period is archived or not.

To keep everything for this demo in PowerBI, I'm doing it with a calculated column on 'DimOrderDate_Dual' with the following DAX:

 

[Screenshot: skeets20_6-1636419759394.png]

IF (
    'DimOrderDate_Dual'[DateKey]
        < MAX ( 'ArchiveAfterDateKey'[ArchiveAfterDateKey] ),
    1,
    0
)

Now we just need to adjust the DAX in the "Either" calculation item to test the new calculated column instead. Adjust the DAX like so:

[Screenshot: skeets20_1-1637179241644.png]

VAR ArchiveDT =
    MAX ( 'ArchiveAfterDateKey'[ArchiveAfterDateKey] )
RETURN
    IF (
        SELECTEDVALUE ( 'DimOrderDate_Dual'[ArchivedPeriodFLG] ) = 1,
        CALCULATE (
            SELECTEDMEASURE (),
            FILTER (
                'FactInternetSales_Direct',
                'FactInternetSales_Direct'[OrderDateKey] > 0 //force an always true filter on the DQ table
            ),
            FILTER ( 'DimOrderDate_Dual', 'DimOrderDate_Dual'[DateKey] < ArchiveDT )
        ),
        CALCULATE (
            SELECTEDMEASURE (),
            FILTER ( 'DimOrderDate_Dual', 'DimOrderDate_Dual'[DateKey] >= ArchiveDT )
        )
    )

At this point, the combined set seems to be functioning just like we want it to. As the date slicer is modified, the data in the table adapts accordingly and it only does a Direct Query for dates outside of the "Current" partition.

Everything seems fine, but we haven't added a non-aggregated dimension to the scope. To do that, let's add Customer Geography to each of the grids. Notice that in both our "Combined Data" and "Current Data" tables the data disappears. Because we introduced a dimension that isn't part of the aggregate, PowerBI gives full preference to Direct Query, and we filtered those records out using the parameter.

 


[Animation: dataqualityissue.gif]

 

 

After undoing the parameter filter on the Direct Query table (the step we flagged with a warning earlier), our data is working as expected, as you can see in the animation. Better yet, nothing is duplicated or doubled as a result.

 

[Animation: subfinalresults.gif]

 

Finally, we need to apply "SELECTEDMEASUREFORMATSTRING()" to the "Format String Expression" property of each calculation item to ensure those format strings carry through from the original measure.

 

[Animation: setformatstring.gif]

Conclusion

If it seems like the solution is “breaking” PBI’s aggregation functionality, you are justified in thinking that. That’s really what is happening in that simple DAX logic in the calc group: we are forcing PBI to ignore the aggregate table because it lacks the rows we need, even though the relationships are satisfied for the purpose of aggregation. You can get by with just the “Either” calculation item and apply it globally or on a per page/visual basis. That is the result we are really after, although the other two calculation items are useful for debugging.

Another point of consideration is training. Since this will be part of your dataset, any consumers of that dataset in their reports, whether it's PowerBI, Excel, etc., will need to be educated on its existence and when and how to use it.

 

This has been a long and very detailed article for what amounts to a few steps:

  1. Identify the dimension and key that will “break” your aggregated table where you want it to: Period, Date, Sequence, etc.
  2. Add a Boolean indicator to that dimension with logic to determine the true/false value.
  3. Filter your aggregated table so it only contains one side of the Boolean result.
  4. Create a calculation group with the calculation item(s) and write your DAX similar to the “Either” example.
  5. Apply the calculation item as a filter to your visuals or pages accordingly.