Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.
Have a challenge for folk out there. Maybe there's a much simpler way to do this and I'm not realizing it, but here is the scenario.
I have a table of sales that is right around 1 MM rows. It is a sales order table with 13 columns.
The two important columns for this calc are: Customer & OrderNbr
I am calculating 12 Month Repurchase Rates.
The formula for this is simply = Total Repeat customers / Total Customers = Repurchase Rate. If I have 100 customers and 10 buy more than once, I have a 10% customer repurchase rate.
Now, I have a chart, and on the X axis is fiscal period. In each fiscal period, I am calculating the 12 month repurchase rate.
I am doing so by using virtual tables.
My calculations are:
VAR repeatTable =
Solved! Go to Solution.
Thanks for testing those out!
Good to get an idea of relative performance, and glad some of them are performing better, but it feels as though performance should be much better!
Period Table filtering
Looking again at the code you posted earlier to determine the date range to filter, have I understood correctly that 'Period Table' your sole date table?
I think you could improve performance by rewriting the logic to filter 'Period Table' like this:
VAR curPeriodIndex =
-- I would generally prefer MAX rather than MIN
-- Doesn't affect performance but makes more sense if filtering
-- on multiple Periods.
MAX ( 'Period Table'[Period Index] )
VAR startPeriod =
curPeriodIndex - 11
RETURN
CALCULATE (
< Some Measure >,
ALL ( 'Period Table' ),
'Period Table'[Period Index] >= startPeriod,
'Period Table'[Period Index] <=curPeriod
)
This version removes filters on 'Period Table' then applies filters to the Period Index column (rather than Date column). The original version using FILTER ( ALL ( 'Period Table' ), ... ) is an iteration over the entire 'Period Table' which can be expensive.
Would you be able to post a model diagram, or a PBIX with an empty 'All Order Table', and I can generate a fact table at my end?
Repeat Customers calculation itself
Going back to the different DAX options for Repeat Customers, some ideas occurred to me, that I probably should have thought of earlier!
Version 5
Uses GENERATE to remove Customers whose first & last order are the same.
If FirstOrder = LastOrder, then EXCEPT ( FirstOrder, LastOrder ) is empty, and that Customer's row won't appear in result.
Repeat Customers Version 5 =
VAR RepeatCustomers =
GENERATE (
VALUES ( 'All Order Table'[Customer] ),
VAR FirstOrder =
FIRSTNONBLANK ( 'All Order Table'[OrderNbr], 0 )
VAR LastOrder =
LASTNONBLANK ( 'All Order Table'[OrderNbr], 0 )
RETURN
EXCEPT ( FirstOrder, LastOrder )
)
VAR NumRepeatCustomers =
COUNTROWS ( RepeatCustomers )
RETURN
NumRepeatCustomers
Version 6
Use HASONEVALUE to see if there is not exactly one OrderNbr for a given Customer. This might be optimised to stop counting when it knows there are 2+ values.
Repeat Customers Version 6 =
VAR RepeatCustomers =
FILTER (
VALUES ( 'All Order Table'[Customer] ),
NOT CALCULATE ( HASONEVALUE ( 'All Order Table'[OrderNbr] ) )
)
VAR NumRepeatCustomers =
COUNTROWS ( RepeatCustomers )
RETURN
NumRepeatCustomers
Version 7
Same as Version 6 but use COUNTROWS (this is really the same logic as Version 3):
Repeat Customers Version 7 =
VAR RepeatCustomers =
FILTER (
VALUES ( 'All Order Table'[Customer] ),
NOT CALCULATE ( COUNTROWS ( 'All Order Table' ) ) = 1
)
VAR NumRepeatCustomers =
COUNTROWS ( RepeatCustomers )
RETURN
NumRepeatCustomers
I'm hoping some of this gets us closer to acceptable performance!
Regards,
Owen
We took this offline, and @OwenAuger was able to get my query down to 6 seconds!
He created a new measure similar to some of the ones above, but first created a new Customer Table, so that the main calculate argument worked off of that table instead of the order table. This, I think, had the largest impact on performance.
His key notes on changes;
Here was the 'winning' formula:
12 Month Repurchase Rate Optimized =
VAR EndPeriodIndex =
MAX ( 'Period Table'[Period Index] )
VAR StartPeriodIndex = EndPeriodIndex – 11
-- Create explicit list of 12 Period Index values
VAR PeriodFilter =
TREATAS (
GENERATESERIES ( StartPeriodIndex, EndPeriodIndex ),
'Period Table'[Period Index]
)
RETURN
CALCULATE (
-- Count Customers with exactly one order
VAR NumSingleOrderCustomers =
COUNTROWS (
FILTER (
VALUES ( 'Customer'[Customer] ),
// Another option that may perform better if customers are "sparse"
-- SUMMARIZE ( 'All Order Table', Customer[Customer] ),
CALCULATE (
COUNTROWS ( 'All Order Table' )
) = 1
// Another option that didn't perform as well
-- CALCULATE ( HASONEVALUE ( 'All Order Table'[OrderNbr] ) )
)
)
VAR AllCustomers =
DISTINCTCOUNT ( 'All Order Table'[Customer] )
-- Subtract to get NumRepeatCustomers
VAR NumRepeatCustomers = AllCustomers - NumSingleOrderCustomers
VAR MMT =
DIVIDE ( NumRepeatCustomers, AllCustomers )
RETURN
MMT,
-- Apply PeriodFilter to above calculation
REMOVEFILTERS ( 'Period Table' ),
PeriodFilter
)
HUGE kudos to @OwenAuger ! Thanks a lot!
Nice, Owen!! Great variations here!
My notes:
First of all, notes about the period table:
CALCULATE (
< Some Measure >,
ALL ( 'Period Table' ),
'Period Table'[Period Index] >= startPeriod,
'Period Table'[Period Index] <=curPeriod
)
Typically, instead of the above, which I have used in the past, I have been using
CALCULATE (
< Some Measure >,
DATESBETWEEN('Period Table'[Day],startOfPeriod,endOfPeriod)
)
This seems to be slightly faster. I also have been using it when I might have some other filter active that I don't want to strip away, and only want to strip away the date (so avoids using ALL). I tested the two variations above on your version #5, and performance was slightly improved (6.3 seconds vs 7.2 seconds) - that was giving the wrong data, but the efficiency test was still accurate. I'll explain more below
I also want to be clear that I'm also using this for the "all customer" calculation here:
VAR allCust =
CALCULATE(
DISTINCTCOUNT('All Order Table'[Customer]),
DATESBETWEEN('Period Table'[Day],startOfPeriod,endOfPeriod)
)
When running this on its own, it takes only about 1.5 seconds. So I know there isn't a problem with this part of the code.
So speaking of the versions, Version 5 was definitely the best yet at 6.3 seconds!
Just to be clear, here is what the code looks like with the modification
VAR RepeatCustomers =
GENERATE (
CALCULATETABLE(
VALUES ( 'All Order Table'[Customer] ),
DATESBETWEEN('Period Table'[Day],startOfPeriod,endOfPeriod)
),
VAR FirstOrder =
FIRSTNONBLANK ( 'All Order Table'[OrderNbr], 0 )
VAR LastOrder =
LASTNONBLANK ( 'All Order Table'[OrderNbr], 0 )
RETURN
EXCEPT ( FirstOrder, LastOrder )
)
VAR NumRepeat =
COUNTROWS ( RepeatCustomers )
This took only 6.2 seconds. But then I realized the data was incorrect. Again, because the FIRSTNONBLANK was removing the filter context of DATESBETWEEN. So I had to add another calculate there. This threw us back down to 26 seconds calc time, and it looked like this:
VAR RepeatCustomers =
GENERATE (
CALCULATETABLE(
VALUES ( 'All Order Table'[Customer] ),
DATESBETWEEN('Period Table'[Day],startOfPeriod,endOfPeriod)
),
VAR FirstOrder =
CALCULATETABLE(
FIRSTNONBLANK ( 'All Order Table'[OrderNbr], 0 ),
DATESBETWEEN('Period Table'[Day],startOfPeriod,endOfPeriod)
)
VAR LastOrder =
CALCULATETABLE(
LASTNONBLANK ( 'All Order Table'[OrderNbr], 0 ),
DATESBETWEEN('Period Table'[Day],startOfPeriod,endOfPeriod)
)
RETURN
EXCEPT ( FirstOrder, LastOrder )
)
VAR NumRepeat =
COUNTROWS ( RepeatCustomers )
This version blew my mind a little bit, Owen! I didn't realize you could write VAR INSIDE a calculated virtual table?!?! It really threw me off. Can you help me understand a little bit how this works?
I think the "GENERATE" calc throws me off a little bit, because I haven't used it much before. Only generate series.
I see that you first create a one column unique table of customers. Then my hunch here is that the FirstOrder "VAR" table gets applied at the row level to the unique customers? Or is it creating another table next to the first and applying a filter from the first of the customer? This is throwing me a bit. And then the EXCEPT is working on FirstOrder and LastOrder table, but then it applies the filter back to the customer table?
Version 6 and Version 7 - these I understand how you're calculating, but again, really smart usage here. Version 6 took about 19 seconds, but 6 was only at 14 seconds. I would not have thought of using calculate (didn't think it was possible) as a filter argument! Genius! Unfortunately, again, I had to add the DATESBETWEEN context to get this to work. And I had to add it to the VALUES table and to the CALCULATE formula. So to be clear, here is the final version:
VAR RepeatCustomers =
FILTER (
CALCULATETABLE(
VALUES ( 'All Order Table'[Customer] ),
DATESBETWEEN('Period Table'[Day],startOfPeriod,endOfPeriod)
),
NOT CALCULATE ( COUNTROWS ( 'All Order Table' ),DATESBETWEEN('Period Table'[Day],startOfPeriod,endOfPeriod) ) = 1
)
VAR NumRepeat =
COUNTROWS ( RepeatCustomers )
As for building out a PBIX, happy to do so, but I can't post the files on the forum - it won't let me.. Thought you could. Anyway, I can send you a PM and could email it to you, or if there is another way to upload here, happy to do that as well.
Hey Owen! Thanks for replying so quickly!
Here are my comments:
The use of period table = we work on a 4-4-5 Fiscal Period Cal that starts in July. The period table is a reference/calendar table that stores all of the company fiscal weeks, periods, and years. Yes, one row per day.
"Ideally, you should have a Date table containing one row per date, with a relationship with 'All Orders Table'. If set up correctly, this will remove the need to construct a date filter using DATESBWEEN, and hopefully improve performance. (There could be something I'm missing here though.)"
-Yes, this is how this is built, and the relationship is setup. However, because the chart is broken out into periods, each period, because of the relationship, is filtered to that period.
For example, the X-Axis point at Period 9 of FY22, "202209" is filtering period table (and thus order date table) to only dates in 202209. This means I have to use something like DATESBETWEEN to remove that filter contex, and instead apply the new filter criteria which is defined by the variables you mentioned:
)
So it's finding first day in the period 11 months prior, and last day of current period to get the 12 month range.
Again, the "DATESBETWEEN" is removing our 'row context' of 202209 (dates only occurrring in FY 2022 Period 9), and applying the new context of all dates in last 12 months.
Okay, so those notes aside, I attempted the variation you supplied, and I'm still getting a 15 second load time. I do appreciate what you've done here though! Counting rows vs counting records, always better. I didn't understand that applying "CALCULATE" to the start of the new summarized column would apply the row context back to the COUNTROWS function. Brilliant! Also appreciate the minor tweak of using DIVIDE instead of "/" - I always forget to do that.
This did give me an idea to attempt removing filter context from PERIOD TABLE completely, then applying filter directly to the Order Table. But that didn't really improve efficiency at all.
Hi again @Hazenm
Thanks for testing it out, and providing further explanation of how your data model is set up.
It looks like my suggestions haven't addressed the core performance issue.
I think if we can focus on the calculation of "number of repeat customers" in isolation, that might help get to the bottom of the issue. I think the Period/Date filtering is probably not hugely impacting performance.
I've come up with some "Repeat Customers" measures to test below, as this seems to be the expensive part of the calculation.
Could you create the below measures and see how they perform, either in a test visual grouped by Year/Month or some other dimension(s)?
I was using DAX Studio myself to test these in a sample model, so you could also do something similar.
Interested to hear how performance is with these, and whether any of them is an improvement!
Regards,
Owen
Repeat Customers Version 1
Repeat Customers Version 1 =
// SUMMARIZE by Customer/OrderNbr
VAR CustomerOrder =
SUMMARIZE (
'All Order Table',
'All Order Table'[Customer],
'All Order Table'[OrderNbr]
)
// GROUPBY to count orders per customer
VAR CustomerOrderCount =
GROUPBY (
CustomerOrder,
'All Order Table'[Customer],
"@Orders", SUMX ( CURRENTGROUP (), 1 )
)
VAR RepeatCustomers =
FILTER ( CustomerOrderCount, [@Orders] > 1 )
VAR NumRepeatCustomers =
COUNTROWS ( RepeatCustomers )
RETURN
NumRepeatCustomers
Repeat Customers Version 2
Repeat Customers Version 2 =
// ADDCOLUMNS / SUMMARIZE / DISTINCTCOUNT to get orders per customer
VAR CustomerOrderCount =
ADDCOLUMNS (
SUMMARIZE ( 'All Order Table', 'All Order Table'[Customer] ),
"@Orders", CALCULATE ( DISTINCTCOUNT ( 'All Order Table'[OrderNbr] ) )
)
VAR RepeatCustomers =
FILTER ( CustomerOrderCount, [@Orders] > 1 )
VAR NumRepeatCustomers =
COUNTROWS ( RepeatCustomers )
RETURN
NumRepeatCustomers
Repeat Customers Version 3
Repeat Customers Version 3 =
// ADDCOLUMNS / SUMMARIZE / COUNTROWS to get orders per customer
VAR CustomerOrderCount =
ADDCOLUMNS (
SUMMARIZE ( 'All Order Table', 'All Order Table'[Customer] ),
"@Orders", CALCULATE ( COUNTROWS ( 'All Order Table' ) )
)
VAR RepeatCustomers =
FILTER ( CustomerOrderCount, [@Orders] > 1 )
VAR NumRepeatCustomers =
COUNTROWS ( RepeatCustomers )
RETURN
NumRepeatCustomers
Repeat Customers Version 4
(a bit of an experiment - don't expect to perform too well)
Repeat Customers Version 4 =
// SUMMARIZE by Customer/OrderNbr
VAR CustomerOrder =
SUMMARIZE (
'All Order Table',
'All Order Table'[Customer],
'All Order Table'[OrderNbr]
)
// Find First Order for each Customer
VAR CustomerFirstOrder =
GENERATE (
VALUES ( 'All Order Table'[Customer] ),
FIRSTNONBLANK ( 'All Order Table'[OrderNbr], 0 )
)
// Take set difference between CustomerOrder & CustomerFirstOrder
VAR CustomerExceptFirstOrder =
EXCEPT ( CustomerOrder, CustomerFirstOrder )
VAR RepeatCustomers =
SUMMARIZE ( CustomerExceptFirstOrder, 'All Order Table'[Customer] )
VAR NumRepeatCustomers =
COUNTROWS ( RepeatCustomers )
RETURN
NumRepeatCustomers
Owen!
Thanks so much for putting these together! This was really nice!
Here are the results:
First off, I had to add the DATESBETWEEN function to each of these variations to allow them to calc the correct number.
Version One:
This was the best one. After applying the datesbetween function, it took 9 seconds originally. I made some minor tweaks to try to reduce the load, but I couldn't get it much below 9. For example, the original summarize is not required, because the order table is already one row per order. So I changed the summarize to just:
I thought this would reduce performance further, but no change. I think there might be a way here to get it down, but already, this is much improved from last results.
VERSION TWO:
After applying datesbetween, this one took just under 19 seconds
VERSION THREE:
VERSION FOUR:
I love the creativity in this process. This is what I was trying to come up with when I was originally trying to develop some way to create a NON-DISTINCT formula.
Unfortunately, as you predicted, this took longer, at 23 seconds.
I even attempted to play with it a little bit to make it faster. The GENERATE table part of the formula was taking quite abit, so I changed it to this:
Thanks for testing those out!
Good to get an idea of relative performance, and glad some of them are performing better, but it feels as though performance should be much better!
Period Table filtering
Looking again at the code you posted earlier to determine the date range to filter, have I understood correctly that 'Period Table' your sole date table?
I think you could improve performance by rewriting the logic to filter 'Period Table' like this:
VAR curPeriodIndex =
-- I would generally prefer MAX rather than MIN
-- Doesn't affect performance but makes more sense if filtering
-- on multiple Periods.
MAX ( 'Period Table'[Period Index] )
VAR startPeriod =
curPeriodIndex - 11
RETURN
CALCULATE (
< Some Measure >,
ALL ( 'Period Table' ),
'Period Table'[Period Index] >= startPeriod,
'Period Table'[Period Index] <=curPeriod
)
This version removes filters on 'Period Table' then applies filters to the Period Index column (rather than Date column). The original version using FILTER ( ALL ( 'Period Table' ), ... ) is an iteration over the entire 'Period Table' which can be expensive.
Would you be able to post a model diagram, or a PBIX with an empty 'All Order Table', and I can generate a fact table at my end?
Repeat Customers calculation itself
Going back to the different DAX options for Repeat Customers, some ideas occurred to me, that I probably should have thought of earlier!
Version 5
Uses GENERATE to remove Customers whose first & last order are the same.
If FirstOrder = LastOrder, then EXCEPT ( FirstOrder, LastOrder ) is empty, and that Customer's row won't appear in result.
Repeat Customers Version 5 =
VAR RepeatCustomers =
GENERATE (
VALUES ( 'All Order Table'[Customer] ),
VAR FirstOrder =
FIRSTNONBLANK ( 'All Order Table'[OrderNbr], 0 )
VAR LastOrder =
LASTNONBLANK ( 'All Order Table'[OrderNbr], 0 )
RETURN
EXCEPT ( FirstOrder, LastOrder )
)
VAR NumRepeatCustomers =
COUNTROWS ( RepeatCustomers )
RETURN
NumRepeatCustomers
Version 6
Use HASONEVALUE to see if there is not exactly one OrderNbr for a given Customer. This might be optimised to stop counting when it knows there are 2+ values.
Repeat Customers Version 6 =
VAR RepeatCustomers =
FILTER (
VALUES ( 'All Order Table'[Customer] ),
NOT CALCULATE ( HASONEVALUE ( 'All Order Table'[OrderNbr] ) )
)
VAR NumRepeatCustomers =
COUNTROWS ( RepeatCustomers )
RETURN
NumRepeatCustomers
Version 7
Same as Version 6 but use COUNTROWS (this is really the same logic as Version 3):
Repeat Customers Version 7 =
VAR RepeatCustomers =
FILTER (
VALUES ( 'All Order Table'[Customer] ),
NOT CALCULATE ( COUNTROWS ( 'All Order Table' ) ) = 1
)
VAR NumRepeatCustomers =
COUNTROWS ( RepeatCustomers )
RETURN
NumRepeatCustomers
I'm hoping some of this gets us closer to acceptable performance!
Regards,
Owen
Hi @Hazenm
The logic of the calcuation looks good, but there should be room to improve performance here.
Repurchase Rate =
VAR CustomerNumOrders =
CALCULATETABLE (
ADDCOLUMNS (
SUMMARIZE ( 'All Order Table', 'All Order Table'[Customer] ),
"@NumOrders", CALCULATE ( COUNTROWS ( 'All Order Table' ) )
),
DATESBETWEEN ( 'Period Table'[Day], startOfPeriod, endOfPeriod )
)
VAR Customers =
COUNTROWS ( CustomerNumOrders )
VAR RepeatCustomers =
COUNTROWS ( FILTER ( CustomerNumOrders, [@NumOrders] > 1 ) )
VAR MMT =
DIVIDE ( RepeatCustomers, Customers )
RETURN
MMT
Does something like this work, and how is performance?
Regards,
Owen
Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City
Check out the April 2024 Power BI update to learn about new features.
User | Count |
---|---|
45 | |
25 | |
19 | |
16 | |
8 |
User | Count |
---|---|
71 | |
49 | |
45 | |
20 | |
17 |