Hi all,
Sample pbix attached.
We have a very simplified one-table model (see Table1 below) and two simple measures:
Total revenue = SUM(Table1[Revenue])

PercentileM = PERCENTILEX.INC(ALL(Table1[Client_code]), [Total revenue], 0.25)
We place Client_code and [PercentileM] in a table visual. Note the definition of Table1 in the file; the number of clients can be changed in its code via VAR numClients_.
If the number of clients is 50K, the visual takes about 10 seconds to update. If it's 100K it goes up to almost 40 seconds. If it's 1M, it explodes, takes over an hour and finally crashes.
Two questions:
1. Is there a way to optimize this and avoid it taking so long for large N? See Page 1 in the attached file.
2. (See Page 2 in the attached file.) Since the value produced by [PercentileM] is actually the same for every row, we are being incredibly inefficient by having it run for each of the N rows in the visual. In fact, the result we're after is exactly what we get if we place [PercentileM] in a card visual, which is very fast. Is there a way to take the result from the card visual and place it on each row of the table visual, so that [PercentileM] is executed only once? To clarify what we mean, we could create a one-row, one-column calculated table MeasureResult with that result:
MeasureResult = {[PercentileM]}
and then a measure that just reads that static result, to be used in the table visual (see this on Page 2 of the attached file):
ReadFromTable = MAX(MeasureResult[Value])
This is however an inflexible solution, since the calculated table is static and the user couldn't, for instance, apply filters.
Many thanks
@MFelix @Zubair_Muhammad @mahoneypat @TomMartens @GilbertQ @camargos88 @OwenAuger
Table1
| Client_code | Revenue |
|---|---|
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
| 2 | 4 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
| 3 | 4 |
| … | … |
| N | 1 |
| N | 2 |
| N | 3 |
| N | 4 |
Solved! Go to Solution.
@AlB Here's an approach that gets you close, and might be further refined to get it just right. I added a random number to your original table to test it out (so that the totals were not the same for all clients), like this:
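Something along these lines (just a sketch of the idea, not the exact expression; the random column is named "number", and numClients_ controls N as in the original file):

Table1 =
VAR numClients_ = 100000 // change N here
VAR clients_ = SELECTCOLUMNS ( GENERATESERIES ( 1, numClients_ ), "Client_code", [Value] )
VAR revenues_ = SELECTCOLUMNS ( GENERATESERIES ( 1, 4 ), "Revenue", [Value] )
RETURN
    GENERATE (
        clients_,
        ADDCOLUMNS ( revenues_, "number", RANDBETWEEN ( 1, 100 ) ) // random column so client totals differ
    )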
To learn more about Power BI, follow me on Twitter or subscribe on YouTube.
Thanks. I look forward to your response
That's great! Thanks very much. The improvement is massive. I thought of something like that initially but assumed (wrongly, clearly) that the built-in implementation of PERCENTILEX would already be (quasi)optimized. I am really, really surprised that its performance is so poor compared to your solution 🤔
I made a couple of changes to account for the interpolation part. Not difficult but it took me a while to find online a clear definition of the exact algorithm EXCEL/DAX use. This is a first version:
PercentileINC M3 (PatMahoney's) V3 = //Added interpolation
VAR wantedPerc_ = [Wanted_perc]
VAR baseT_ = ALL(Table1[Client_code])
VAR vSummary = ADDCOLUMNS ( baseT_, "cSum", CALCULATE ( SUM ( Table1[number] ) ) )
VAR vAddRank = ADDCOLUMNS ( vSummary, "cRank", RANKX (vSummary, [cSum],,ASC,Skip))
VAR percentileRank_ = (wantedPerc_ * (COUNTROWS(baseT_) - 1)) + 1 //Difference with .EXC
VAR percRankFloor_ = FLOOR(percentileRank_,1)
VAR percRankCeiling_ = CEILING(percentileRank_,1)
VAR ranksT_ = TOPN(percRankCeiling_-percRankFloor_+1, FILTER(vAddRank, [cRank]<=percRankCeiling_), [cRank], DESC)
VAR val1_ = MINX(ranksT_,[cSum])
VAR val2_ = MAXX(ranksT_,[cSum])
VAR interpolated_ = val1_ + ((val2_ - val1_) * (percentileRank_ - percRankFloor_))
RETURN
interpolated_
Runs in approximately the same time as your version, so adding the interpolation has hardly any impact.
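To make the interpolation concrete, a quick worked example of the rank formula above (my own numbers, not from the file):

percentileRank_ = 0.25 * (9 - 1) + 1 = 3 // with 9 clients: exactly the 3rd-smallest client total
percentileRank_ = 0.25 * (10 - 1) + 1 = 3.25 // with 10 clients: 3rd-smallest total + 0.25 * (4th-smallest - 3rd-smallest)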
And another version, some 20% faster than the previous one:
PercentileINC M3 (PatMahoney's) V4 = //Can it run faster than V3??
VAR wantedPerc_ = [Wanted_perc]
VAR baseT_ = ALL(Table1[Client_code])
VAR vAddRank = ADDCOLUMNS ( baseT_, "cRank", RANKX (baseT_, CALCULATE (SUM ( Table1[number] )),,ASC,Skip))
VAR percentileRank_ = (wantedPerc_ * (COUNTROWS(baseT_) - 1)) + 1 //Difference with .EXC
VAR percRankFloor_ = FLOOR(percentileRank_,1)
VAR percRankCeiling_ = CEILING(percentileRank_,1)
VAR valsT_ = TOPN(percRankCeiling_-percRankFloor_+1, FILTER(vAddRank, [cRank]<=percRankCeiling_), [cRank], DESC)
VAR auxT_ = ADDCOLUMNS(valsT_, "@Value", CALCULATE(SUM(Table1[number])))
VAR val1_ = MINX(auxT_,[@Value])
VAR val2_ = MAXX(auxT_,[@Value])
VAR interpolated_ = val1_ + ((val2_ - val1_) * (percentileRank_ - percRankFloor_))
RETURN
interpolated_
I will definitely mark your response as solution but want to leave this still open for discussion on question 2.
By the way, do you know if it's possible in DAX Studio to execute a query several times and get the average execution time? I.e., instead of executing the query manually 10 times, checking the timings and extracting the average myself, have DAX Studio do it automatically. I believe I'd read somewhere it was possible but cannot find where. Thanks
Please mark the question solved when done and consider giving a thumbs up if posts are helpful.
Contact me privately for support with any larger-scale BI needs, tutoring, etc.
Cheers
Hey @AlB , Hey @mahoneypat ,
we are facing two problems
To overcome the first issue I started using a PATH object. My measure contains this line:
var p = CONCATENATEX( ALLSELECTED( 'Table1'[Client_code] ) , [Total revenue] , "|" , [Total revenue] , ASC )
p contains an ordered structure that can be used with all the PATH functions, like PATHLENGTH or PATHITEM.
The complete measure looks like this:
percentile with path and Client_code =
var __Percentile = 0.25
var p = CONCATENATEX( ALLSELECTED( 'Client_code'[Client_code] ) , [Total revenue] , "|" , [Total revenue] , ASC )
var position = PATHLENGTH( p ) * __Percentile
var valueAtPosition = VALUE( PATHITEM( p , position , TEXT ) )
return
valueAtPosition
As you can see, I'm also using a dedicated Client_code table, and I recommend doing so. There is more than one row per client; for this reason the measure will benefit from filter propagation instead of permanently scanning the table.
The above measure executes in ~3.2 s on my machine (Intel i7-9750H). I don't consider this a substantial gain over the solution @mahoneypat provided, meaning maybe we have to face the fact that we have already reached the optimum 🙂
In regards to your 2nd question, maybe there is the possibility to create a table that just contains a single row using GROUPBY or SUMMARIZE. Then this table will be calculated only once, during data refresh. You can use a simpler measure to blend the PERCENTILE value into your fact table.
Stay safe, stay healthy, have data (and of course A Merry Christmas)
Tom
Great. Thanks a lot. I will look at it over the coming days and adapt it to include interpolation like PERCENTILEX.INC. I might come back with some additional questions.
Thanks
Please mark the question solved when done and consider giving a thumbs up if posts are helpful.
Contact me privately for support with any larger-scale BI needs, tutoring, etc.
Cheers
Thanks a lot for your reply. This looks GREAT, very interesting. I just saw that the results your measure produces are not exactly the same as what PERCENTILEX.INC yields. I believe however that is simply the interpolation and probably will not take much to fix. A few questions. I know you are on holidays. No rush 😊
1. I haven't worked with PATH objects. I will do my own research but would you care to explain, high level, the logic behind the approach?
2. How do you build the dedicated Client_code table? Is it just an aggregation, with one row per client and their total revenue?
3. "the measure will benefit from filterpropagation instead of permanently scanning the table" What do you exactly mean by this?
4. "In regards to your 2nd question, maybe there is the possibility to create a table that just contains a single row using GROUPBY or SUMMARIZE. Then this table will calculated only once during data refresh" Is this different from what I proposed initially in question 2, with the MeasureResult table and [ReadFromTable]? If not, we would still have the problem of it being a static solution, not responding to filters
5. "You can use a more simple measure to blend the PERCENTILE value into your fact table" Can you elaborate?
Many thanks and a happy Christmas to you too
Please mark the question solved when done and consider giving a thumbs up if posts are helpful.
Contact me privately for support with any larger-scale BI needs, tutoring, etc.
Cheers
Hey @AlB ,
your questions ...
1 - the path object
Basically it's a string (don't be afraid, this string can become really long). What makes this string special is the | sign that separates the items. There are some very efficient functions, like PATHITEM (returning the item from a path at a given position). The value at the 25th position, for instance, can be determined like so:
PATHITEM( path, 25)
From my experience, navigating a string that contains the | sign as a special character (this character is used by all the PATH... functions) is very fast.
Maybe you will find this blog interesting: The previous value - Mincing Data - Gain Insight from Data (minceddata.info)
I create a PATH object by using the table iterator function CONCATENATEX(...). In our example I'm using the clients as the table and the sales value as the expression; I also use the order-by and order-direction parameters of the function. When I'm done, p contains 1M ordered sales values.
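A toy example, just to show the mechanics (PathDemo is a made-up measure, not from the attached file):

PathDemo =
VAR p = "5|12|40|41" // four ordered values separated by |
RETURN
    VALUE ( PATHITEM ( p, 2, TEXT ) ) // returns 12, the value at position 2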
2 - the client table
I use this DAX code to create a client_code table from Table1:
Client_code = DISTINCT( ALLNOBLANKROW( 'Table1'[Client_code] ) )
Using a dedicated table allows me to create a dimension table (on the one side) and a fact table (on the many side), which enables filter propagation.
3 - filter propagation
Filter propagation is one of the ruling principles of the Power BI dataset (basically this concept rules everything Tabular). Sometimes filter propagation is far faster than scanning a single table; the scanning becomes necessary because there are multiple values per Client_code.
4 - SUMMARIZE or GROUPBY
Basically it's not that different from a measure-result table, but I did not understand how you wanted to create the measure result table; I'm sorry for the confusion. Yes, this solution will be static.
5 - using a simplified measure
As soon as there is a result table with a single row and a single column, that cell can be blended into the fact table Table1 by a simple measure like MAX(...), instead of evaluating the complex measure 1M times.
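To make 4 and 5 concrete, something along these lines (table and measure names are just placeholders, and as said this stays static):

PercentileResult = // calculated table, one row, evaluated once at refresh
    ROW ( "Value", PERCENTILEX.INC ( ALL ( Table1[Client_code] ), [Total revenue], 0.25 ) )

Blended percentile = MAX ( PercentileResult[Value] ) // cheap measure that shows the stored value on every row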
Hopefully this answers some of your questions. By the way, vacation is an ideal time for lengthier posts.
Stay safe, stay healthy, have data, and a Merry Christmas
Tom
I'm sure you could do the same in R or Python and it would also be fast. I don't think Charticulator would apply here, but it is an interesting idea to be able to edit the DAX behind a table/matrix visual directly. You should propose it.
Regards,
Pat
To learn more about Power BI, follow me on Twitter or subscribe on YouTube.
@mahoneypat @TomMartens (and everyone else)
Pat, thanks for creating the idea for the visual-level variables.
I was having a look further on question 2. Imagine we have a simple table visual with Table1[Client_code] and this measure that uses the built-in PERCENTILEX.INC:
PercentileM_INC with NumberCol =
VAR wantedPerc_ = [Wanted_perc]
RETURN
PERCENTILEX.INC(ALL(Table1[Client_code]), CALCULATE(SUM(Table1[number])), wantedPerc_)
where [Wanted_perc] is simply a constant. We can then get the query for this visual from the Performance Analyzer:
// DAX Query
DEFINE
VAR __DS0Core =
SUMMARIZECOLUMNS(
ROLLUPADDISSUBTOTAL('Table1'[Client_code], "IsGrandTotalRowTotal"),
"PercentileM_INC_with_NumberCol", 'Table1'[PercentileM_INC with NumberCol]
)
VAR __DS0PrimaryWindowed =
TOPN(502, __DS0Core, [IsGrandTotalRowTotal], 0, 'Table1'[Client_code], 1)
EVALUATE
__DS0PrimaryWindowed
ORDER BY
[IsGrandTotalRowTotal] DESC, 'Table1'[Client_code]
If we run this in DAX Studio and N (number of clients) is 1M, it takes over an hour and crashes, as stated earlier. We can however modify the query to apply that visual-level variable:
// DAX Query
DEFINE
VAR valForAllRows_ = 'Table1'[PercentileM_INC with NumberCol] //Executed only ONCE for the visual
VAR __DS0Core_V2 = //This is the more efficient variation
SUMMARIZECOLUMNS (
ROLLUPADDISSUBTOTAL ( 'Table1'[Client_code], "IsGrandTotalRowTotal" ),
"PercentileM_INC_with_NumberCol", valForAllRows_ //Using the variable here
)
VAR __DS0PrimaryWindowed =
TOPN (
502,
__DS0Core_V2, //CHANGE here to swap between versions
[IsGrandTotalRowTotal], 0,
'Table1'[Client_code], 1
)
EVALUATE
__DS0PrimaryWindowed
ORDER BY
[IsGrandTotalRowTotal] DESC,
'Table1'[Client_code]
The above executes in about 2 seconds.
I was wondering whether something like this could be implemented in a custom visual, where you can control how the visual's query is run internally. Perhaps with Charticulator or with a Python visual?
Thoughts?
Thanks
Please mark the question solved when done and consider giving a thumbs up if posts are helpful.
Contact me privately for support with any larger-scale BI needs, tutoring, etc.
Cheers
@AlB Glad it helped. I love a good DAX or M challenge, and this was a fun one. Most times we don't need to use the value term of RANKX, but it comes in handy. Thank you for sharing your expression to see the improvements you've made. If we could make custom functions like in M, this would be a good one to keep handy.
It does also highlight the potential of having visual level measures that could be referenced on each row, which would make a much bigger improvement still. I added that idea here.
Regards,
Pat
To learn more about Power BI, follow me on Twitter or subscribe on YouTube.
Hey @AlB,
I will look into your question in the upcoming days off, so it can take a while 🙂
Regards,
Tom
Will think on it, but I doubt it. Nice job sharing it in a clever way that is easily scalable for testing potential solutions. This really highlights the need for visual-level and/or page-level variables. Not sure if that idea has already been suggested, but you should suggest it if not.
Regards,
Pat
To learn more about Power BI, follow me on Twitter or subscribe on YouTube.
Interesting thoughts. You created a [Value] column based on the PercentileM measure, but I guess you just cannot avoid the measure calculation once you add the client code to the visual.
Regards
Paul