
Anonymous
Not applicable

Convert weighted list to vector or list

I have a dataset of thousands of products with millions of transactions, where most transactions have QTY <> 1.  One metric I'm attempting to calculate is the skewness of a product's price over time, which requires transforming a dataset like:

 

Product   Price   Qty
A         $0.90   2
A         $1.00   1
A         $1.10   1
B         $1.00   1

 

List for A:

$0.90, $0.90, $1.00, $1.10

 

This way, I can derive:

  • Mean = $0.975
  • Median = $0.95
  • Mode = $0.90
  • Std Dev = 0.082916
  • Pearson's 1st Skewness = 0.90453
  • Pearson's 2nd Skewness = 0.90453

 

Any recommendations on how DAX can produce a list that repeats values based on a second column?  I'm hesitant to build out a second fact table with this transformation because it will become ridiculously long.  Creating a summarized table using SQL or M won't work because the list needs to respond to filters throughout the rest of the report (e.g. product, customer, transaction date, etc.).  Mean and Mode are easy enough to calculate from the dataset as-is, but I need the transformed data to calculate a proper median and standard deviation.
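For reference, the target transformation and statistics above can be checked with a short script.  A minimal sketch in Python (which the poster mentions as an option later in the thread), using only the standard library and the example table:

```python
import statistics

# Example rows from the table above: (Product, Price, Qty).
rows = [("A", 0.90, 2), ("A", 1.00, 1), ("A", 1.10, 1), ("B", 1.00, 1)]

# Repeat each price Qty times for product A -- the "List for A" above.
prices = [price for prod, price, qty in rows if prod == "A" for _ in range(qty)]
# prices == [0.90, 0.90, 1.00, 1.10]

mean = statistics.mean(prices)        # 0.975
median = statistics.median(prices)    # 0.95
mode = statistics.mode(prices)        # 0.90
stdev = statistics.pstdev(prices)     # population std dev, ~0.082916
skew1 = (mean - mode) / stdev         # Pearson's 1st, ~0.90453
skew2 = 3 * (mean - median) / stdev   # Pearson's 2nd, ~0.90453
```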

1 ACCEPTED SOLUTION

That is a lot of rows.  I thought more about this and came up with a way in DAX to build your table virtually (the approach could also be adapted to create it physically as a DAX calculated table). 

 

I know you'll probably stick with R or Python, but this was a fun challenge, so I worked on it further.  Please see the measures below.  Mean was straightforward and Mode a little trickier, but Median and StdDev are the ones that use the virtual table.  Pearson's skewness is then calculated from mean, median, and stddev (hopefully that approach works for you).

 

Mean =
SUMX ( Data, Data[Price] * Data[Qty] )
    / SUM ( Data[Qty] )

Mode =
TOPN ( 1, VALUES ( Data[Price] ), CALCULATE ( SUM ( Data[Qty] ) ) )

Median = 
VAR numberofvalues =
    SUM ( Data[Qty] )
VAR sequencetable =
    ADDCOLUMNS (
        SUMMARIZE ( Data, Data[Price], Data[Qty] ),
        "minsequence",
        VAR thisprice = Data[Price]
        RETURN
            CALCULATE (
                SUM ( Data[Qty] ),
                ALL ( Data[Price], Data[Qty] ),
                Data[Price] < thisprice
            ) + 1,
        "maxsequence",
        VAR thisprice = Data[Price]
        VAR thisqty = Data[Qty]
        RETURN
            CALCULATE (
                SUM ( Data[Qty] ),
                ALL ( Data[Price], Data[Qty] ),
                Data[Price] < thisprice
            ) + thisqty
    )
VAR numberseries =
    ADDCOLUMNS (
        GENERATESERIES ( 1, numberofvalues, 1 ),
        "PriceValue",
        VAR thisnumber = [Value]
        RETURN
            MINX (
                FILTER (
                    sequencetable,
                    AND ( [minsequence] <= thisnumber, [maxsequence] >= thisnumber )
                ),
                Data[Price]
            )
    )
RETURN
    MEDIANX(numberseries, [PriceValue])


Std Dev = 
VAR numberofvalues =
    SUM ( Data[Qty] )
VAR sequencetable =
    ADDCOLUMNS (
        SUMMARIZE ( Data, Data[Price], Data[Qty] ),
        "minsequence",
        VAR thisprice = Data[Price]
        RETURN
            CALCULATE (
                SUM ( Data[Qty] ),
                ALL ( Data[Price], Data[Qty] ),
                Data[Price] < thisprice
            ) + 1,
        "maxsequence",
        VAR thisprice = Data[Price]
        VAR thisqty = Data[Qty]
        RETURN
            CALCULATE (
                SUM ( Data[Qty] ),
                ALL ( Data[Price], Data[Qty] ),
                Data[Price] < thisprice
            ) + thisqty
    )
VAR numberseries =
    ADDCOLUMNS (
        GENERATESERIES ( 1, numberofvalues, 1 ),
        "PriceValue",
        VAR thisnumber = [Value]
        RETURN
            MINX (
                FILTER (
                    sequencetable,
                    AND ( [minsequence] <= thisnumber, [maxsequence] >= thisnumber )
                ),
                Data[Price]
            )
    )
RETURN
    STDEVX.P(numberseries, [PriceValue])


Pearson's Skewness = 3*([Mean] - [Median])/[Std Dev]
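The Median and Std Dev measures build the repeated-value table virtually: each distinct (Price, Qty) row gets a sequence range from the cumulative quantity of strictly lower prices, and each GENERATESERIES position is mapped back to the price whose range contains it.  A rough Python sketch of that same logic (not the DAX itself), using product A's rows from the example:

```python
import statistics

data = [(0.90, 2), (1.00, 1), (1.10, 1)]  # product A's (Price, Qty) rows

n = sum(q for _, q in data)  # SUM ( Data[Qty] )
seq = []
for price, qty in data:
    # Total quantity of all strictly lower prices, as in the CALCULATE filters.
    below = sum(q for p, q in data if p < price)
    seq.append((price, below + 1, below + qty))  # (price, minsequence, maxsequence)

# GENERATESERIES ( 1, n ): map each position to the price covering that position.
series = [next(p for p, lo, hi in seq if lo <= i <= hi) for i in range(1, n + 1)]

median = statistics.median(series)  # what MEDIANX returns over the virtual table
stdev = statistics.pstdev(series)   # what STDEVX.P returns
```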

 

Note: Mode may return an error if two prices occur the same number of times.  I focused on building the virtual table for Median and StdDev, and may edit this post later with a more robust Mode (I ran out of time for now).  Hopefully you can use this approach to build the virtual table with the repeated values to meet your needs.

 

[Screenshot: pearson.png]

 

If this works for you, please mark it as the solution.  Kudos are appreciated too.  Please let me know if not.

Regards,

Pat





Did I answer your question? Mark my post as a solution! Kudos are also appreciated!

To learn more about Power BI, follow me on Twitter or subscribe on YouTube.


@mahoneypa HoosierBI on YouTube



4 REPLIES
v-kelly-msft
Community Support

Hi @Anonymous ,

 

First create a calculated column as below:

Column = REPT(CONCATENATE("$"&'Table'[Price]," "),'Table'[Qty])

Then create a measure as below:

List for A = CONCATENATEX(FILTER(ALL('Table'),'Table'[Product]="A"),'Table'[Column]," ")

And you will see:

[Screenshot: Annotation 2020-07-24 100334.png]

For details, please see the attached file.

 

Best Regards,
Kelly
Did I answer your question? Mark my post as a solution!
mahoneypat
Microsoft Employee

I would just add a couple of steps to your existing query to create the extra rows, so your downstream analysis will be easy.  Here is an example of how you can use List.Repeat and the Qty value to get the modified table from your example data.  To see how it works, create a blank query, go to Advanced Editor, and replace the text there with the M code below.

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WclTSUVIx0LM0ANJGSrE6UBFDPQOQiCGyiCFCxAlVTSwA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Product = _t, Price = _t, Qty = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Product", type text}, {"Price", Currency.Type}, {"Qty", Int64.Type}}),
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Custom", each List.Repeat({[Price]}, [Qty])),
    #"Expanded Custom" = Table.ExpandListColumn(#"Added Custom", "Custom"),
    #"Changed Type1" = Table.TransformColumnTypes(#"Expanded Custom",{{"Custom", Currency.Type}})
in
    #"Changed Type1"
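For comparison, the equivalent expansion in Python is the `index.repeat` idiom (a sketch assuming a pandas workflow rather than the M query above):

```python
import pandas as pd

df = pd.DataFrame({"Product": ["A", "A", "A", "B"],
                   "Price": [0.90, 1.00, 1.10, 1.00],
                   "Qty": [2, 1, 1, 1]})

# Repeat each row Qty times and keep only Product/Price -- the same effect
# as List.Repeat followed by Table.ExpandListColumn.
expanded = (df.loc[df.index.repeat(df["Qty"]), ["Product", "Price"]]
              .reset_index(drop=True))
```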

 

If this works for you, please mark it as the solution.  Kudos are appreciated too.  Please let me know if not.

Regards,

Pat







Anonymous
Not applicable

Hi @mahoneypat ,

 

While this solution works in principle, it's failing in application for my use case.  The problem I'm running into is the incredibly resource-inefficient processing going on behind the scenes in Power Query.  Here is my M script:

 

let
    Source = Oracle.Database("DSPROD", [HierarchicalNavigation=true, Query="
select DIM_CUST_CURR_ID,
       dim_item_curr_id,
       UNIT_PRICE,
       sum(sell_qty) as PRMRY_SELL_QTY
  from FOO
where dim_ordr_dt_id >= TO_NUMBER(TO_CHAR(SYSDATE-31,'YYYYMMDD'))
having round(sum(sell_qty),0) > 0
GROUP BY DIM_CUST_CURR_ID,
         dim_item_curr_id,
         UNIT_PRICE
--FETCH FIRST 10 ROWS ONLY"]),
    #"Added Custom" = Table.AddColumn(Source, "Price", each List.Repeat({[UNIT_PRICE]}, Number.Round([PRMRY_SELL_QTY], 0))),
    #"Expanded Prices_1" = Table.ExpandListColumn(#"Added Custom", "Price"),
    #"Removed Other Columns" = Table.SelectColumns(#"Expanded Prices_1",{"DIM_CUST_CURR_ID", "DIM_ITEM_CURR_ID", "UNIT_PRICE"})
in
    #"Removed Other Columns"

 

 

When I download a 40-day transaction history, I begin with a dataframe that is 4 columns wide (customer account number, item number, price, qty) and ~1MM records long; the corresponding CSV (the same query is kicked out by another SQL program for debugging purposes) is ~25MB.  After transforming accordingly, I have a dataframe that is 3 columns wide and 900MM+ rows long, and Power Query somehow consumes 42GB of cache on my hard drive.

 

I need to compare this 40-day history to the trailing 30-day sales history, which starts at 4 columns x ~600K rows, transforms into 3 columns x ~650MM rows, and consumes an additional ~30GB of cache on my hard drive.

 

The only way I was able to get my model to refresh was to refresh each table independently by hand, because my computer otherwise runs out of disk space.  Moreover, I suspect this processing won't fly when refreshing through the online service.  Fortunately, the end result increases the size of the .pbix file by less than 50MB, so I know the problem isn't the amount of data, but rather that Power Query isn't the right tool for the job.  Sadly, I can only use Python or R to perform the manipulations if I use a personal gateway instead of the enterprise gateway 😞
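Given that blowup, it is worth noting (an editorial sketch, not from the thread) that all of the target statistics can be computed from the weighted (price, qty) pairs directly, without ever materializing the repeated rows; the weighted median just walks the cumulative quantity:

```python
import math

def weighted_stats(pairs):
    """pairs: iterable of (price, qty). Returns (mean, median, population stdev)."""
    pairs = sorted(pairs)
    n = sum(q for _, q in pairs)
    mean = sum(p * q for p, q in pairs) / n
    var = sum(q * (p - mean) ** 2 for p, q in pairs) / n

    def value_at(k):
        # Price of the k-th observation (1-based) in the sorted expanded list.
        cum = 0
        for p, q in pairs:
            cum += q
            if cum >= k:
                return p

    if n % 2 == 1:
        median = value_at(n // 2 + 1)
    else:
        median = (value_at(n // 2) + value_at(n // 2 + 1)) / 2
    return mean, median, math.sqrt(var)

mean, median, stdev = weighted_stats([(0.90, 2), (1.00, 1), (1.10, 1)])
```

This runs over the number of distinct (price, qty) rows (~1MM here) rather than the expanded row count (900MM+), so it avoids the cache blowup entirely.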
