Re: Correlation coefficient

Daniil · ‎06-18-2017

The quick measure calculates the Pearson correlation coefficient between two measures within the category.

NAME:

Correlation coefficient

DESCRIPTION:

Calculate the Pearson correlation coefficient between two measures within the category

PARAMETERS:

Name: Category

Tooltip: The category in which you want to calculate the correlation coefficient

Type: Categorical field

Name: Measure X

Tooltip: The first measure in a correlation pair

Type: Numerical field / measure

Name: Measure Y

Tooltip: The second measure in a correlation pair

Type: Numerical field / measure

DAX:

Correlation Coefficient :=
VAR Correlation_Table =
    FILTER (
        ADDCOLUMNS (
            VALUES ( {Category} ),
            "Value_X", CALCULATE ( {Measure X} ),
            "Value_Y", CALCULATE ( {Measure Y} )
        ),
        AND (
            NOT ( ISBLANK ( [Value_X] ) ),
            NOT ( ISBLANK ( [Value_Y] ) )
        )
    )
VAR Count_Items =
    COUNTROWS ( Correlation_Table )
VAR Sum_X =
    SUMX ( Correlation_Table, [Value_X] )
VAR Sum_X2 =
    SUMX ( Correlation_Table, [Value_X] ^ 2 )
VAR Sum_Y =
    SUMX ( Correlation_Table, [Value_Y] )
VAR Sum_Y2 =
    SUMX ( Correlation_Table, [Value_Y] ^ 2 )
VAR Sum_XY =
    SUMX ( Correlation_Table, [Value_X] * [Value_Y] )
VAR Pearson_Numerator =
    Count_Items * Sum_XY - Sum_X * Sum_Y
VAR Pearson_Denominator_X =
    Count_Items * Sum_X2 - Sum_X ^ 2
VAR Pearson_Denominator_Y =
    Count_Items * Sum_Y2 - Sum_Y ^ 2
VAR Pearson_Denominator =
    SQRT ( Pearson_Denominator_X * Pearson_Denominator_Y )
RETURN
    DIVIDE ( Pearson_Numerator, Pearson_Denominator )

eyJrIjoiMGQ5YzJiYTItZWFiMy00MGI2LTg1NzktYjMwYTU1YjA2N2M3IiwidCI6ImQzMmNkYzNmLTY1NTUtNGNhYy1iYjFhLTg2OWZiMTE0MzRlNSJ9

dfreeman2 · ‎05-14-2024

Sorry to revive an old thread, but I was wondering what the differences were in the DAX Linest and Linestx in calculating the coefficient of detemination. I was using the square for the Pearson coefficient to get R squared. It appears that I am getting different coefficient values between the two functions. I checked in excel and also was getting the same values as the Linest function. Would you happen to know if there are approximations in the Linest function vs your Pearson coefficient?

lbendlin · ‎05-14-2024

Can't see a reason for that, but is likely a question for @jeffrey_wang

dfreeman2 · ‎05-14-2024

@jeffrey_wang The measure forces the data to be summarized. Where Linest and Linestx are using the raw data to determine the best fit line and calculating all the variables. Would appear that for correlation coefficient, this is not a good quick measure, as the R value is skewed without being able to not use a summarization.

jeffrey_wang · ‎05-14-2024

Linest is just a shortcut for LinestX with the <table> argument derived from the column references. The internal algorithm is the same.

dfreeman2 · ‎05-14-2024

Right, I was trying to determine why the Quick measure for Correlation Coefficient is giving an R value that when squared does not match Linest. Since the quick measure is forcing the summarization, it is only an estimated R value.

Joey_ · ‎06-29-2023

Thank you so much for this! Helped out a lot and saved me plenty of time.

systemnova · ‎01-17-2021

This is wonderfully useful thanks @Daniil
I've been trying to modify your DAX to use in a scenario where the variables to be correlated are pivoted and the categories are contained in an attribute column (example below). Sadly I'm not having much luck, does anyone have any suggestions on how to generate a correlation matrix with this data structure?

Item	Attribute	Value
A	Value X	2
B	Value X	3
C	Value X	5
D	Value X	7
A	Value Y	0
B	Value Y	1
C	Value Y	1
D	Value Y	2

lbendlin · ‎08-10-2021

Statistics only work with numbers. You need to add numerical key columns for your text columns before you can calculate the correlation.

Anonymous · ‎08-06-2022

Apologies for reviving an old question however I want to check your response to systemnova if you wouldn't mind.

Say I wanted to check the correlation between day of the week and temperature and had collected the below data:

Index	Category	Temp
1	Monday	20
2	Monday	24
3	Monday	19
4	Wednesday	15
5	Wednesday	28
6	Wednesday	12
7	Wednesday	20
8	Friday	20
9	Friday	10

Are you saying I can swap the days of the week for a numerical key such as Monday = 1, Wednesday = 2 and Friday = 3 to give the below and then use this quick measure to produce meaningful results? Unfortunately I am still very new to stats and Power BI. Any direction would be greatly appreciated. Thanks.

Index	Category	Temp
1	1	20
2	1	24
3	1	19
4	2	15
5	2	28
6	2	12
7	2	20
8	3	20
9	3	10

lbendlin · ‎08-06-2022

yes, that's what I am saying. Replace text values with their numerical index and compute the correlation over the indexes. Afterwards you can map it back.

(bit unfortunate that your sample table has an "Index" column. Please discard that, it doesn't help)

Anonymous · ‎08-07-2022

Discard the Index column? I assumed that would be the 'Category' element of the quick measure. Would it be correct to instead have the following, where the elements of the quick measure are:

Category = Day of Week column

Measure X = Day of Week Key column
Measure Y = Temp column

Index	Day of Week	Day of Week Key	Temp
1	Monday	1	20
2	Monday	1	24
3	Monday	1	19
4	Wednesday	2	15
5	Wednesday	2	28
6	Wednesday	2	12
7	Wednesday	2	20
8	Friday	3	20
9	Friday	3	10

lbendlin · ‎08-07-2022

Ah, the quick measure. I never understood why they added a category there and made it mandatory. It has (in my opinion) nothing to do with the computation. I always implement the Pearson algorithm manually.

Anonymous · ‎03-19-2020

Hello everyone,

A little bit stuck on this one... Can someone guide me?I have a data set with the following fields...

Year
Venue
Event Length(in days)
Average Event revenue per day

I want to see if there's a correlation between even length and event revenue per day. I thought I could use the year as the category, but I think this might be wrong. Using the Quick Measure, what Category, and 2 measures should I use to establish the hypotheses that "the longer the event, the more/less menoy per day is made"

Thanks!

PowerBI_77 · ‎01-22-2019

Thanks for the formula Daniil. I'm trying to use it to calculate the correlation coefficient on some data I have but I have hit an issue - I only have two columns of data, one showing the month name and one showing the volume (a measure).

I.e.

Month Volume

Jan-07 1,000

Feb-07 1,613

Mar-07 1,128

etc.

I don't have two measures as per your data nor do I have a column showing a 'category'. How would I be able to carry out the calculation for my data please? Do I need to create a new measure column based on the month name, coverting it into a numerical value? (just a guess). Then I would classify the month column as the 'category'?

Hope this is clear (still classing myself as a Power BI newbie!)

Regards,

Lee

Daniil · ‎01-22-2019

@PowerBI_77 please have a look at the second example in my Simple Linear Regression blog -- you can employ a similar technique here.

Also, in case people still read this -- @acanepa said in a private message there "is not an error of the calculation you created but rather an error on my end to feed the formula with wrong numbers", so please ignore his comment about negative numbers.

ZanderFick · ‎01-24-2018

Very handy addition.

Are there, however, plans to add a measure/some other output feature that will also report on the uncertainty of the Correlation Coefficient calculated for a given series pair (i.e. implementing Fisher's z-transformation and evaluating the confidence interval at difference levels that the user chooses, or just a standard set of levels like 80%, 90 % and 95%)

The risk is that people could state (and frequently do state) correlation coefficients for insufficiently sized samples and derive insights that are actually attributable to noise etc.

Thanks for the awesome work!

acanepa · ‎11-15-2017

Hello @Daniil,

Thanks for this, this a great solution for correlation calculations on DAX.

I have tried the calculation with different measures and I will add the following improvement. In the following line, you could get a negative number.

VAR Pearson_Denominator =
    SQRT ( Pearson_Denominator_X * Pearson_Denominator_Y )

I suggest this slight modification to run in all scenarios.

VAR Pearson_Denominator =
    SQRT(ABS( Pearson_Denominator_X * Pearson_Denominator_Y ))

Regards,

numbus · ‎11-26-2017

Hi acanepa

Good programming practice would suggest avoiding a division by zero, however you need to think like a statistician - which can often be counter intuitive!

You actually want the Pearson Coefficient to "fail" when you divide by zero.

See this post for more info https://stackoverflow.com/questions/38548343/pearson-correlation-fails-for-perfectly-correlated-sets

From a DAX point of view the divide function will tolerate a division by zero.

Regards

Graeme

FabCon is coming to Atlanta