The ultimate Microsoft Fabric, Power BI, Azure AI, and SQL learning event: Join us in Stockholm, September 24-27, 2024.
Save €200 with code MSCUST on top of early bird pricing!
Find everything you need to get certified on Fabric—skills challenges, live sessions, exam prep, role guidance, and more. Get started
05-08-2018 07:58 AM - last edited 05-08-2018 08:00 AM
This measure allows you to predict dependent values Y from independent values X.
Simple linear regression
Estimate Y values based on X values.
Name: Category
Tooltip: The category for which you have known X and Y values
Type: Categorical field
Name: Measure X
Tooltip: Known X (independent) values
Type: Numerical field / measure
Name: Measure Y
Tooltip: Known Y (dependent) values
Type: Numerical field / measure
Estimated {Measure Y} =
VAR Known =
FILTER (
SELECTCOLUMNS (
ALLSELECTED ( {Category} ),
"Known[X]", CALCULATE ( {Measure X} ),
"Known[Y]", CALCULATE ( {Measure Y} )
),
AND (
NOT ( ISBLANK ( Known[X] ) ),
NOT ( ISBLANK ( Known[Y] ) )
)
)
VAR Count_Items =
COUNTROWS ( Known )
VAR Sum_X =
SUMX ( Known, Known[X] )
VAR Sum_X2 =
SUMX ( Known, Known[X] ^ 2 )
VAR Sum_Y =
SUMX ( Known, Known[Y] )
VAR Sum_XY =
SUMX ( Known, Known[X] * Known[Y] )
VAR Average_X =
AVERAGEX ( Known, Known[X] )
VAR Average_Y =
AVERAGEX ( Known, Known[Y] )
VAR Slope =
DIVIDE (
Count_Items * Sum_XY - Sum_X * Sum_Y,
Count_Items * Sum_X2 - Sum_X ^ 2
)
VAR Intercept =
Average_Y - Slope * Average_X
RETURN
Intercept + Slope * {Measure X}
For more details on and other uses of this quick measure, see my blog post on the subject:
https://xxlbi.com/blog/simple-linear-regression-in-dax/
eyJrIjoiZWNiNTQ2MGEtMjhlNS00YTNhLWE2NTktZDg3MzcxMjQ0NDc4IiwidCI6ImQzMmNkYzNmLTY1NTUtNGNhYy1iYjFhLTg2OWZiMTE0MzRlNSJ9
Hi Daniil,
Thanks a lot for your code, it actually helped me a lot to understand way better DAX syntax and measures. I have a comment, more related to the logic and statistical part of the code than to the coding perse. I believe that 9th line of code, instead of having an AND you should have an OR. My logic is the following, imagine you have only four data points for variables X and Y:
X Y
0 2
3 1
4 0
0 0
According to your code, "AND (NOT ( ISBLANK ( Known[X] ) ),NOT ( ISBLANK ( Known[Y] ) ))" you are filtering and taking only the data points that are not zero on X AND are not zero on Y. That, in my example, would exclude from the calculation all but the second data point. I believe this is a statistical error, since lines one and three from the example provide information that you should take into account when making linear regression. On the other hand, if you change that line for "OR(NOT ( ISBLANK ( Known[X] ) ),NOT ( ISBLANK ( Known[Y] ) ))", you are filtering and taking the data points that are not zero on X OR not zero on Y. In the example provided, that, takes out only the fourth data point which doesn't provide any statistically significant information, at least from my perspective. Would you mind clearing this out for me?
Regards
I've dabbled in multiple linear regression. Here's what it looks like with three x variables.
Regression Coefficients =
VAR ShortNames =
SELECTCOLUMNS (
Returns,
"A", [Equity], /*Known X1 values*/
"D", [Duration], /*Known X2 values*/
"C", [Credit], /*Known X3 values*/
"Y", [Manager] /*Known Y values*/
)
VAR n = COUNTROWS ( ShortNames )
VAR A = SUMX ( ShortNames, [A] )
VAR D = SUMX ( ShortNames, [D] )
VAR C = SUMX ( ShortNames, [C] )
VAR Y = SUMX ( ShortNames, [Y] )
VAR AA = SUMX ( ShortNames, [A] * [A] ) - A * A / n
VAR DD = SUMX ( ShortNames, [D] * [D] ) - D * D / n
VAR CC = SUMX ( ShortNames, [C] * [C] ) - C * C / n
VAR AD = SUMX ( ShortNames, [A] * [D] ) - A * D / n
VAR AC = SUMX ( ShortNames, [A] * [C] ) - A * C / n
VAR DC = SUMX ( ShortNames, [D] * [C] ) - D * C / n
VAR AY = SUMX ( ShortNames, [A] * [Y] ) - A * Y / n
VAR DY = SUMX ( ShortNames, [D] * [Y] ) - D * Y / n
VAR CY = SUMX ( ShortNames, [C] * [Y] ) - C * Y / n
VAR BetaA =
DIVIDE (
AY*DC*DC - AD*CY*DC - AY*CC*DD + AC*CY*DD + AD*CC*DY - AC*DC*DY,
AD*CC*AD - AC*DC*AD - AD*AC*DC + AA*DC*DC + AC*AC*DD - AA*CC*DD
)
VAR BetaD =
DIVIDE (
AY*CC*AD - AC*CY*AD - AY*AC*DC + AA*CY*DC + AC*AC*DY - AA*CC*DY,
AD*CC*AD - AC*DC*AD - AD*AC*DC + AA*DC*DC + AC*AC*DD - AA*CC*DD
)
VAR BetaC =
DIVIDE (
- AY*DC*AD + AD*CY*AD + AY*AC*DD - AA*CY*DD - AD*AC*DY + AA*DC*DY,
AD*CC*AD - AC*DC*AD - AD*AC*DC + AA*DC*DC + AC*AC*DD - AA*CC*DD
)
VAR Intercept =
AVERAGEX ( ShortNames, [Y] )
- AVERAGEX ( ShortNames, [A] ) * BetaA
- AVERAGEX ( ShortNames, [D] ) * BetaD
- AVERAGEX ( ShortNames, [C] ) * BetaC
RETURN
{ BetaA, BetaD, BetaC, Intercept } /*Pick One*/
(From my StackOverflow post.)
Hi @Daniil,
I'm trying to find the slope of a graph comparing 2 pollution datasets. I altered your measure to do this. The measure works well until it comes to summing the X^2 and XY values. The multiplication doesn't work and this affects the slope calculation.
What could be causing this problem?
PS: I'm very inexperienced with Power BI, so forgive me if this is a stupid question related to SUM and SUMX.
Really excellent stuff !!
I had managed to do a linear regression line previously but this is so much cleaner
I did have to tweak yours slightly
1) Instead of a plain old Allselected i needed to
do
CALCULATETABLE (
SUMMARIZE (
FeederHistory,
FeederHistory[EventDateHour],
FeederHistory[EventDate]
),
ALLSELECTED ( FeederHistory )
),
and for some reason SSRS moaned at me for using SELCTEDVALUE so I replaced it with the older hasonevalue values paradigm
ie
RETURN
Intercept
+ Slope
* IF (
HASONEVALUE ( FeederHistory[EventDate] ),
VALUES ( FeederHistory[EventDate] ))
Really excellent stuff !!
I had managed to do a linear regression line previously but this is so much cleaner
I did have to tweak yours slightly
1) Instead of a plain old Allselected i needed to
do
CALCULATETABLE (
SUMMARIZE (
FeederHistory,
FeederHistory[EventDateHour],
FeederHistory[EventDate]
),
ALLSELECTED ( FeederHistory )
),
and for some reason SSRS moaned at me for using SELCTEDVALUE so I replaced it with the older hasonevalue values paradigm
ie
RETURN
Intercept
+ Slope
* IF (
HASONEVALUE ( FeederHistory[EventDate] ),
VALUES ( FeederHistory[EventDate] ))
Hi
Thanks for your comments. If, like me you ever need to get the first and last values for a given filtered interval to calculate the total trend change over time, or rate of trend change over time, the Starting and Ending measures are great. When you get something like this working it really puts a smile on your face! With these complicated measures I can highly recomment DAX formatter if you are not already using it.
Tony
I know this is a Power Bi portal but I had to make it work for both PBI and SSRS
Hi
I am using your Simple Linear Regression measure with great success. Thank You.
I am doing temperature studies and I want to find the first and last values of your measure in a given time interval which is filtered by a slicer. By doing this I can calculate the difference in the trend line over the interval of filtered years.
I can find the first and last year of the filtered interval by using FIRSTNONBLANK and LASTNONBLANK but I can't do this with your code because FIRSTNONBLANK and LASTNONBLANK only accept a column as an argument. I cannot seem to convert your measure into a calculated column to do this. Is there a version of your measure that produces a column? My data has two columns year and temperature. X and Y.
Any help would be much appreciated.
Tony Maclaren
Thanks for your feedback, Tony!
If I understand you correctly, there are a few ways to achieve your goal.
Here is an example of a measure:
Starting Temperature = VAR Known = FILTER ( SELECTCOLUMNS ( ALLSELECTED ( 'Table'[Year] ), "Known[X]", 'Table'[Year] ), "Known[Y]", [Temperature] ), AND ( NOT ( ISBLANK ( Known[X] ) ), NOT ( ISBLANK ( Known[Y] ) ) ) ) VAR First = TOPN ( 1, Known, Known[X], ASC ) RETURN MINX ( First, Known[Y] )
Dear Daniil
Thank you for your very prompt response. I will be in Sydney, where I grew up, in early December!
Here is the tweaked code below (minus some brackets) that does the trick.Thank you so much.
This is an incredibly useful companion measure to your Simple Linear Regression measure, and because it works with measures it will find the equivalent of FIRSTNONBLANK in a measure. However it does not work correctly in the filtered context of a YEAR slider to filter the interval of years which the Simple Linear Regression measure does .The value changes with the slider, but the results do not correctly match the estimated starting value of the Estimated measure except for the first value of the whole dataset.
Starting Temperature =
VAR Known =
FILTER (
SELECTCOLUMNS (
ALLSELECTED ( 'cetdata'[YEAR] ),
"Known[X]", 'cetdata'[YEAR],
"Known[Y]", [Estimated]
),
AND (
NOT ( ISBLANK ( Known[X] ) ),
NOT ( ISBLANK ( Known[Y] ) )
)
)
VAR First =
TOPN ( 1, Known, Known[X], ASC )
RETURN
MINX ( First, Known[Y] )
NOT ( ISBLANK ( Known[Y] ) )
)
)
VAR First =
TOPN ( 1, Known, Known[X], ASC )
RETURN
MINX ( First, Known[Y] )
Now we have to fix the filter context and figure out how to find [Ending temperature] so we can subtract [Starting Temperature] and get the trend change per filtered interval.
Thanks again--great stuff. No one else had a clue what my problem was or how to solve it.
Tony
Tony, you should visit the local Power BI User Group if you get a chance 🙂 The next meeting date should be announced at Meetup soon.
Try the following measures:
Starting Temperature = VAR Estimate = SELECTCOLUMNS ( KEEPFILTERS ( ALLSELECTED ( 'cetdata'[YEAR] ) ), "Estimate[X]", 'cetdata'[YEAR], "Estimate[Y]", [Estimated] ) VAR First = TOPN ( 1, Estimate, Estimate[X], ASC ) RETURN MINX ( First, Estimate[Y] )
Ending Temperature = VAR Estimate = SELECTCOLUMNS ( KEEPFILTERS ( ALLSELECTED ( 'cetdata'[YEAR] ) ), "Estimate[X]", 'cetdata'[YEAR], "Estimate[Y]", [Estimated] ) VAR Last = TOPN ( 1, Estimate, Estimate[X], DESC ) RETURN MAXX ( Last, Estimate[Y] )
Temperature Difference = [Ending Temperature] - [Starting Temperature]
Dear Daniil
Thank you very much--all working perfectly now. KEEPFILTERS did the trick.I have really learned a lot! I have now ordered the SQLBI book.
One strange thing however,The [Estimated] measure which is your Simple linear Regression measure produces a total if displayed in a table. I do not quite understand this.
Best Regards
and thanks again
Tony
@tonymaclaren, it is likely that at the grand total level what you see is not a total, but the Intercept. This is because the following expression evaluates to Intercept:
Intercept + Slope * SELECTEDVALUE ( 'cetdata'[YEAR] )
At the grant total level, there is usually more than one 'cetdata'[YEAR], hence SELECTEDVALUE ( 'cetdata'[YEAR] ) returns BLANK, turning the multiplication into BLANK as well, leaving only Intercept.
I updated my blog post to deal with this kind of situations.
@Anonymous, glad you found this useful!
Dar Daniil
Thanks for the explanation.The same value also appears at the top of the [Estimated] measure, but only if the whole of the data for ('Ctedata'[YEAR])is displayed. is there a modification to eliminate this?
Tony
Tony, it's a bit difficult to advise anything without seeing your report -- can you share a sample of your data?
Hi Daniil
Sure, no problem.What would you like me to supply and how.The full dataset is for 391 years.
Tony
Wow, 391 years of data sounds exciting 🙂 If you could reduce your data to the minimum amount with which you could reproduce your formula problems, that would be great.
Thanks Daniil
I have attached a subset of the file that exhibits the exact same problem using the following measure:
Estimated tony Data =
VAR Known =
FILTER (
SELECTCOLUMNS (
ALLSELECTED ( 'Tony_data'[YEAR] ),
"Known[X]", 'Tony_data'[YEAR],
"Known[Y]", [Averagetemp]
),
AND (
NOT ( ISBLANK ( Known[X] ) ),
NOT ( ISBLANK ( Known[Y] ) )
)
)
VAR Count_Items =
COUNTROWS ( Known )
VAR Sum_X =
SUMX ( Known, Known[X] )
VAR Sum_X2 =
SUMX ( Known, Known[X] ^ 2 )
VAR Sum_Y =
SUMX ( Known, Known[Y] )
VAR Sum_XY =
SUMX ( Known, Known[X] * Known[Y] )
VAR Average_X =
AVERAGEX ( Known, Known[X] )
VAR Average_Y =
AVERAGEX ( Known, Known[Y] )
VAR Slope =
DIVIDE (
Count_Items * Sum_XY - Sum_X * Sum_Y,
Count_Items * Sum_X2 - Sum_X ^ 2
)
VAR Intercept =
Average_Y - Slope * Average_X
RETURN
Intercept + Slope * SELECTEDVALUE ( 'Tony_data'[YEAR])
Averagetemp = SUM(Tony_data[AverageTemperature])
Tony
Tony, I can't reproduce the problem, unfortunately. Can you please attach a pbix file and screenshots with the problem highlighted?
Also, have you followed my updated blog post? I don't use SELECTEDVALUE anymore: https://xxlbi.com/blog/simple-linear-regression-in-dax/
Thanks for the measure Daniil, it is extremely helpful.
I am attempting to apply it to a forecasting problem that is relatively simple in excel, but is causing some trouble for me with Dax/PowerBI. I have 19 different categories that I need to run the regression on, and then I need to sum those categories for use on further regressions. Could your measure be modified to accomplish this or do I need to create 19 different measures?
Thanks for any help you could provide.