cancel
Showing results for
Did you mean:

Find everything you need to get certified on Fabric—skills challenges, live sessions, exam prep, role guidance, and more. Get started

Frequent Visitor

## LINEAR REGRESSION - X-AXIS IN TEXT FORMAT

Hello

I am working with a data sample that contains numeric values on "y" axis (pH results) and text values on "x" axis (sample/batch code) and need to build a linear regression analysis.

So far, I found a website "IterationInsights" that published an article "How to do simple linear regression in Power BI",

which really helped me to analyse using date, but not sample code (since it is a text). I need to evaluate the trend between 2 samples, independent of when they were collected.

Note that if x is a Date, we have: y = 0.0023x - 99.065, r² = 0.2566

But if x is the sample, we have: y = 0.0297x + 5.5252, r² = 0.3024

How am I able to build a variable table that converts the sample code to a numeric sequence (A1AA/22 = 1, A2AA/22 = 2, ...., A21AA/22 = 21)?
Besides, it's possible to filter this table? Ex: analyse A12AA/22 to A20AA/22 (interval which would automatically be converted to the numbers: A12AA/22 = 1, A13AA/22 = 2, ....,  A20AA/22 = 9, always starting with 1).

Since slope formula is: [n(Σxy) - (Σx)(Σy)] / [n(Σx²) - (Σx)²],

x should be refeered as a sequence of natural numbers representing the samples.

n is the distinct count of the samples.

I suppose that I should build a data var table to convert sample code to a number, and use All selected function to filter the samples I'm working with, but I'm really having a hard time to combine these functions and obtain a functional linear regression equation.

I exemplified the solpe formula because I think the same solution would be applied to intercep, r², etc.

For slope, I used the formula:

---------------------------------------------------

SLOPE =

var data = SELECTCOLUMNS(ALLSELECTED(DATA),

"x_values",DATA[DATE],

"y_values",DATA[RESULTS]

)

--Variables to Solve for:

var y_sum = SUMX(data,[y_values])

var x_sum = SUMX(data,[x_values])

var x2 = SUMX(data,[x_values]^2)

var xy = SUMX(data,[x_values]*[y_values])

var row_count = COUNTROWS(data)

--Solve for formula

var Slope = ((row_count * xy) - (x_sum * y_sum))/((row_count*x2)-((x_sum)^2))

return Slope

-----------------------------------------------------

Anyone could help me?

2 REPLIES 2
Super User

@pedroluccas So, duplicate your query in Power Query Editor. Remove all rows except the SAMPLE CODE column. Do a Remove Duplicates. Create an Index column (Add Columns | Index). You can then either do a merge to get your index or in a DAX calculated column, add a column to your original table that grabs the Index using RELATED.

You can find sample code here:

Correlation coefficient - Microsoft Power BI Community

If you have seasonality in your data:

De-Seasonalized Correlation Coefficient - Microsoft Power BI Community

Become an expert!: Enterprise DNA
External Tools: MSHGQM
Latest book!:
The Definitive Guide to Power Query (M)

DAX is easy, CALCULATE makes DAX hard...
Frequent Visitor

Thank you very much!

It worked for Slope and R². But now seems that I have other problem.

When I have a huge amount of samples of different products: A1AA/22 (Product A), B1AA/22 (Product B), C1AA/22 (Product C), etc, it generates a index with huge numbers (10,000 or more).

Other scenario is when I wnat to use just an interval of the samples (sample 5 - 12).

Since Intercept formula is b = [(Σy)*(Σx²) - (Σx) * (Σxy)] / [n * (Σx²) - (Σx)²], and now appears the factor (Σx²) on the numerator, huge numbers seems to affect the calculation.

See below the calculation and graphs

What can I do to make the index change according with my selection and restart the count with the number 1?

Announcements

#### Europe’s largest Microsoft Fabric Community Conference

Join the community in Stockholm for expert Microsoft Fabric learning including a very exciting keynote from Arun Ulag, Corporate Vice President, Azure Data.