I love using PowerBI for analysis! Recently PowerBI added support for AutoML (Supervised Automated machine Learning) and I wanted to demonstrate the new ML capabilities with an end to end example. In this article, I use AutoML to create and apply a regression model. Regression is a machine learning technique used for understanding the relationship between input and output variables (input is what is known in your data and output is generally the effect of input data and is sometimes unknown). Regression technique gives us more insight into data by making it easier to see relation between dependent and independent variables. These are the five steps I followed in this article to create the Regression model:
Refer this tutorial for a detailed walk through of each of these steps, and this tutorial for ones specific to regression models.
Creating a dataflow with the input data
I created a dataflow using the House Sales prediction dataset from Kaggle. It is also available, at the following link: house sales prediction for purposes of this article. It contains sale prices, number of bedrooms, square footage, number of floors, latitude, longitude etc. of houses in King County (which includes Seattle) sold between May 2014 and May 2015. The Price attribute indicating the price of the house is the numeric field used as label for the Regression model. The dataset was split into two entities HousePrices2014 and HousePrices2015 based on the year attribute.
Note that though most of the attribute data types in this dataset are numeric, these may be interpreted as strings by Power Query. So, most of these fields, including the label, were converted to numeric.
Training a machine learning model
I chose the HousePrices2014 as the entity to apply ML models. Power BI analyzed the Price field and suggested Regression as the type of machine learning model that can be created to predict that field. It took around an impressive 7 minutes to train 12 algorithms on 14 thousand data points.
Model validation report
After the training refresh completed, I checked out the training report. Key Influencers shows that Latitude, Sqft_living, Grade are the top predictors for the model.
To verify this, I plotted a ft_living (square footage of the home) and price. The plot does indeed show a relationship indicating that that higher square feet living indicates higher prices. As shown in the sqft_living breakdown above, 2.7k to 2.9k sqfeet houses has the highest share of range 1 priced houses (72k to 1M).
Likewise, a scatterplot between grade (overall grade given to the housing unit, based on King County grading system) and price shows houses with higher grades does have higher prices.
The model is 88% performant, which is a pretty great value for this dataset. As you can see in the chart below, most of the test samples lie in the diagonal showing the low error in prediction.
The Training Details page shows the performance of the 25 different models, how features were extracted from the inputs, and the hyperparameters for the final model used. AutoML chose an ensemble, Pre fitted Soft Voting Regressor as the best model for this dataset.
Applying the model
As I was satisfied with the training results, I proceeded to apply the model to the HousePrices2015 entity. A HousePrices2015 enriched RegressionModel was created, which includes the predicted output from the model. Overall, it took only around 2 minutes to score nearly 7 thousand rows!
Applying a Regression model added two columns with the Regression outcome, and the top record-specific influencers for each prediction. The HousePrices2015 enriched Regression entity with explanation, predicted prices is shown below along with the actual prices in PowerBI Desktop. The local explanations contains break down of how much each feature moved the specific instance away from the average price of houses. For categorical features, it shows how much price would have changed for different values of the categories. In the highlighted example, you can see that the house price was predicted to be $554174. The explanation says that, "condition", caused the price to increase by 44k from the average price of houses, and that the "latitude" caused the price to fall down by 26k.
I established a connection to the dataflow from PBI Desktop, to incorporate these predictions from the model in PBI reports so that I can visualize these results. I plotted a heat map of predicted prices using latitude, longitude. Prices were represented by a red (100% gradient stop) through green (50% gradient stop), to yellow(0% gradient stop) gradient.
We can see that, for example, Downtown Seattle and Downtown Bellevue have red regions as expected, owing to high prices.
I tried a scikit-learn Random Forest on this same dataset. It took 1 min using 3-fold cross validation, searching across 30 different combinations, using all available cores to give an R2 score of 0.85.
Summary
Even though AutoML took longer, I am impressed to have gotten a better R2 score of 0.88 by trying not just Random Forest but 25 different models, in addition to getting instance-level explanations. Moreover, AutoML automatically does preprocessing like dropping features like ID with no useful information and generating additional features for datetime like month, year etc. On the other hand, in scikit-learn one must encode string or date-time like features or drop it.
Overall, I’m very excited about AutoML, because I did not have to learn the intricacies of the ML models nor worry about the implementation and still get the same results. I have done these in Python and R and it took me considerably longer to code up and setup the hyperparameters, whereas in AutoML I am just hitting a bunch of buttons and voila I am done! Given how much easier it is to get started with lesser effort, I encourage business analysts and even data scientists/analysts to try out AutoML!