Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

stevedep

Series: Predicting Car Prices using Power BI (part 1)

Photo from https://www.mitsubishi-motors.nl/modellen/outlanderPhoto from https://www.mitsubishi-motors.nl/modellen/outlander

 

Introduction

This article is the first in a series. In these articles, we will predict occasion car prices (of Mitsibutsi Outlanders in the Netherlands). This article is a re-post from Medium, I am the original author. 

When you are able to predict car prices it helps to understand if a car has a relatively high or low price. Therefore, it supports decision-making when buying a car.

Once you own the car the fitted model values help to understand how quickly your car depreciates in value.

This article is structured as follows. We will start with sharing the approach. From there we describe the results of the fitted model, the final paragraph will describe how we can collect and parse the web data using Power Query / Power M. This is the main topic of this article.

The next article describes how to use R within Power Query to fit the model. The final article will provide a detailed description of the interpretation of the model f.e. by understanding confidence values.

 

The Approach

Data acquisition and manipulation are done using Power Query, the predictive model is built using R (using regression), and some plots are also generated using ggplot2 R Visuals. All steps are run using one Power BI file.

 

The data

Data is obtained from gaspedaal.nl.

Cars listed on gaspedaal.nlCars listed on gaspedaal.nl

 

The result of fitting a multiple linear regression model

This paragraph will view and interpret the regression model results.

When we fit a model using the age, the mileage, and the catalog price to predict the price, the results are as displayed below:

Image by author, taken from the Power BI file with the results.Image by author, taken from the Power BI file with the results.

 

Interpretation

Assuming the statistical certainty is high (we will get there later), we can interpret the results as follows:

For each year that the car ages, the value depreciates in value with 1958 euro.

For each kilometre that you drive, your car depreciates in value by 6 euro cent.

The full model is as follows:

Price (Y) = 35173 euro (a) + -1958 * age (X1) + 0.17 * CatalogPrice (X2) + -0.06 * Mileage / km (X3) + sigma*

Sigma, the error term, has an average of 0 and a standard deviation of 1865 euros.

The R-Squared is 95%, meaning that 95% of the variance is explained by this model. This is a good result. In a later article, we will dive into these outcomes to better understand the amount of uncertainty that remains. In the next paragraph, we will dive into the data collection approach using Power BI’s Power Query / Power M.

 

Getting the data

In this paragraph, we will describe, in detail, how to collect/scrape data from the web using Power M. The below video explains it as well, apologies for the messy start since there was no audio in the first 20 seconds. Let me introduce the video over here:

As described above, our model uses the age, mileage, and catalog price. The first two values (and the asking price) are obtained from gaspedaal.nl. The catalog price is obtained from autowereld.nl. The below video describes how to obtain the catalog prices from autowereld.nl.

 

The video goes straight into the details. Just to clarify over here. The autowereld website shows links to a set of models that fall within the same period:

stevedep_0-1672817111826.png

 

 

Therefore, we need to iterate over this list of links and then collect the data that's visible over there:

stevedep_1-1672817124450.png

That page holds the catalog prices that we are interested in.

 

The code

We start off by getting the webpage contents, notice the timeout expression at the end (important!).

let
Source = Web.BrowserContents("https://www.autowereld.nl/info/mitsubishi/outlander/", [WaitFor = [Timeout = #duration(0,0,0,10)]]),

As demonstrated in the video we use Html.Table to retrieve a table using a CSS selector. Within DevTool we can test a CSS selector when we open the tab ‘Elements’ and press Ctrl-F. In our case, we can select the url’s by searching for a.item.model. Once this proves to be a valid selector we use it in Power M:

HtmlTable = Html.Table( Source, {{"Link", "a", each [Attributes][href]}}, [RowSelector=".item.model"]),

I recommend watching the video to see how you can convert the steps required per URL into a function that you apply on each of the URLs (as you iterate over the list of URLs).

The video mentions that we create a list of words that describe the model that we found, using the code below:

Custom1 = Table.TransformColumns( #"Replaced Value1", {"Model", each List.Transform( Text.Split(_, " "), each Text.Trim(_))}),

The next video will demonstrate how we try to match as many as possible words from this list to link/join a car (for sale on gaspedaal.nl) with this table of catalog prices. We apply this approach since we are not able to use a full match, in a way we will be implementing a fuzzy join.

The full code as shown in the video:

 

 

let
    Source = Web.BrowserContents("https://www.autowereld.nl/info/mitsubishi/outlander/",  [WaitFor = [Timeout = #duration(0,0,0,10)]]),
    HtmlTable = Html.Table( Source, {{"Link", "a", each [Attributes][href]}}, [RowSelector=".item.model"]),
    #"Replaced Value" = Table.ReplaceValue(HtmlTable,"/info/","https://www.autowereld.nl/info/",Replacer.ReplaceText,{"Link"}),
    //Example = #"Replaced Value"[Link]{0},
    
    fnGetDetails = (url) =>
    let
        OpenUrl = Web.BrowserContents(url),
        HtmlTableDetails = Html.Table( OpenUrl, {{"Model", ".type"}, {"Price", ".price"}, {"Transmission", ".transmission"}, {"From", ".from"}, {"Until", ".until"}}, [RowSelector="a.item"])
    in
        HtmlTableDetails,
    GetDetails = Table.AddColumn(#"Replaced Value", "Data", each fnGetDetails([Link])),
    #"Removed Other Columns" = Table.SelectColumns(GetDetails,{"Data"}),
    #"Expanded Data" = Table.ExpandTableColumn(#"Removed Other Columns", "Data", {"Model", "Price", "Transmission", "From", "Until"}, {"Model", "Price", "Transmission", "From", "Until"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Expanded Data",{{"Price", Int64.Type}, {"From", Int64.Type}, {"Until", Int64.Type}}),
    #"Uppercased Text" = Table.TransformColumns(#"Changed Type",{{"Model", Text.Upper, type text}}),
    #"Replaced Value1" = Table.ReplaceValue(#"Uppercased Text","E +","E+",Replacer.ReplaceText,{"Model"}),
    Custom1 = Table.TransformColumns( #"Replaced Value1", {"Model", each List.Transform( Text.Split(_, " "), each Text.Trim(_))}),
    #"Added Custom" = Table.AddColumn(Custom1, "Length", each List.Count([Model])),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Transmission"}),
    #"Changed Type1" = Table.Buffer( Table.TransformColumnTypes(#"Removed Columns",{{"Price", Int64.Type}, {"Until", Int64.Type}, {"From", Int64.Type}, {"Length", Int64.Type}}))
in #"Changed Type1"

 

 

 

In the below video, we will retrieve the list of occasions, including their mileage and age. From there we will try to match each car with an entry in our table containing the catalog prices.

 

There is not much to add to the video except for sharing the code:

 

 

(p) => let
    Source = Web.BrowserContents(Url & "&page=" & p,   [WaitFor = [Timeout = #duration(0,0,0,10)]]),
    JsonDocument = Json.Document( Text.Split( Text.Split( Source, "<script id=""__NEXT_DATA__"" type=""application/json"">"){1}, "</script>"){0}),
    props = JsonDocument[props],
    pageProps = props[pageProps],
    initialState = pageProps[initialState],
    searchReducer = initialState[searchReducer],
    occasions = searchReducer[occasions],
    #"Converted to Table" = Table.FromList(occasions, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1", {"id", "title", "model", "modelId", "brand", "brandId", "price", "img", "year", "bodyType", "fuel", "isImported", "provider", "licensePlate", "km", "description", "nameEntry", "place", "province", "endAuction", "dealerInfo", "popup", "portalListDescription", "portals", "schemaOrg"}, {"id", "title", "model", "modelId", "brand", "brandId", "price", "img", "year", "bodyType", "fuel", "isImported", "provider", "licensePlate", "km", "description", "nameEntry", "place", "province", "endAuction", "dealerInfo", "popup", "portalListDescription", "portals", "schemaOrg"}),
    #"Uppercased Text" = Table.TransformColumns(#"Expanded Column1",{{"title", Text.Upper, type text}}),
    #"Replaced Value" = Table.ReplaceValue(#"Uppercased Text","E +","E+",Replacer.ReplaceText,{"title"}),
    #"Added Custom" = Table.AddColumn(#"Replaced Value", "Model.1", each List.Transform( Text.Split([title], " "), each Text.Trim(_))),
    #"Renamed Columns" = Table.RenameColumns(#"Added Custom",{{"Model.1", "ModelInfo"}}),
    #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"year", Int64.Type}}),
    //ExampleInfo = #"Changed Type"[ModelInfo]{1},
    //ExampleYear = #"Renamed Columns"[year]{1},
    fnGetCatalogPrice = (Model, Year) =>
    let
        ListPrices = ListPrices,
        InterSectCount = Table.AddColumn(ListPrices, "InterSectCount", each List.Count( List.Intersect({[Model], Model}))),
        YearFilter = Table.AddColumn(InterSectCount, "FilterYear", each Year >= [From] and Year <=[Until]),
        #"Filtered Rows" = Table.SelectRows(YearFilter, each ([FilterYear] = true)),
        #"Sorted Rows" = Table.Sort(#"Filtered Rows",{{"InterSectCount", Order.Descending}}),
        SelectValue = if Table.RowCount(#"Sorted Rows") > 0 then #"Sorted Rows"[Price]{0} else null
    in
        SelectValue,
    GetCatalogPrice = Table.AddColumn(#"Changed Type", "CatalogPrice", each fnGetCatalogPrice([ModelInfo], [year])),
    #"Removed Other Columns" = Table.SelectColumns(GetCatalogPrice,{"id", "title", "price", "year", "fuel", "provider", "licensePlate", "km", "description", "nameEntry", "place", "province", "endAuction", "ModelInfo", "CatalogPrice"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Removed Other Columns",{{"endAuction", type datetime}, {"CatalogPrice", Int64.Type}, {"km", Int64.Type}, {"price", Int64.Type}})
    in #"Changed Type1"

 

 

 

 

let
    Source = List.Generate(() => 1, each _ <= NumberOfPages, each _ + 1),
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Changed Type" = Table.TransformColumnTypes(#"Converted to Table",{{"Column1", type text}}),
    #"Renamed Columns" = Table.RenameColumns(#"Changed Type",{{"Column1", "Page"}}),
    #"Added Custom" = Table.AddColumn(#"Renamed Columns", "Data", each fnCarData([Page])),
    #"Removed Other Columns" = Table.SelectColumns(#"Added Custom",{"Data"}),
    #"Expanded Data" = Table.ExpandTableColumn(#"Removed Other Columns", "Data", {"id", "title", "price", "year", "fuel", "provider", "licensePlate", "km", "description", "nameEntry", "place", "province", "endAuction", "ModelInfo", "CatalogPrice"}, {"id", "title", "price", "year", "fuel", "provider", "licensePlate", "km", "description", "nameEntry", "place", "province", "endAuction", "ModelInfo", "CatalogPrice"}),
    #"Added Custom1" = Table.AddColumn(#"Expanded Data", "age", each 2022 - [year]),
    #"Changed Type1" = Table.TransformColumnTypes(#"Added Custom1",{{"age", Int64.Type}, {"price", Int64.Type}, {"id", Int64.Type}}),
    #"Removed Duplicates" = Table.Distinct(#"Changed Type1", {"id", "price", "year", "fuel", "licensePlate", "km", "CatalogPrice", "age"})
in
    #"Removed Duplicates"​

 

 

 

Conclusion

This concludes the first article in the series of 3 articles. Stay tuned to receive the next article where we will predict the car prices / fit a multiple linear regression model using R in Power BI!

This article is a re-post from Medium, I am the original author. 

Comments