Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

spg_vizcube

ETL Through Multiple Pages of a Website in Power BI

In this article, we are going to go through an example of performing ETL (Extraction, Transformation, Load) function by looping through multiple pages on a website.

 

Pre-extraction

For this, we are going to use a website called planecrashinfo.com, which has historical data of all the plane crashes from 1920 onwards.

 
Picture1.png

 

Now if you click on Database Page, you will find that there are links to many HTML pages, starting from 1920 till 2020, and we would like to extract data for each of the HTML pages here.

 
Picture2.png

 

To start, we are going to open one of the pages randomly and extract the data of that web page in Power BI. In this example, I select 1922. Below is how 1922 looks like.

Picture3.png

 

To extract this data in Power BI, we would simply open Power BI desktop and select the Get Data option -> Web. To follow the steps easily, I have embedded a GIF file here.

 

Get Data in Power BI DesktopGet Data in Power BI Desktop

 

 

Once you have selected the table, hit Transform Data and go to Power Query Editor.

 

[Edit] - One thing I noticed in the latest #powerbi version is that when I import a web page with Table, it calls Web[.]BrowserContents(URL) method. While this method is doing the intended, this eliminates any line breaks characters your text might have. (A must for text transformation). One quick hack I found is replacing it with Web[.]Page(Web[.]Contents(URL)). You can also refer to my YouTube Video for a step-by-step approach.

 

Transform

In this step, we are going to transform this data a bit,

  • Split the Location/Operator by line breaks [#(cr)] to make two columns, Location, and Operator.

  • Split the Aircraft Type/Registration by line breaks [#(cr)] to make two columns, Aircraft Type, and Registration.

  • Split the fatalities column by / and ( into three columns i.e. Air Fatalities, Aboard, Ground Fatalities.

 

Using Parameters

The next step is to define a Parameter, We can define a parameter using the Manage Parameters option in the Query editor, however, here we would use Advanced Editor and use M Code to define our parameter.

 

Using Advanced Editor to pass on a ParameterUsing Advanced Editor to pass on a Parameter

 

To test this, we can trigger the AircraftEvents function and enter some years manually, and this will create a table in the Query Editor with the data from that specific year.

 

Now, since we want data for all the Years from this website, we would import the list of years from this website, remove the unwanted column and unpivot multiple columns into one column. Rename it as “year".

 

Extracting year list from the websiteExtracting year list from the website

 

Now, since we have two queries, the first one being AircraftEvents, which is a functional query, and the second one a list of all the years from the website, we need to find a way to pass this list of years to the AircraftEvents function.

 

To do this, we will add a custom column in the Year_List query and invoke the function AircraftEvents with the argument [Year], where [Year] is the name of the column in the Year_List table.

 

= AircraftEvents([Year])

 

This will give us a custom column with many tables, one table for each year. Once we expand the table, we get a consolidated data table with Plane crash data of all the years.

 

Calling AircraftEvents table in Year_List tableCalling AircraftEvents table in Year_List table

 

Your dataset is ready for you to do cool visualizations.