Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

v-lili6-msft

Batch crawling web page data

Scenario:  

Through this article, we will learn how to use PowerBI desktop to bulk collect data from multiple web pages instead of crawling one page of date only while we are extracting data from one web.

 

Sample web site: Power BI visuals in AppSource

 

Step 1: Analysis of the structure of the website

Scroll down to the bottom of the page and find where the page number is displayed. Click on the first four pages. The URL is as follows:

 

https://appsource.microsoft.com/en-us/marketplace/apps?product=power-bi-visuals&page=1

https://appsource.microsoft.com/en-us/marketplace/apps?product=power-bi-visuals&page=2

https://appsource.microsoft.com/en-us/marketplace/apps?product=power-bi-visuals&page=3

https://appsource.microsoft.com/en-us/marketplace/apps?product=power-bi-visuals&page=4

 

Now we could find that the last number is the ID of the page number, which is the variable that controls the paging data.

 

Step 2: Pull out the first page of data using PowerBI

Open the Power BI Desktop, get the data from the web page, select Advanced in the pop-up window, according to the above analysis of the URL structure, enter the URL in the first line except the ID of the last page, and enter the page number in the second line: 

v-lili6-msft_0-1615453424163.png

 

 

See the URL preview, the two rows above have been automatically merged together; The separate input here is just to make it a clearer distinction between page number variables, in fact, It is also possible to enter the full URL directly. 

 

By the way, If the page number variable is not the last digit, but is in the middle, enter the URL in three lines.

 

Click OK to continue

v-lili6-msft_1-1615453424170.png

 

 

Click OK to enter Power Query

 

v-lili6-msft_2-1615453424186.png

 

Now the first page of data has been imported in, the data structure will be consistent with the data structure of the first page when collecting other pages below, so they can be used directly.

Step 3: Set the custom function based on the page number parameter

Open the "Advanced Editor" for this query, and enter:

(p as number) as table =>

Change the "1" after the "&" in the first line after the" let ":

(Number.ToText(p))

v-lili6-msft_3-1615453424191.jpeg

 

 

Then click Done, the query of the data in the first page will become a custom function.

v-lili6-msft_4-1615453424192.png

 

P is the variable of the function that controls  the page number, Simply entering a number, such as 2, will pull out the data on page 2. Input parameters can only crawl one webpage  at a time, now we will Crawl webpage in batch by the next step. 

 

Step 4: Batch calls to custom functions

First build a sequence of numbers using the blank query. If you want to crawl the first 20 pages of data, build a sequence from 1 to 20 and enter in the empty query

= {1.. 20}

Enter generates a sequence from 1 to 20, Then converts it into a table and rename column1

v-lili6-msft_5-1615453424196.png

 

v-lili6-msft_6-1615453424199.png

 

 

Then we call the custom function

v-lili6-msft_7-1615453424203.png

 

Click the "Function Query" drop-down box, and select the custom function Table 1 created just now

v-lili6-msft_8-1615453424204.png

 

 

Click OK and start to crawl the webpage in batches.

v-lili6-msft_9-1615453424206.png

 

Now expand “Table 1” column and since there is no data after 6 pages, then filter out null values with column1.

v-lili6-msft_10-1615453424214.png

 

 

Step 5: Batch change to column names

After we crawl web page data in batches in power query, then we need to transform the data. We find that all the column names are automatically generated as Column1, Column2,…

Now we can double-click the column name directly and then rename it one by one, and if there are many column names that need to be changed, using the M function: Table.renamecolumns to do batch work.

  1. Make a column name table, enter the original and new column names and import them into power bi

v-lili6-msft_11-1615453424215.png

 

  1. Transpose this column name table

v-lili6-msft_12-1615453424220.png

 

  1. Insert step after “Transposed Table” step to convert this table to list:

= Table.ToColumns(#"Transposed Table")

v-lili6-msft_13-1615453424223.png

 

v-lili6-msft_14-1615453424227.png

 

  1. Now in the query table that needs to be changed, insert a step as below:

= Table.RenameColumns(#"Filtered Rows",#"column name table",MissingField.Ignore)

v-lili6-msft_15-1615453424234.png

 

Note: If the list of original column names contains names that do not exist in the table, an error will be reported. To avoid this situation, we can use the third parameter, MissingField.Ignore,

This will automatically skip the column name without an error if there is a mismatch. 

 

Author: Lin Li

Reviewer: Zoey Huan

Comments