March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early bird discount ends December 31.
Register NowBe one of the first to start using Fabric Databases. View on-demand sessions with database experts and the Microsoft product team to learn just how easy it is to get started. Watch now
Scenario:
Through this article, we will learn how to use PowerBI desktop to bulk collect data from multiple web pages instead of crawling one page of date only while we are extracting data from one web.
Sample web site: Power BI visuals in AppSource
Step 1: Analysis of the structure of the website
Scroll down to the bottom of the page and find where the page number is displayed. Click on the first four pages. The URL is as follows:
https://appsource.microsoft.com/en-us/marketplace/apps?product=power-bi-visuals&page=1
https://appsource.microsoft.com/en-us/marketplace/apps?product=power-bi-visuals&page=2
https://appsource.microsoft.com/en-us/marketplace/apps?product=power-bi-visuals&page=3
https://appsource.microsoft.com/en-us/marketplace/apps?product=power-bi-visuals&page=4
Now we could find that the last number is the ID of the page number, which is the variable that controls the paging data.
Step 2: Pull out the first page of data using PowerBI
Open the Power BI Desktop, get the data from the web page, select Advanced in the pop-up window, according to the above analysis of the URL structure, enter the URL in the first line except the ID of the last page, and enter the page number in the second line:
See the URL preview, the two rows above have been automatically merged together; The separate input here is just to make it a clearer distinction between page number variables, in fact, It is also possible to enter the full URL directly.
By the way, If the page number variable is not the last digit, but is in the middle, enter the URL in three lines.
Click OK to continue
Click OK to enter Power Query
Now the first page of data has been imported in, the data structure will be consistent with the data structure of the first page when collecting other pages below, so they can be used directly.
Step 3: Set the custom function based on the page number parameter
Open the "Advanced Editor" for this query, and enter:
(p as number) as table =>
Change the "1" after the "&" in the first line after the" let ":
(Number.ToText(p))
Then click Done, the query of the data in the first page will become a custom function.
P is the variable of the function that controls the page number, Simply entering a number, such as 2, will pull out the data on page 2. Input parameters can only crawl one webpage at a time, now we will Crawl webpage in batch by the next step.
Step 4: Batch calls to custom functions
First build a sequence of numbers using the blank query. If you want to crawl the first 20 pages of data, build a sequence from 1 to 20 and enter in the empty query
= {1.. 20}
Enter generates a sequence from 1 to 20, Then converts it into a table and rename column1
Then we call the custom function
Click the "Function Query" drop-down box, and select the custom function Table 1 created just now
Click OK and start to crawl the webpage in batches.
Now expand “Table 1” column and since there is no data after 6 pages, then filter out null values with column1.
Step 5: Batch change to column names
After we crawl web page data in batches in power query, then we need to transform the data. We find that all the column names are automatically generated as Column1, Column2,…
Now we can double-click the column name directly and then rename it one by one, and if there are many column names that need to be changed, using the M function: Table.renamecolumns to do batch work.
= Table.ToColumns(#"Transposed Table")
= Table.RenameColumns(#"Filtered Rows",#"column name table",MissingField.Ignore)
Note: If the list of original column names contains names that do not exist in the table, an error will be reported. To avoid this situation, we can use the third parameter, MissingField.Ignore,
This will automatically skip the column name without an error if there is a mismatch.
Author: Lin Li
Reviewer: Zoey Huan
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.