Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Shape the future of the Fabric Community! Your insights matter. That’s why we created a quick survey to learn about your experience finding answers to technical questions. Take survey.

Reply
RokuCap
Helper I
Helper I

Creating a Web Query to scrape data from multiple webpages

I was following along to an exercise to 'Web scrape with Power Query' (Link). It involves using a formula in Power Query to import 'Major League Base Detailed Standings' table data from the 'baseball-reference' website for different years (by changing the YYYY in the URL from 2000 to 2020).

 

Steps:

1. Create a query with [Years] column with values from 2000 - 2020

2. Create a second query with the m-code below. It references Table ID 7 in the HTML which should correspond to the 'Major League Base Detailed Standings' table. @9:20 mark

Code 

 

= (year as number) =>
let
   Source = Web.Page(Web.Contents("https://www.baseball-reference.com/leagues/majors/"& Text.From(year) & "-standings.shtml")),
       Data = Source{7}[Data] //Seventh table on the page with ID starting at zero @9.20 min

in
Data

 

 

3. Add a "Invoke Custom Function" column in the first query (years column), using the year as the input to the second query.

 

This step returns Error values against each year:

RokuCap_0-1699951141834.png

 

I changed the table ID in Data = Source{X}[Data] above. X=1-->5 all work fine so I assume it's due to an incorrect Table ID.

 

What m-code would return the 'Major League Base Detailed Standings' table from the Link?

 

Also what changed in the HTML that caused the m-code above to stop working?

 

Thanks in advance!

1 ACCEPTED SOLUTION
PhilipTreacy
Super User
Super User

Hi @RokuCap 

 

Let me check, give me 10 minutes.  A quick look at the web page structure shows me that the actual table of data you want is in HTML comments, meaning the web browser wouldn't display it.

 

So the fact we are seeing it means there's something else going on like maybe JavaScript is reading the table and rendering it to screen.  Unusual.

 

In the meantime I've created this example PBIX file  that loads the data you want via a function.

 

Phil

 



Did I answer your question? Then please mark my post as the solution.
If I helped you, click on the Thumbs Up to give Kudos.


Blog :: YouTube Channel :: Connect on Linkedin


Proud to be a Super User!


View solution in original post

5 REPLIES 5
PhilipTreacy
Super User
Super User

Hi @RokuCap 

 

It doesn't look like you can use the 'old' way of referring to the table because as I said, the table you want isn't visible to the PQ when loaded using Web.Page(Web.Contents()) - what it sees is this

 

mlb-tab.png

 

So the 7th table doesn't exist - hence the error you are seeing.

 

The reason the table you want is 'hidden' is because it's in a section of HTML comments.

 

However by using the Web.BrowserContents function in the query I created, PQ can parse the the web page code to find the table.

 

Regards

 

Phil

 



Did I answer your question? Then please mark my post as the solution.
If I helped you, click on the Thumbs Up to give Kudos.


Blog :: YouTube Channel :: Connect on Linkedin


Proud to be a Super User!


PhilipTreacy
Super User
Super User

Hi @RokuCap 

 

Let me check, give me 10 minutes.  A quick look at the web page structure shows me that the actual table of data you want is in HTML comments, meaning the web browser wouldn't display it.

 

So the fact we are seeing it means there's something else going on like maybe JavaScript is reading the table and rendering it to screen.  Unusual.

 

In the meantime I've created this example PBIX file  that loads the data you want via a function.

 

Phil

 



Did I answer your question? Then please mark my post as the solution.
If I helped you, click on the Thumbs Up to give Kudos.


Blog :: YouTube Channel :: Connect on Linkedin


Proud to be a Super User!


Hi Phil,

 

Your PBIX file works perfectly in edit mode but I get the following error when I try to load the data. Do you get the same errors? It loads really slowly to the point of timing out. I assumed it had something to do with the source website or even my internet connection.

RokuCap_0-1700385834747.png

Thanks again

PhilipTreacy
Super User
Super User

Hi @RokuCap 

 

I imagine the structure of the webpage has chnaged so the reference to the table you want has changed.

 

To test this you can create a new blank query and load this URL 

 

https://www.baseball-reference.com/leagues/majors/2020-standings.shtml

 

PQ presents you with this

 

PhilipTreacy_0-1699956355539.png

 

Check the table you want and click OK to load it.

 

Here's the query to get that table

 

 

let
    Source = Web.BrowserContents("https://www.baseball-reference.com/leagues/majors/2020-standings.shtml"),
    #"Extracted Table From Html" = Html.Table(Source, {{"Column1", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(1)"}, {"Column2", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(2)"}, {"Column3", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(3)"}, {"Column4", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(4)"}, {"Column5", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(5)"}, {"Column6", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(6)"}, {"Column7", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(7)"}, {"Column8", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(8)"}, {"Column9", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(9)"}, {"Column10", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(10)"}, {"Column11", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(11)"}, {"Column12", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(12)"}, {"Column13", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(13)"}, {"Column14", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(14)"}, {"Column15", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(15)"}, {"Column16", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(16)"}, {"Column17", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(17)"}, {"Column18", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(18)"}, {"Column19", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(19)"}, {"Column20", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(20)"}, {"Column21", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(21)"}, {"Column22", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(22)"}, {"Column23", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(23)"}, {"Column24", "TABLE[id='expanded_standings_overall'] > * > TR > :nth-child(24)"}}, [RowSelector="TABLE[id='expanded_standings_overall'] > * > TR"]),
    #"Promoted Headers" = Table.PromoteHeaders(#"Extracted Table From Html", [PromoteAllScalars=true]),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"Rk", Int64.Type}, {"Tm", type text}, {"W", Int64.Type}, {"L", Int64.Type}, {"W-L%", type number}, {"R", type number}, {"RA", type number}, {"Rdiff", type number}, {"SOS", type number}, {"SRS", type number}, {"pythWL", type text}, {"Luck", Int64.Type}, {"vEast", type text}, {"vCent", type text}, {"vWest", type text}, {"Inter", type text}, {"Home", type text}, {"Road", type text}, {"ExInn", type text}, {"1Run", type text}, {"vRHP", type text}, {"vLHP", type text}, {"≥.500", type text}, {"<.500", type text}})
in
    #"Changed Type"

 

 

You'll need to take the relevant parts of that query and put them into your function.

 

Regards

 

Phil

 



Did I answer your question? Then please mark my post as the solution.
If I helped you, click on the Thumbs Up to give Kudos.


Blog :: YouTube Channel :: Connect on Linkedin


Proud to be a Super User!


Thanks Phil,

 

Is the query you posted the only way to reference this table? I.e. there is no longer a table ID associated with that table or something that could be easily referenced within the formula below?

Source{7}[Data]

 

Helpful resources

Announcements
November Carousel

Fabric Community Update - November 2024

Find out what's new and trending in the Fabric Community.

Dec Fabric Community Survey

We want your feedback!

Your insights matter. That’s why we created a quick survey to learn about your experience finding answers to technical questions.

Live Sessions with Fabric DB

Be one of the first to start using Fabric Databases

Starting December 3, join live sessions with database experts and the Fabric product team to learn just how easy it is to get started.