Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn a 50% discount on the DP-600 certification exam by completing the Fabric 30 Days to Learn It challenge.

Reply
Sumsar
Helper II
Helper II

Query Experts needed - Iterative web scraper running slow

Hi,

 

I have a small project working on a Power BI model for demoing features etc. The report used to run, but recently has become absurdly slow.

 

It first consists of this function:

 

let
MetaCritic = (page as number) as table =>
let
Source = Web.BrowserContents("https://www.metacritic.com/browse/games/score/metascore/all/all/all?view=condensed&sort=desc&page=" & Number.ToText(page), [WaitFor=[Timeout=#duration(0, 5, 0, 0)]]),
#"Extracted Table From Html" = Html.Table(Source, {{"Column2", ".numbered"}, {"Column3", ".numbered + *"}, {"Column4", ".next_to_side_col .data"}, {"Column5", ".platform + *"}, {"Column7", ".score:nth-child(1)"}, {"Column8", ".summary"}}, [RowSelector="TR"])
in
#"Extracted Table From Html"
in
MetaCritic

 

When I feed this function any variable (page number), I get a rather quick response with a table corresponding to the 100 rows of the page number provided as variable.

 

Then I create an iterative query, that should return each page up until the max page (280) in succession:

 

let
Source = {0..280},
#"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "Page"}}),
#"Added Custom" = Table.AddColumn(#"Renamed Columns", "MC", each MC_PageFx_2([Page])),
#"Expanded MC" = Table.ExpandTableColumn(#"Added Custom", "MC", {"Column2", "Column3", "Column4", "Column5", "Column7", "Column8"}),
#"Removed Other Columns" = Table.SelectColumns(#"Expanded MC",{"Page", "Column2", "Column3", "Column4", "Column5", "Column7", "Column8"}),
#"Changed Type" = Table.TransformColumnTypes(#"Removed Other Columns",{{"Page", type text}, {"Column2", type text}, {"Column3", type text}, {"Column4", type text}, {"Column5", type date}, {"Column7", type number}, {"Column8", type text}}),
#"Renamed Columns1" = Table.RenameColumns(#"Changed Type",{{"Column2", "Rank"}, {"Column3", "Title"}, {"Column4", "Platform"}, {"Column5", "Date"}, {"Column7", "Score"}, {"Column8", "Description"}})
in
#"Renamed Columns1"

When I click close and apply to this, it startes to load rows into the model, 100 at a time, så the counter goes up from 100, to 200, to 300 etc loaded rows. The problem is however, that each 100 rows takes HOURS to load, so It would at least need to be running a week for 280 pages times 100 rows each would load... 

 

My question is, as it is clearly the iterative query that is creating some kind of bottle neck, how can I improve this query? The function returns pages quickly, and entering the page specific urls in a browser, the page takes a few seconds to load...

 

I also tried both parallele and serial query load, but it seems to be the same regardless.

 

 

 

2 ACCEPTED SOLUTIONS
lbendlin
Super User
Super User

This code takes about 6 seconds per 100 entries.  So about half an hour to fetch everything. Adjust the page range as needed.

 

 

let
    Source = {0..20},
    Extract = (page)=> Html.Table(Web.BrowserContents("https://www.metacritic.com/browse/games/score/metascore/all/all/all?view=condensed&sort=desc&page=" & Text.From(page)), 
       {{"Rank", ".numbered"}, {"Name", ".numbered + *"}, {"Platform", ".next_to_side_col .data"}, {"Released", ".platform + *"}, {"User Score", ".user"}, {"Critic Review", ".score:nth-child(1)"}, {"Summary", ".summary"}}, [RowSelector="TR"]),
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Added Custom" = Table.AddColumn(#"Converted to Table", "Custom", each Extract([Column1])),
    #"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom", "Custom", {"Rank", "Name", "Platform", "Released", "User Score", "Critic Review", "Summary"})
in
    #"Expanded Custom"

 

How to use this code: Create a new Blank Query. Click on "Advanced Editor". Replace the code in the window with the code provided here. Click "Done".

 

 

View solution in original post

Sumsar
Helper II
Helper II

That worls wonders. Thank you!

 

Would it be possible for it to also determine how many pages are actually present, right now I think it is 271, and then make that number dynamic based on this? Thanks again.

View solution in original post

3 REPLIES 3
Sumsar
Helper II
Helper II

That worls wonders. Thank you!

 

Would it be possible for it to also determine how many pages are actually present, right now I think it is 271, and then make that number dynamic based on this? Thanks again.

Use List.Generate with an exit condition when the page doesn't return any more payload data.  The disadvantage of that approach will be that you have to lug the entire resultset around inside the List.Generate, so you better have plenty of memory.  Or you could play dirty tricks and keep trying to fetch pages, throw them away, determine the last page, then run the original code in the hope that your web browser cache has all the data locally available.

lbendlin
Super User
Super User

This code takes about 6 seconds per 100 entries.  So about half an hour to fetch everything. Adjust the page range as needed.

 

 

let
    Source = {0..20},
    Extract = (page)=> Html.Table(Web.BrowserContents("https://www.metacritic.com/browse/games/score/metascore/all/all/all?view=condensed&sort=desc&page=" & Text.From(page)), 
       {{"Rank", ".numbered"}, {"Name", ".numbered + *"}, {"Platform", ".next_to_side_col .data"}, {"Released", ".platform + *"}, {"User Score", ".user"}, {"Critic Review", ".score:nth-child(1)"}, {"Summary", ".summary"}}, [RowSelector="TR"]),
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Added Custom" = Table.AddColumn(#"Converted to Table", "Custom", each Extract([Column1])),
    #"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom", "Custom", {"Rank", "Name", "Platform", "Released", "User Score", "Critic Review", "Summary"})
in
    #"Expanded Custom"

 

How to use this code: Create a new Blank Query. Click on "Advanced Editor". Replace the code in the window with the code provided here. Click "Done".

 

 

Helpful resources

Announcements
RTI Forums Carousel3

New forum boards available in Real-Time Intelligence.

Ask questions in Eventhouse and KQL, Eventstream, and Reflex.

MayPowerBICarousel

Power BI Monthly Update - May 2024

Check out the May 2024 Power BI update to learn about new features.

LearnSurvey

Fabric certifications survey

Certification feedback opportunity for the community.