Don't miss your chance to take the Fabric Data Engineer (DP-600) exam for FREE! Find out how by attending the DP-600 session on April 23rd (pacific time), live or on-demand.
Learn moreNext up in the FabCon + SQLCon recap series: The roadmap for Microsoft SQL and Maximizing Developer experiences in Fabric. All sessions are available on-demand after the live show. Register now
I have written some code to extract the data from a website, but the results are different to what I see in the browser. I noticed this as i have a function running thorugh pages 1 to 22, but I have loads of duplicates
Take this page
https://www.legionellacontrol.org.uk/search/?c=11&q=&r=&p=18
If i just paste that as a web date source in Power query I get completely different companies. See excel here
this is with absolutely no processing. How can this be? is this some clever dynamic listing on the website to stop people scrapping the data? Any advice appreciated
Power query is returning which are not the companies on that web page
| Water Logic Works Ltd |
| Coral Environmental Ltd |
| Titan Water Limited |
| CLIRA Limited |
| Trident Water Solutions Limited |
| Brighter Compliance Limited |
| Direct-Tech Solutions Limited |
| Lucion Services Limited |
| HLA Services Limited |
| Riverside Environmental Services Limited |
| Acorn Environment Services Limited |
| Environmental Essentials UK Ltd T/A EE UK Ltd |
| H2O Environmental Services Ltd |
| Aqua B Group/LTM Compliance |
| Bureau Veritas UK Limited |
Hi @mike_asplin,
Thank you for reaching out to the Microsoft Fabric Community Forum. Also, thanks to @Natarajan_M, @ronrsnfld, for those inputs on this thread.
Has your issue been resolved? If the response provided by the community member @Natarajan_M, @ronrsnfld, addressed your query, could you please confirm? It helps us ensure that the solutions provided are effective and beneficial for everyone.
Hope this helps clarify things and let me know what you find after giving these steps a try happy to help you investigate this further.
Thank you for using the Microsoft Community Forum.
Hi @mike_asplin,
Just wanted to follow up. If the shared guidance worked for you, that’s wonderful hopefully it also helps others looking for similar answers. If there’s anything else you'd like to explore or clarify, don’t hesitate to reach out.
Thank you.
Please note that if, while viewing the table in the PQ UI, you "Refresh Preview", the listed companies will change. That probably explains why what you see is different in Excel vs PQ. Possibly a function of the web site as the query refreshes.
Hi @mike_asplin ,
I was able to extract the results from the webpage using the web connector.
Data :
let
StartPage = 1,
EndPage = 100,
PageList = {StartPage..EndPage},
PageTable = Table.FromList(PageList, Splitter.SplitByNothing(), {"PageNumber"}),
AddURLColumn = Table.AddColumn(
PageTable,
"Source URL",
each "https://www.legionellacontrol.org.uk/search/?c=11&q=&r=&p=" & Text.From([PageNumber])
),
fxScrapePage = (url as text) as table =>
let
Source = Web.Contents(url),
ExtractData = Html.Table(
Source,
{
{"Company Name", "a[href*='/company/?name=']"},
{"Profile Link", "a[href*='/company/?name=']", each [Attributes][href]?},
{"Phone", "a[href^='tel:']"},
{"Email", "a[href^='mailto:']", each Text.Replace([Attributes][href], "mailto:", "")},
{"Website", "a:contains('Visit Website')", each [Attributes][href]?}
},
[RowSelector=".row"]
),
CleanCompanyName = Table.ReplaceValue(ExtractData, "MORE INFORMATION", null, Replacer.ReplaceValue, {"Company Name"}),
FinalFilter = Table.SelectRows(CleanCompanyName, each ([Company Name] <> null))
in
FinalFilter,
GetData = Table.AddColumn(
AddURLColumn,
"ScrapedData",
each fxScrapePage([Source URL])
),
ExpandedData = Table.ExpandTableColumn(
GetData,
"ScrapedData",
{"Company Name", "Profile Link", "Phone", "Email", "Website"}
),
FinalData = Table.SelectRows(ExpandedData, each ([Company Name] <> null)),
#"Removed Columns" = Table.RemoveColumns(FinalData,{"PageNumber", "Source URL"})
in
#"Removed Columns"
File : webdata.pbix
Thanks
If this response was helpful in any way, I’d gladly accept a kudo.
Please mark it as the correct solution. It helps other community members find their way faster
For example I found Aquafine (UK) Ltd on page 3 on the website, but its on your p18. Maybe they have a random generator to mix up the reuslts each time you search as companies are not in alphabetical order so as to not favour companies begining with A. Is that a thing on a website?
I searched 3 times and Aquafine (UK) Ltd is on P3 everytime on the website
In your code its not there
Company Name
| Aqualogic Limited |
| Aquate Consultancy Ltd |
| Chlorination and Chemical Cleaning Services Ltd t/a C3 Water |
| Churchill Environmental Services Limited |
| First Environment Limited |
| Frankham Risk Management Services Limited |
| GEM Water Limited |
| General Environmental Services Limited |
| Guardian Hygiene Services Limited |
| Maddison Water Hygiene Limited |
| Maplin Services Limited |
| Nant Limited |
| Normec Latis Scientific Limited |
| Orion Building Engineering Services Ltd |
| TWC (Services) Limited |
Hi @mike_asplin , Just observed the order of the values are not consistent between browsers
I suspect the values will change based on the browser and the PBI desktop webview is based on Microsoft Edge WebView2 (Chromium-based).
In Chrome Personal profile 1:
in chrome personal profile 2 :
In Edge :
the values keep changing everytime we refresh in the powerquery layer.
in pbi :
Thanks
If this response was helpful in any way, I’d gladly accept a kudo.
Please mark it as the correct solution. It helps other community members find their way faster
So the website is randomly sorting the results each time its queried?
I guess that means its impossible to scrape the data as each time you change pages the output is randomised?
Hi @mike_asplin ,
What I meant is that the sorting is inconsistent across different browsers; it varies depending on the profile (the account from which you are accessing it even with in chrome if i use diffrent accounts its giving diffrent results). Additionally, within the Power BI Webview, the order changes dynamically with each refresh, altering every time you refresh the page.
Thanks.
If this response was helpful in any way, I’d gladly accept a kudo.
Please mark it as the correct solution. It helps other community members find their way faster
My question was more why is the url in power query for page 18 producing a different result to putting the same url for page 18 into a web browser. I have never seen this behaviour before as should be the same data. Plus the fact that createing the url in power query is returning duplicate pages for different page numbers.
I appreciate you showing how to extract the html long hand, but actually power query does a good job extracting the fields its just the wrong page.
Your code has the same issue.
If you comapre the comapnies on this url https://www.legionellacontrol.org.uk/search/?c=11&q=&r=&p=18
to your pbix for the same page completely different companies and wierdly different to my model for the same page. its like they are taking the page number and randonly picking a page. What is even wierder is if I scroll through all 22 pages of that url on the website and check the top company on each page none of them are in the list below. Its like the power query url is going somewhere completely different.
Company Name
| Rock Compliance Limited |
| Reigate Environmental Services Ltd |
| Scottish Water Property |
| Water Environmental Treatment Limited |
| GEM Water Limited |
| Aquafine (UK) Ltd |
| Water Logic Works Ltd |
| Rochester Midland Corporation Limited |
| Pure Drop Water Systems Management Ltd |
| Chemical Treatment Services Ltd |
| H2O Cost Reduction Services Ltd t/a H2O Legionella Control Services |
| Pennington Choices Limited |
| H2O Solutions |
| Life Environmental Services Limited |
| HYDROP Environmental Consultancy Services Limited |
If you have recently started exploring Fabric, we'd love to hear how it's going. Your feedback can help with product improvements.
A new Power BI DataViz World Championship is coming this June! Don't miss out on submitting your entry.
Share feedback directly with Fabric product managers, participate in targeted research studies and influence the Fabric roadmap.
| User | Count |
|---|---|
| 6 | |
| 3 | |
| 3 | |
| 3 | |
| 2 |