Fabric is Generally Available. Browse Fabric Presentations. Work towards your Fabric certification with the Cloud Skills Challenge.
I am looking for a way to extract hex color codes from the code of a webpage. I have a number of websites that we manage and would like to scrape the site to extract their color schemes for marketing research purposes.
Hex color codes are always #CCC or #CCCCCC, example:
color: #FFF;
background: #57C2BD;
So if I can import the html for the site, I am fairly confident I can extract these color codes. The issue I am having is finding a way to get PowerBI to import the raw html of the page versus the document/table format which is cumbersom and only appears to let me get to the visible text on the page.
Any suggestions would be appreciated.
Solved! Go to Solution.
A couple of ways.
Something like this with but it's not pretty and doesn't always get the right results depending on the page.
let
Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://www.microsoft.com/en-us/"))}),
#"Filtered Rows" = Table.SelectRows(Source, each Text.Contains([Column1], "http")),
#"Filtered Rows1" = Table.SelectRows(#"Filtered Rows", each Text.Contains([Column1], "href")),
#"Replaced Value" = Table.ReplaceValue(#"Filtered Rows1","href","~href",Replacer.ReplaceText,{"Column1"}),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Replaced Value", {{"Column1", Splitter.SplitTextByEachDelimiter({"~"}, QuoteStyle.None, false), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
#"Filtered Rows2" = Table.SelectRows(#"Split Column by Delimiter", each Text.Contains([Column1], "href")),
#"Split Column by Delimiter1" = Table.SplitColumn(#"Filtered Rows2", "Column1", Splitter.SplitTextByEachDelimiter({""""}, QuoteStyle.None, false), {"Column1.1", "Column1.2"}),
#"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter1",{{"Column1.1", type text}, {"Column1.2", type text}}),
#"Split Column by Delimiter2" = Table.SplitColumn(#"Changed Type", "Column1.2", Splitter.SplitTextByEachDelimiter({""""}, QuoteStyle.None, true), {"Column1.2.1", "Column1.2.2"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter2",{{"Column1.2.1", type text}, {"Column1.2.2", type text}}),
#"Renamed Columns" = Table.RenameColumns(#"Changed Type1",{{"Column1.2.1", "Url"}}),
#"Filtered Rows3" = Table.SelectRows(#"Renamed Columns", each Text.StartsWith([Url], "http")),
#"Removed Columns" = Table.RemoveColumns(#"Filtered Rows3",{"Column1.1", "Column1.2.2"})
in
#"Removed Columns"
a more reliable way would be to use r and a webscraping library
library(rvest) page=read_html("WEBSITE") data <- as.data.frame(html_attr(html_nodes(page, "a"), "href"))
If you have firefox there is an eyedropper in the developer tools that shows the hex value when you hover over.
https://developer.mozilla.org/en-US/docs/Tools/Eyedropper
Chrome dev tools have a similar tool, but its not as easy to get to.
https://paul.kinlan.me/eyedropper-chrome-dev-tools/
I aware of these tools, however we need a way to programatically extract the hex color codes. We have a large number of websites we need to scrape.
OK.
Something like this should work. I used this technique it to strip out links form a web site before.
From here you can split and filter your html source code.
let Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://www.microsoft.com/en-us/"))}) in Source
Thanks, curious to know how you extract the URLs exactly.
A couple of ways.
Something like this with but it's not pretty and doesn't always get the right results depending on the page.
let
Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://www.microsoft.com/en-us/"))}),
#"Filtered Rows" = Table.SelectRows(Source, each Text.Contains([Column1], "http")),
#"Filtered Rows1" = Table.SelectRows(#"Filtered Rows", each Text.Contains([Column1], "href")),
#"Replaced Value" = Table.ReplaceValue(#"Filtered Rows1","href","~href",Replacer.ReplaceText,{"Column1"}),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Replaced Value", {{"Column1", Splitter.SplitTextByEachDelimiter({"~"}, QuoteStyle.None, false), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
#"Filtered Rows2" = Table.SelectRows(#"Split Column by Delimiter", each Text.Contains([Column1], "href")),
#"Split Column by Delimiter1" = Table.SplitColumn(#"Filtered Rows2", "Column1", Splitter.SplitTextByEachDelimiter({""""}, QuoteStyle.None, false), {"Column1.1", "Column1.2"}),
#"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter1",{{"Column1.1", type text}, {"Column1.2", type text}}),
#"Split Column by Delimiter2" = Table.SplitColumn(#"Changed Type", "Column1.2", Splitter.SplitTextByEachDelimiter({""""}, QuoteStyle.None, true), {"Column1.2.1", "Column1.2.2"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter2",{{"Column1.2.1", type text}, {"Column1.2.2", type text}}),
#"Renamed Columns" = Table.RenameColumns(#"Changed Type1",{{"Column1.2.1", "Url"}}),
#"Filtered Rows3" = Table.SelectRows(#"Renamed Columns", each Text.StartsWith([Url], "http")),
#"Removed Columns" = Table.RemoveColumns(#"Filtered Rows3",{"Column1.1", "Column1.2.2"})
in
#"Removed Columns"
a more reliable way would be to use r and a webscraping library
library(rvest) page=read_html("WEBSITE") data <- as.data.frame(html_attr(html_nodes(page, "a"), "href"))
Check out the November 2023 Power BI update to learn about new features.
Read the latest Fabric Community announcements, including updates on Power BI, Synapse, Data Factory and Data Activator.
Join us for a free, hands-on Microsoft workshop led by women trainers for women where you will learn how to build a Dashboard in a Day!