The ultimate Microsoft Fabric, Power BI, Azure AI, and SQL learning event: Join us in Stockholm, September 24-27, 2024.
Save €200 with code MSCUST on top of early bird pricing!
Find everything you need to get certified on Fabric—skills challenges, live sessions, exam prep, role guidance, and more. Get started
In Power Query's Advanced Editor, I have the following:
let
Source = Web.BrowserContents(
"https://raw.githubusercontent.com/openelections/openelections-sources-pa/master/2018/general/Forest_PA_2018_general.html"),
#"Extracted Table From Html" =
Html.Table(Source, {{"Column1", "BODY"}}),
#"Added Custom" = Table.AddColumn(#"Extracted Table From Html", "Custom",
each Html.Table([Column1], {{"ExtractedText",":root"}})),
#"Removed Columns" = Table.RemoveColumns(#"Added Custom", {"Column1"}),
#"Expanded Custom" = Table.ExpandTableColumn(#"Removed Columns", "Custom",
{"ExtractedText"}, {"ExtractedText"})
in
#"Expanded Custom"
It imports the source .html file into a single cell. How can I split the single cell into rows? Thanks.
Solved! Go to Solution.
This is not a friendly format to scrape data from but here's a quick and dirty attempt.
let
Source = Web.BrowserContents(
"https://raw.githubusercontent.com/openelections/openelections-sources-pa/master/2018/general/Forest_PA_2018_general.html"),
#"Extracted Table From Html" =
Html.Table(Source, {{"Column1", "BODY"}}),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Extracted Table From Html", {{"Column1", Splitter.SplitTextByDelimiter("<HTML>", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
#"Trimmed Text" = Table.TransformColumns(#"Split Column by Delimiter",{{"Column1", Text.Trim, type text}}),
#"Replaced Value" = Table.ReplaceValue(#"Trimmed Text","#(lf)","|",Replacer.ReplaceText,{"Column1"}),
#"Added Index" = Table.AddIndexColumn(#"Replaced Value", "Index", 0, 1, Int64.Type),
#"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Added Index", {{"Column1", Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
#"Filtered Rows" = Table.SelectRows(#"Split Column by Delimiter1", each ([Column1] <> "" and [Column1] <> " VOTES PERCENT" and [Column1] <> " Vote For Not More Than 1" and [Column1] <> "</HTML>" and [Column1] <> "</PRE>" and [Column1] <> "<PRE>")),
#"Split Column by Delimiter2" = Table.SplitColumn(#"Filtered Rows", "Column1", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, false), {"Column1.1", "Column1.2"}),
#"Split Column by Delimiter3" = Table.SplitColumn(#"Split Column by Delimiter2", "Column1.2", Splitter.SplitTextByEachDelimiter({". . ."}, QuoteStyle.Csv, false), {"Column1.2.1", "Column1.2.2"}),
#"Trimmed Text1" = Table.TransformColumns(#"Split Column by Delimiter3",{{"Column1.2.1", each Text.Trim(_, {" ", "."}), type text}, {"Column1.2.2", each Text.Trim(_, {" ", "."}), type text}}),
#"Split Column by Delimiter4" = Table.SplitColumn(#"Trimmed Text1", "Column1.2.2", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, false), {"Column1.2.2.1", "Column1.2.2.2"}),
#"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter4",{{"Column1.1", type text}, {"Column1.2.2.1", Int64.Type}, {"Column1.2.2.2", type number}}),
#"Filtered Rows1" = Table.SelectRows(#"Changed Type", each ([Column1.1] = "")),
#"Replaced Value1" = Table.ReplaceValue(#"Filtered Rows1","",each if Text.StartsWith([Column1.2.1], "0") then [Column1.2.1] else null,Replacer.ReplaceValue,{"Column1.1"}),
#"Filled Down" = Table.FillDown(#"Replaced Value1",{"Column1.1"}),
#"Filtered Rows2" = Table.SelectRows(#"Filled Down", each not Text.StartsWith([Column1.2.1], "0")),
#"Added Custom" = Table.AddColumn(#"Filtered Rows2", "Custom", each if Text.Contains([Column1.2.1], "(") or [Column1.2.1] = "WRITE-IN" then null else [Column1.2.1], type text),
#"Filled Down1" = Table.FillDown(#"Added Custom",{"Custom"}),
#"Filtered Rows3" = Table.SelectRows(#"Filled Down1", each Text.Contains([Custom], "- TOTAL") or [Custom] <> [Column1.2.1])
in
#"Filtered Rows3"
This is not a friendly format to scrape data from but here's a quick and dirty attempt.
let
Source = Web.BrowserContents(
"https://raw.githubusercontent.com/openelections/openelections-sources-pa/master/2018/general/Forest_PA_2018_general.html"),
#"Extracted Table From Html" =
Html.Table(Source, {{"Column1", "BODY"}}),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Extracted Table From Html", {{"Column1", Splitter.SplitTextByDelimiter("<HTML>", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
#"Trimmed Text" = Table.TransformColumns(#"Split Column by Delimiter",{{"Column1", Text.Trim, type text}}),
#"Replaced Value" = Table.ReplaceValue(#"Trimmed Text","#(lf)","|",Replacer.ReplaceText,{"Column1"}),
#"Added Index" = Table.AddIndexColumn(#"Replaced Value", "Index", 0, 1, Int64.Type),
#"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Added Index", {{"Column1", Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
#"Filtered Rows" = Table.SelectRows(#"Split Column by Delimiter1", each ([Column1] <> "" and [Column1] <> " VOTES PERCENT" and [Column1] <> " Vote For Not More Than 1" and [Column1] <> "</HTML>" and [Column1] <> "</PRE>" and [Column1] <> "<PRE>")),
#"Split Column by Delimiter2" = Table.SplitColumn(#"Filtered Rows", "Column1", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, false), {"Column1.1", "Column1.2"}),
#"Split Column by Delimiter3" = Table.SplitColumn(#"Split Column by Delimiter2", "Column1.2", Splitter.SplitTextByEachDelimiter({". . ."}, QuoteStyle.Csv, false), {"Column1.2.1", "Column1.2.2"}),
#"Trimmed Text1" = Table.TransformColumns(#"Split Column by Delimiter3",{{"Column1.2.1", each Text.Trim(_, {" ", "."}), type text}, {"Column1.2.2", each Text.Trim(_, {" ", "."}), type text}}),
#"Split Column by Delimiter4" = Table.SplitColumn(#"Trimmed Text1", "Column1.2.2", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, false), {"Column1.2.2.1", "Column1.2.2.2"}),
#"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter4",{{"Column1.1", type text}, {"Column1.2.2.1", Int64.Type}, {"Column1.2.2.2", type number}}),
#"Filtered Rows1" = Table.SelectRows(#"Changed Type", each ([Column1.1] = "")),
#"Replaced Value1" = Table.ReplaceValue(#"Filtered Rows1","",each if Text.StartsWith([Column1.2.1], "0") then [Column1.2.1] else null,Replacer.ReplaceValue,{"Column1.1"}),
#"Filled Down" = Table.FillDown(#"Replaced Value1",{"Column1.1"}),
#"Filtered Rows2" = Table.SelectRows(#"Filled Down", each not Text.StartsWith([Column1.2.1], "0")),
#"Added Custom" = Table.AddColumn(#"Filtered Rows2", "Custom", each if Text.Contains([Column1.2.1], "(") or [Column1.2.1] = "WRITE-IN" then null else [Column1.2.1], type text),
#"Filled Down1" = Table.FillDown(#"Added Custom",{"Custom"}),
#"Filtered Rows3" = Table.SelectRows(#"Filled Down1", each Text.Contains([Custom], "- TOTAL") or [Custom] <> [Column1.2.1])
in
#"Filtered Rows3"
Thanks @AlexisOlson! (Your code works, and I'll study it.) I'm scraping other sources with a similar tabular formatting but with more pages. Is there a file format to which I can convert the .html source and which will readily parse the text content into tables? Then I think the parsed tables can be more easily cleaned in Power Query.
It's really dependent on the source. The one you linked to doesn't have any HTML tables, in fact, the HTML tags it has aren't even properly nested. It's basically just a text file in a poor format for processing.
If you're scraping from actual HTML tables, then life is much easier.
Join the community in Stockholm for expert Microsoft Fabric learning including a very exciting keynote from Arun Ulag, Corporate Vice President, Azure Data.
Check out the August 2024 Power BI update to learn about new features.