Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Be one of the first to start using Fabric Databases. View on-demand sessions with database experts and the Microsoft product team to learn just how easy it is to get started. Watch now

Reply
Anonymous
Not applicable

split .html into rows

In Power Query's Advanced Editor, I have the following:

let
    Source = Web.BrowserContents(
        "https://raw.githubusercontent.com/openelections/openelections-sources-pa/master/2018/general/Forest_PA_2018_general.html"),
    #"Extracted Table From Html" = 
        Html.Table(Source, {{"Column1", "BODY"}}),
    #"Added Custom" = Table.AddColumn(#"Extracted Table From Html", "Custom", 
        each Html.Table([Column1], {{"ExtractedText",":root"}})),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom", {"Column1"}),
    #"Expanded Custom" = Table.ExpandTableColumn(#"Removed Columns", "Custom", 
        {"ExtractedText"}, {"ExtractedText"})
in
    #"Expanded Custom"

It imports the source .html file into a single cell.  How can I split the single cell into rows?  Thanks.

1 ACCEPTED SOLUTION
AlexisOlson
Super User
Super User

This is not a friendly format to scrape data from but here's a quick and dirty attempt.

 

let
    Source = Web.BrowserContents(
        "https://raw.githubusercontent.com/openelections/openelections-sources-pa/master/2018/general/Forest_PA_2018_general.html"),
    #"Extracted Table From Html" = 
        Html.Table(Source, {{"Column1", "BODY"}}),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Extracted Table From Html", {{"Column1", Splitter.SplitTextByDelimiter("<HTML>", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Trimmed Text" = Table.TransformColumns(#"Split Column by Delimiter",{{"Column1", Text.Trim, type text}}),
    #"Replaced Value" = Table.ReplaceValue(#"Trimmed Text","#(lf)","|",Replacer.ReplaceText,{"Column1"}),
    #"Added Index" = Table.AddIndexColumn(#"Replaced Value", "Index", 0, 1, Int64.Type),
    #"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Added Index", {{"Column1", Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Filtered Rows" = Table.SelectRows(#"Split Column by Delimiter1", each ([Column1] <> "" and [Column1] <> "                                                      VOTES  PERCENT" and [Column1] <> "          Vote For Not More Than  1" and [Column1] <> "</HTML>" and [Column1] <> "</PRE>" and [Column1] <> "<PRE>")),
    #"Split Column by Delimiter2" = Table.SplitColumn(#"Filtered Rows", "Column1", Splitter.SplitTextByEachDelimiter({"     "}, QuoteStyle.Csv, false), {"Column1.1", "Column1.2"}),
    #"Split Column by Delimiter3" = Table.SplitColumn(#"Split Column by Delimiter2", "Column1.2", Splitter.SplitTextByEachDelimiter({".  .  ."}, QuoteStyle.Csv, false), {"Column1.2.1", "Column1.2.2"}),
    #"Trimmed Text1" = Table.TransformColumns(#"Split Column by Delimiter3",{{"Column1.2.1", each Text.Trim(_, {" ", "."}), type text}, {"Column1.2.2", each Text.Trim(_, {" ", "."}), type text}}),
    #"Split Column by Delimiter4" = Table.SplitColumn(#"Trimmed Text1", "Column1.2.2", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, false), {"Column1.2.2.1", "Column1.2.2.2"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter4",{{"Column1.1", type text}, {"Column1.2.2.1", Int64.Type}, {"Column1.2.2.2", type number}}),
    #"Filtered Rows1" = Table.SelectRows(#"Changed Type", each ([Column1.1] = "")),
    #"Replaced Value1" = Table.ReplaceValue(#"Filtered Rows1","",each if Text.StartsWith([Column1.2.1], "0") then [Column1.2.1] else null,Replacer.ReplaceValue,{"Column1.1"}),
    #"Filled Down" = Table.FillDown(#"Replaced Value1",{"Column1.1"}),
    #"Filtered Rows2" = Table.SelectRows(#"Filled Down", each not Text.StartsWith([Column1.2.1], "0")),
    #"Added Custom" = Table.AddColumn(#"Filtered Rows2", "Custom", each if Text.Contains([Column1.2.1], "(") or [Column1.2.1] = "WRITE-IN" then null else [Column1.2.1], type text),
    #"Filled Down1" = Table.FillDown(#"Added Custom",{"Custom"}),
    #"Filtered Rows3" = Table.SelectRows(#"Filled Down1", each Text.Contains([Custom], "- TOTAL") or [Custom] <> [Column1.2.1])
in
    #"Filtered Rows3"

View solution in original post

3 REPLIES 3
AlexisOlson
Super User
Super User

This is not a friendly format to scrape data from but here's a quick and dirty attempt.

 

let
    Source = Web.BrowserContents(
        "https://raw.githubusercontent.com/openelections/openelections-sources-pa/master/2018/general/Forest_PA_2018_general.html"),
    #"Extracted Table From Html" = 
        Html.Table(Source, {{"Column1", "BODY"}}),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Extracted Table From Html", {{"Column1", Splitter.SplitTextByDelimiter("<HTML>", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Trimmed Text" = Table.TransformColumns(#"Split Column by Delimiter",{{"Column1", Text.Trim, type text}}),
    #"Replaced Value" = Table.ReplaceValue(#"Trimmed Text","#(lf)","|",Replacer.ReplaceText,{"Column1"}),
    #"Added Index" = Table.AddIndexColumn(#"Replaced Value", "Index", 0, 1, Int64.Type),
    #"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Added Index", {{"Column1", Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Filtered Rows" = Table.SelectRows(#"Split Column by Delimiter1", each ([Column1] <> "" and [Column1] <> "                                                      VOTES  PERCENT" and [Column1] <> "          Vote For Not More Than  1" and [Column1] <> "</HTML>" and [Column1] <> "</PRE>" and [Column1] <> "<PRE>")),
    #"Split Column by Delimiter2" = Table.SplitColumn(#"Filtered Rows", "Column1", Splitter.SplitTextByEachDelimiter({"     "}, QuoteStyle.Csv, false), {"Column1.1", "Column1.2"}),
    #"Split Column by Delimiter3" = Table.SplitColumn(#"Split Column by Delimiter2", "Column1.2", Splitter.SplitTextByEachDelimiter({".  .  ."}, QuoteStyle.Csv, false), {"Column1.2.1", "Column1.2.2"}),
    #"Trimmed Text1" = Table.TransformColumns(#"Split Column by Delimiter3",{{"Column1.2.1", each Text.Trim(_, {" ", "."}), type text}, {"Column1.2.2", each Text.Trim(_, {" ", "."}), type text}}),
    #"Split Column by Delimiter4" = Table.SplitColumn(#"Trimmed Text1", "Column1.2.2", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, false), {"Column1.2.2.1", "Column1.2.2.2"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter4",{{"Column1.1", type text}, {"Column1.2.2.1", Int64.Type}, {"Column1.2.2.2", type number}}),
    #"Filtered Rows1" = Table.SelectRows(#"Changed Type", each ([Column1.1] = "")),
    #"Replaced Value1" = Table.ReplaceValue(#"Filtered Rows1","",each if Text.StartsWith([Column1.2.1], "0") then [Column1.2.1] else null,Replacer.ReplaceValue,{"Column1.1"}),
    #"Filled Down" = Table.FillDown(#"Replaced Value1",{"Column1.1"}),
    #"Filtered Rows2" = Table.SelectRows(#"Filled Down", each not Text.StartsWith([Column1.2.1], "0")),
    #"Added Custom" = Table.AddColumn(#"Filtered Rows2", "Custom", each if Text.Contains([Column1.2.1], "(") or [Column1.2.1] = "WRITE-IN" then null else [Column1.2.1], type text),
    #"Filled Down1" = Table.FillDown(#"Added Custom",{"Custom"}),
    #"Filtered Rows3" = Table.SelectRows(#"Filled Down1", each Text.Contains([Custom], "- TOTAL") or [Custom] <> [Column1.2.1])
in
    #"Filtered Rows3"
Anonymous
Not applicable

Thanks @AlexisOlson!  (Your code works, and I'll study it.)  I'm scraping other sources with a similar tabular formatting but with more pages.  Is there a file format to which I can convert the .html source and which will readily parse the text content into tables?  Then I think the parsed tables can be more easily cleaned in Power Query.

It's really dependent on the source. The one you linked to doesn't have any HTML tables, in fact, the HTML tags it has aren't even properly nested. It's basically just a text file in a poor format for processing.

 

If you're scraping from actual HTML tables, then life is much easier.

Helpful resources

Announcements
Las Vegas 2025

Join us at the Microsoft Fabric Community Conference

March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!

November Carousel

Fabric Community Update - November 2024

Find out what's new and trending in the Fabric Community.

Dec Fabric Community Survey

We want your feedback!

Your insights matter. That’s why we created a quick survey to learn about your experience finding answers to technical questions.