Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Find everything you need to get certified on Fabric—skills challenges, live sessions, exam prep, role guidance, and more. Get started

Reply
rootsmusic
Helper I
Helper I

split .html into rows

In Power Query's Advanced Editor, I have the following:

let
    Source = Web.BrowserContents(
        "https://raw.githubusercontent.com/openelections/openelections-sources-pa/master/2018/general/Forest_PA_2018_general.html"),
    #"Extracted Table From Html" = 
        Html.Table(Source, {{"Column1", "BODY"}}),
    #"Added Custom" = Table.AddColumn(#"Extracted Table From Html", "Custom", 
        each Html.Table([Column1], {{"ExtractedText",":root"}})),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom", {"Column1"}),
    #"Expanded Custom" = Table.ExpandTableColumn(#"Removed Columns", "Custom", 
        {"ExtractedText"}, {"ExtractedText"})
in
    #"Expanded Custom"

It imports the source .html file into a single cell.  How can I split the single cell into rows?  Thanks.

1 ACCEPTED SOLUTION
AlexisOlson
Super User
Super User

This is not a friendly format to scrape data from but here's a quick and dirty attempt.

 

let
    Source = Web.BrowserContents(
        "https://raw.githubusercontent.com/openelections/openelections-sources-pa/master/2018/general/Forest_PA_2018_general.html"),
    #"Extracted Table From Html" = 
        Html.Table(Source, {{"Column1", "BODY"}}),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Extracted Table From Html", {{"Column1", Splitter.SplitTextByDelimiter("<HTML>", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Trimmed Text" = Table.TransformColumns(#"Split Column by Delimiter",{{"Column1", Text.Trim, type text}}),
    #"Replaced Value" = Table.ReplaceValue(#"Trimmed Text","#(lf)","|",Replacer.ReplaceText,{"Column1"}),
    #"Added Index" = Table.AddIndexColumn(#"Replaced Value", "Index", 0, 1, Int64.Type),
    #"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Added Index", {{"Column1", Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Filtered Rows" = Table.SelectRows(#"Split Column by Delimiter1", each ([Column1] <> "" and [Column1] <> "                                                      VOTES  PERCENT" and [Column1] <> "          Vote For Not More Than  1" and [Column1] <> "</HTML>" and [Column1] <> "</PRE>" and [Column1] <> "<PRE>")),
    #"Split Column by Delimiter2" = Table.SplitColumn(#"Filtered Rows", "Column1", Splitter.SplitTextByEachDelimiter({"     "}, QuoteStyle.Csv, false), {"Column1.1", "Column1.2"}),
    #"Split Column by Delimiter3" = Table.SplitColumn(#"Split Column by Delimiter2", "Column1.2", Splitter.SplitTextByEachDelimiter({".  .  ."}, QuoteStyle.Csv, false), {"Column1.2.1", "Column1.2.2"}),
    #"Trimmed Text1" = Table.TransformColumns(#"Split Column by Delimiter3",{{"Column1.2.1", each Text.Trim(_, {" ", "."}), type text}, {"Column1.2.2", each Text.Trim(_, {" ", "."}), type text}}),
    #"Split Column by Delimiter4" = Table.SplitColumn(#"Trimmed Text1", "Column1.2.2", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, false), {"Column1.2.2.1", "Column1.2.2.2"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter4",{{"Column1.1", type text}, {"Column1.2.2.1", Int64.Type}, {"Column1.2.2.2", type number}}),
    #"Filtered Rows1" = Table.SelectRows(#"Changed Type", each ([Column1.1] = "")),
    #"Replaced Value1" = Table.ReplaceValue(#"Filtered Rows1","",each if Text.StartsWith([Column1.2.1], "0") then [Column1.2.1] else null,Replacer.ReplaceValue,{"Column1.1"}),
    #"Filled Down" = Table.FillDown(#"Replaced Value1",{"Column1.1"}),
    #"Filtered Rows2" = Table.SelectRows(#"Filled Down", each not Text.StartsWith([Column1.2.1], "0")),
    #"Added Custom" = Table.AddColumn(#"Filtered Rows2", "Custom", each if Text.Contains([Column1.2.1], "(") or [Column1.2.1] = "WRITE-IN" then null else [Column1.2.1], type text),
    #"Filled Down1" = Table.FillDown(#"Added Custom",{"Custom"}),
    #"Filtered Rows3" = Table.SelectRows(#"Filled Down1", each Text.Contains([Custom], "- TOTAL") or [Custom] <> [Column1.2.1])
in
    #"Filtered Rows3"

View solution in original post

3 REPLIES 3
AlexisOlson
Super User
Super User

This is not a friendly format to scrape data from but here's a quick and dirty attempt.

 

let
    Source = Web.BrowserContents(
        "https://raw.githubusercontent.com/openelections/openelections-sources-pa/master/2018/general/Forest_PA_2018_general.html"),
    #"Extracted Table From Html" = 
        Html.Table(Source, {{"Column1", "BODY"}}),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Extracted Table From Html", {{"Column1", Splitter.SplitTextByDelimiter("<HTML>", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Trimmed Text" = Table.TransformColumns(#"Split Column by Delimiter",{{"Column1", Text.Trim, type text}}),
    #"Replaced Value" = Table.ReplaceValue(#"Trimmed Text","#(lf)","|",Replacer.ReplaceText,{"Column1"}),
    #"Added Index" = Table.AddIndexColumn(#"Replaced Value", "Index", 0, 1, Int64.Type),
    #"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Added Index", {{"Column1", Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Filtered Rows" = Table.SelectRows(#"Split Column by Delimiter1", each ([Column1] <> "" and [Column1] <> "                                                      VOTES  PERCENT" and [Column1] <> "          Vote For Not More Than  1" and [Column1] <> "</HTML>" and [Column1] <> "</PRE>" and [Column1] <> "<PRE>")),
    #"Split Column by Delimiter2" = Table.SplitColumn(#"Filtered Rows", "Column1", Splitter.SplitTextByEachDelimiter({"     "}, QuoteStyle.Csv, false), {"Column1.1", "Column1.2"}),
    #"Split Column by Delimiter3" = Table.SplitColumn(#"Split Column by Delimiter2", "Column1.2", Splitter.SplitTextByEachDelimiter({".  .  ."}, QuoteStyle.Csv, false), {"Column1.2.1", "Column1.2.2"}),
    #"Trimmed Text1" = Table.TransformColumns(#"Split Column by Delimiter3",{{"Column1.2.1", each Text.Trim(_, {" ", "."}), type text}, {"Column1.2.2", each Text.Trim(_, {" ", "."}), type text}}),
    #"Split Column by Delimiter4" = Table.SplitColumn(#"Trimmed Text1", "Column1.2.2", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, false), {"Column1.2.2.1", "Column1.2.2.2"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter4",{{"Column1.1", type text}, {"Column1.2.2.1", Int64.Type}, {"Column1.2.2.2", type number}}),
    #"Filtered Rows1" = Table.SelectRows(#"Changed Type", each ([Column1.1] = "")),
    #"Replaced Value1" = Table.ReplaceValue(#"Filtered Rows1","",each if Text.StartsWith([Column1.2.1], "0") then [Column1.2.1] else null,Replacer.ReplaceValue,{"Column1.1"}),
    #"Filled Down" = Table.FillDown(#"Replaced Value1",{"Column1.1"}),
    #"Filtered Rows2" = Table.SelectRows(#"Filled Down", each not Text.StartsWith([Column1.2.1], "0")),
    #"Added Custom" = Table.AddColumn(#"Filtered Rows2", "Custom", each if Text.Contains([Column1.2.1], "(") or [Column1.2.1] = "WRITE-IN" then null else [Column1.2.1], type text),
    #"Filled Down1" = Table.FillDown(#"Added Custom",{"Custom"}),
    #"Filtered Rows3" = Table.SelectRows(#"Filled Down1", each Text.Contains([Custom], "- TOTAL") or [Custom] <> [Column1.2.1])
in
    #"Filtered Rows3"

Thanks @AlexisOlson!  (Your code works, and I'll study it.)  I'm scraping other sources with a similar tabular formatting but with more pages.  Is there a file format to which I can convert the .html source and which will readily parse the text content into tables?  Then I think the parsed tables can be more easily cleaned in Power Query.

It's really dependent on the source. The one you linked to doesn't have any HTML tables, in fact, the HTML tags it has aren't even properly nested. It's basically just a text file in a poor format for processing.

 

If you're scraping from actual HTML tables, then life is much easier.

Helpful resources

Announcements
Fabcon_Europe_Social_Bogo

Europe’s largest Microsoft Fabric Community Conference

Join the community in Stockholm for expert Microsoft Fabric learning including a very exciting keynote from Arun Ulag, Corporate Vice President, Azure Data.

Power BI Carousel June 2024

Power BI Monthly Update - June 2024

Check out the June 2024 Power BI update to learn about new features.

RTI Forums Carousel3

New forum boards available in Real-Time Intelligence.

Ask questions in Eventhouse and KQL, Eventstream, and Reflex.

Top Solution Authors
Top Kudoed Authors