cancel
Showing results for 
Search instead for 
Did you mean: 

Fabric is Generally Available. Browse Fabric Presentations. Work towards your Fabric certification with the Cloud Skills Challenge.

Reply
rootsmusic
Helper I
Helper I

split .html into rows

In Power Query's Advanced Editor, I have the following:

let
    Source = Web.BrowserContents(
        "https://raw.githubusercontent.com/openelections/openelections-sources-pa/master/2018/general/Forest_PA_2018_general.html"),
    #"Extracted Table From Html" = 
        Html.Table(Source, {{"Column1", "BODY"}}),
    #"Added Custom" = Table.AddColumn(#"Extracted Table From Html", "Custom", 
        each Html.Table([Column1], {{"ExtractedText",":root"}})),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom", {"Column1"}),
    #"Expanded Custom" = Table.ExpandTableColumn(#"Removed Columns", "Custom", 
        {"ExtractedText"}, {"ExtractedText"})
in
    #"Expanded Custom"

It imports the source .html file into a single cell.  How can I split the single cell into rows?  Thanks.

1 ACCEPTED SOLUTION
AlexisOlson
Super User
Super User

This is not a friendly format to scrape data from but here's a quick and dirty attempt.

 

let
    Source = Web.BrowserContents(
        "https://raw.githubusercontent.com/openelections/openelections-sources-pa/master/2018/general/Forest_PA_2018_general.html"),
    #"Extracted Table From Html" = 
        Html.Table(Source, {{"Column1", "BODY"}}),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Extracted Table From Html", {{"Column1", Splitter.SplitTextByDelimiter("<HTML>", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Trimmed Text" = Table.TransformColumns(#"Split Column by Delimiter",{{"Column1", Text.Trim, type text}}),
    #"Replaced Value" = Table.ReplaceValue(#"Trimmed Text","#(lf)","|",Replacer.ReplaceText,{"Column1"}),
    #"Added Index" = Table.AddIndexColumn(#"Replaced Value", "Index", 0, 1, Int64.Type),
    #"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Added Index", {{"Column1", Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Filtered Rows" = Table.SelectRows(#"Split Column by Delimiter1", each ([Column1] <> "" and [Column1] <> "                                                      VOTES  PERCENT" and [Column1] <> "          Vote For Not More Than  1" and [Column1] <> "</HTML>" and [Column1] <> "</PRE>" and [Column1] <> "<PRE>")),
    #"Split Column by Delimiter2" = Table.SplitColumn(#"Filtered Rows", "Column1", Splitter.SplitTextByEachDelimiter({"     "}, QuoteStyle.Csv, false), {"Column1.1", "Column1.2"}),
    #"Split Column by Delimiter3" = Table.SplitColumn(#"Split Column by Delimiter2", "Column1.2", Splitter.SplitTextByEachDelimiter({".  .  ."}, QuoteStyle.Csv, false), {"Column1.2.1", "Column1.2.2"}),
    #"Trimmed Text1" = Table.TransformColumns(#"Split Column by Delimiter3",{{"Column1.2.1", each Text.Trim(_, {" ", "."}), type text}, {"Column1.2.2", each Text.Trim(_, {" ", "."}), type text}}),
    #"Split Column by Delimiter4" = Table.SplitColumn(#"Trimmed Text1", "Column1.2.2", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, false), {"Column1.2.2.1", "Column1.2.2.2"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter4",{{"Column1.1", type text}, {"Column1.2.2.1", Int64.Type}, {"Column1.2.2.2", type number}}),
    #"Filtered Rows1" = Table.SelectRows(#"Changed Type", each ([Column1.1] = "")),
    #"Replaced Value1" = Table.ReplaceValue(#"Filtered Rows1","",each if Text.StartsWith([Column1.2.1], "0") then [Column1.2.1] else null,Replacer.ReplaceValue,{"Column1.1"}),
    #"Filled Down" = Table.FillDown(#"Replaced Value1",{"Column1.1"}),
    #"Filtered Rows2" = Table.SelectRows(#"Filled Down", each not Text.StartsWith([Column1.2.1], "0")),
    #"Added Custom" = Table.AddColumn(#"Filtered Rows2", "Custom", each if Text.Contains([Column1.2.1], "(") or [Column1.2.1] = "WRITE-IN" then null else [Column1.2.1], type text),
    #"Filled Down1" = Table.FillDown(#"Added Custom",{"Custom"}),
    #"Filtered Rows3" = Table.SelectRows(#"Filled Down1", each Text.Contains([Custom], "- TOTAL") or [Custom] <> [Column1.2.1])
in
    #"Filtered Rows3"

View solution in original post

3 REPLIES 3
AlexisOlson
Super User
Super User

This is not a friendly format to scrape data from but here's a quick and dirty attempt.

 

let
    Source = Web.BrowserContents(
        "https://raw.githubusercontent.com/openelections/openelections-sources-pa/master/2018/general/Forest_PA_2018_general.html"),
    #"Extracted Table From Html" = 
        Html.Table(Source, {{"Column1", "BODY"}}),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Extracted Table From Html", {{"Column1", Splitter.SplitTextByDelimiter("<HTML>", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Trimmed Text" = Table.TransformColumns(#"Split Column by Delimiter",{{"Column1", Text.Trim, type text}}),
    #"Replaced Value" = Table.ReplaceValue(#"Trimmed Text","#(lf)","|",Replacer.ReplaceText,{"Column1"}),
    #"Added Index" = Table.AddIndexColumn(#"Replaced Value", "Index", 0, 1, Int64.Type),
    #"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Added Index", {{"Column1", Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Filtered Rows" = Table.SelectRows(#"Split Column by Delimiter1", each ([Column1] <> "" and [Column1] <> "                                                      VOTES  PERCENT" and [Column1] <> "          Vote For Not More Than  1" and [Column1] <> "</HTML>" and [Column1] <> "</PRE>" and [Column1] <> "<PRE>")),
    #"Split Column by Delimiter2" = Table.SplitColumn(#"Filtered Rows", "Column1", Splitter.SplitTextByEachDelimiter({"     "}, QuoteStyle.Csv, false), {"Column1.1", "Column1.2"}),
    #"Split Column by Delimiter3" = Table.SplitColumn(#"Split Column by Delimiter2", "Column1.2", Splitter.SplitTextByEachDelimiter({".  .  ."}, QuoteStyle.Csv, false), {"Column1.2.1", "Column1.2.2"}),
    #"Trimmed Text1" = Table.TransformColumns(#"Split Column by Delimiter3",{{"Column1.2.1", each Text.Trim(_, {" ", "."}), type text}, {"Column1.2.2", each Text.Trim(_, {" ", "."}), type text}}),
    #"Split Column by Delimiter4" = Table.SplitColumn(#"Trimmed Text1", "Column1.2.2", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, false), {"Column1.2.2.1", "Column1.2.2.2"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter4",{{"Column1.1", type text}, {"Column1.2.2.1", Int64.Type}, {"Column1.2.2.2", type number}}),
    #"Filtered Rows1" = Table.SelectRows(#"Changed Type", each ([Column1.1] = "")),
    #"Replaced Value1" = Table.ReplaceValue(#"Filtered Rows1","",each if Text.StartsWith([Column1.2.1], "0") then [Column1.2.1] else null,Replacer.ReplaceValue,{"Column1.1"}),
    #"Filled Down" = Table.FillDown(#"Replaced Value1",{"Column1.1"}),
    #"Filtered Rows2" = Table.SelectRows(#"Filled Down", each not Text.StartsWith([Column1.2.1], "0")),
    #"Added Custom" = Table.AddColumn(#"Filtered Rows2", "Custom", each if Text.Contains([Column1.2.1], "(") or [Column1.2.1] = "WRITE-IN" then null else [Column1.2.1], type text),
    #"Filled Down1" = Table.FillDown(#"Added Custom",{"Custom"}),
    #"Filtered Rows3" = Table.SelectRows(#"Filled Down1", each Text.Contains([Custom], "- TOTAL") or [Custom] <> [Column1.2.1])
in
    #"Filtered Rows3"

Thanks @AlexisOlson!  (Your code works, and I'll study it.)  I'm scraping other sources with a similar tabular formatting but with more pages.  Is there a file format to which I can convert the .html source and which will readily parse the text content into tables?  Then I think the parsed tables can be more easily cleaned in Power Query.

It's really dependent on the source. The one you linked to doesn't have any HTML tables, in fact, the HTML tags it has aren't even properly nested. It's basically just a text file in a poor format for processing.

 

If you're scraping from actual HTML tables, then life is much easier.

Helpful resources

Announcements
PBI November 2023 Update Carousel

Power BI Monthly Update - November 2023

Check out the November 2023 Power BI update to learn about new features.

Community News

Fabric Community News unified experience

Read the latest Fabric Community announcements, including updates on Power BI, Synapse, Data Factory and Data Activator.

Power BI Fabric Summit Carousel

The largest Power BI and Fabric virtual conference

130+ sessions, 130+ speakers, Product managers, MVPs, and experts. All about Power BI and Fabric. Attend online or watch the recordings.