Solved: Re: Working with .htm document

Patryk_PL_92 · ‎01-19-2023

Hi Experts !

I am searching for some tips regarding extracting data from .htm document.

I get reports in .htm files from SAP system and need to extract some amounts for further analysis.

Unfortunately it seems to be very hard to format the data as all columns from .htm document loads to PQ as a single table and each column has a lot of additional spaces in random places, so I can not split to columns even if I try some tricks.

Below I attach a snip what I get after expanding html tables to the deepes level.

In row 1 I got names of columns I want to promote to headers.

Any idea what approach to take to be able to split that into columns like:
CoCd; G/L Acct; Short Text; Crcy; BusA; Balance carryforward etc. ??

I already tried to use "Replace values" and replace all single spaces to "-" (to keep the structure of text where single space should stay) and then replace all double spaces to single space, but it didn't work 😞

Would be grateful for any advice 🙂

BA_Pete · ‎01-19-2023

Hi @Patryk_PL_92 ,

Here's a query that converts this:

...into this:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WckxOzi/NK1Fwzs8pzc1TgAOXxJJUDMGwxJxSuKghDnEjpVgdoLlOzgr4gJGRvqGRvpGBkRGyqKGBgQFeXUB5sPEurm54jVcwMNE3MMUwX8ECvwWWQKAUGwsA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),

    columnHeaders = List.RemoveNulls(List.ReplaceValue(Text.Split(Source[Column1]{0}, "  "),"",null,Replacer.ReplaceValue)),
    remTopRow = Table.Skip(Source,1),
    replaceSpace = Table.ReplaceValue(remTopRow," ","<>",Replacer.ReplaceText,{"Column1"}),
    replaceGtLt = Table.ReplaceValue(replaceSpace,"><","",Replacer.ReplaceText,{"Column1"}),
    splitByLtGt = Table.SplitColumn(replaceGtLt, "Column1", Splitter.SplitTextByDelimiter("<>", QuoteStyle.Csv), List.Transform({1..List.NonNullCount(columnHeaders)}, each Text.From(_))),
    renCols = Table.RenameColumns(splitByLtGt, List.Zip({Table.ColumnNames(splitByLtGt), columnHeaders}))
in
    renCols

Summary:

columnHeaders = Split out the column names that you want to keep from the source into a list - we'll use this later

remTopRow = Now we have our headers, we can just get rid of the first table row to keep things tidy later

replaceSpace & replaceGtLt = Use the old "<>" => "><" replace trick to get down to a singular instance of a delimiter

splitByLtGt = Split columns by our singular "<>" delimiter into as many columns as we need based on the count of columnHeaders

renCols = Overwrite your new column headers with the values in columnHeaders

This does assume that none of your fields have spaces in them. If they do, then you can adjust the replaceSpace step to look for a double-space " ", and replace with "<><>", but you'll have to trim all your column values before assigning data types to remove any straggling spaces.

Pete

Now accepting Kudos! If my post helped you, why not give it a thumbs-up?

Proud to be a Datanaut!

View solution in original post

BA_Pete · ‎01-19-2023

Hi @Patryk_PL_92 ,

Here's a query that converts this:

...into this:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WckxOzi/NK1Fwzs8pzc1TgAOXxJJUDMGwxJxSuKghDnEjpVgdoLlOzgr4gJGRvqGRvpGBkRGyqKGBgQFeXUB5sPEurm54jVcwMNE3MMUwX8ECvwWWQKAUGwsA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),

    columnHeaders = List.RemoveNulls(List.ReplaceValue(Text.Split(Source[Column1]{0}, "  "),"",null,Replacer.ReplaceValue)),
    remTopRow = Table.Skip(Source,1),
    replaceSpace = Table.ReplaceValue(remTopRow," ","<>",Replacer.ReplaceText,{"Column1"}),
    replaceGtLt = Table.ReplaceValue(replaceSpace,"><","",Replacer.ReplaceText,{"Column1"}),
    splitByLtGt = Table.SplitColumn(replaceGtLt, "Column1", Splitter.SplitTextByDelimiter("<>", QuoteStyle.Csv), List.Transform({1..List.NonNullCount(columnHeaders)}, each Text.From(_))),
    renCols = Table.RenameColumns(splitByLtGt, List.Zip({Table.ColumnNames(splitByLtGt), columnHeaders}))
in
    renCols

Summary:

columnHeaders = Split out the column names that you want to keep from the source into a list - we'll use this later

remTopRow = Now we have our headers, we can just get rid of the first table row to keep things tidy later

replaceSpace & replaceGtLt = Use the old "<>" => "><" replace trick to get down to a singular instance of a delimiter

splitByLtGt = Split columns by our singular "<>" delimiter into as many columns as we need based on the count of columnHeaders

renCols = Overwrite your new column headers with the values in columnHeaders

This does assume that none of your fields have spaces in them. If they do, then you can adjust the replaceSpace step to look for a double-space " ", and replace with "<><>", but you'll have to trim all your column values before assigning data types to remove any straggling spaces.

Pete

Now accepting Kudos! If my post helped you, why not give it a thumbs-up?

Proud to be a Datanaut!

Patryk_PL_92 · ‎01-19-2023

Hi!

thanks for your effort and detailed explanation 🙂

seems that working when you work with .htm documents you need to be very creative and use some tricks to gather any useful information, that set of steps is really nice. It does not solve my problem entirely, but it took me to another stage where I need to do something with data in rows to split into columns correctly

For some reason "replace values" (space to anything) is not working now, so I am stuck, also I found another blockers after that, so I give up on that point and will ask to send me files in other format - like .csv/ .xls /.xml etc.

I do not recommend anybody to work with .htm documents, it is a nightmare 😮

I accept that as a solution, because it solves a part of the problem and it took me futher 🙂

Working with .htm document

Helpful resources

Microsoft Fabric Learn Together

Power BI Monthly Update - April 2024

Fabric Community Update - April 2024