Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
Patryk_PL_92
Helper I
Helper I

Working with .htm document

Hi Experts !

I am searching for some tips regarding extracting data from .htm document.

I get reports in .htm files from SAP system and need to extract some amounts for further analysis.

 

Unfortunately it seems to be very hard to format the data as all columns from .htm document loads to PQ as a single table and each column has a lot of additional spaces in random places, so I can not split to columns even if I try some tricks.

 

Below I attach a snip what I get after expanding html tables to the deepes level.

 

In row 1 I got names of columns I want to promote to headers.

 

Any idea what approach to take to be able to split that into columns like:
CoCd; G/L Acct; Short Text; Crcy; BusA; Balance carryforward etc. ??

 

I already tried to use "Replace values" and replace all single spaces to "-" (to keep the structure of text where single space should stay) and then replace all double spaces to single space, but it didn't work 😞

 

Would be grateful for any advice 🙂

 

Patryk_PL_92_0-1674115497914.png

Patryk_PL_92_1-1674115671448.png

 

1 ACCEPTED SOLUTION
BA_Pete
Super User
Super User

Hi @Patryk_PL_92 ,

 

Here's a query that converts this:

BA_Pete_0-1674138913124.png

 

...into this:

BA_Pete_1-1674138945958.png

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WckxOzi/NK1Fwzs8pzc1TgAOXxJJUDMGwxJxSuKghDnEjpVgdoLlOzgr4gJGRvqGRvpGBkRGyqKGBgQFeXUB5sPEurm54jVcwMNE3MMUwX8ECvwWWQKAUGwsA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),

    columnHeaders = List.RemoveNulls(List.ReplaceValue(Text.Split(Source[Column1]{0}, "  "),"",null,Replacer.ReplaceValue)),
    remTopRow = Table.Skip(Source,1),
    replaceSpace = Table.ReplaceValue(remTopRow," ","<>",Replacer.ReplaceText,{"Column1"}),
    replaceGtLt = Table.ReplaceValue(replaceSpace,"><","",Replacer.ReplaceText,{"Column1"}),
    splitByLtGt = Table.SplitColumn(replaceGtLt, "Column1", Splitter.SplitTextByDelimiter("<>", QuoteStyle.Csv), List.Transform({1..List.NonNullCount(columnHeaders)}, each Text.From(_))),
    renCols = Table.RenameColumns(splitByLtGt, List.Zip({Table.ColumnNames(splitByLtGt), columnHeaders}))
in
    renCols

 

Summary:

columnHeaders = Split out the column names that you want to keep from the source into a list - we'll use this later

remTopRow = Now we have our headers, we can just get rid of the first table row to keep things tidy later

replaceSpace & replaceGtLt = Use the old "<>" => "><" replace trick to get down to a singular instance of a delimiter

splitByLtGt = Split columns by our singular "<>" delimiter into as many columns as we need based on the count of columnHeaders

renCols = Overwrite your new column headers with the values in columnHeaders

 

This does assume that none of your fields have spaces in them. If they do, then you can adjust the replaceSpace step to look for a double-space "  ",  and replace with "<><>", but you'll have to trim all your column values before assigning data types to remove any straggling spaces.

 

Pete



Now accepting Kudos! If my post helped you, why not give it a thumbs-up?

Proud to be a Datanaut!




View solution in original post

2 REPLIES 2
BA_Pete
Super User
Super User

Hi @Patryk_PL_92 ,

 

Here's a query that converts this:

BA_Pete_0-1674138913124.png

 

...into this:

BA_Pete_1-1674138945958.png

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WckxOzi/NK1Fwzs8pzc1TgAOXxJJUDMGwxJxSuKghDnEjpVgdoLlOzgr4gJGRvqGRvpGBkRGyqKGBgQFeXUB5sPEurm54jVcwMNE3MMUwX8ECvwWWQKAUGwsA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),

    columnHeaders = List.RemoveNulls(List.ReplaceValue(Text.Split(Source[Column1]{0}, "  "),"",null,Replacer.ReplaceValue)),
    remTopRow = Table.Skip(Source,1),
    replaceSpace = Table.ReplaceValue(remTopRow," ","<>",Replacer.ReplaceText,{"Column1"}),
    replaceGtLt = Table.ReplaceValue(replaceSpace,"><","",Replacer.ReplaceText,{"Column1"}),
    splitByLtGt = Table.SplitColumn(replaceGtLt, "Column1", Splitter.SplitTextByDelimiter("<>", QuoteStyle.Csv), List.Transform({1..List.NonNullCount(columnHeaders)}, each Text.From(_))),
    renCols = Table.RenameColumns(splitByLtGt, List.Zip({Table.ColumnNames(splitByLtGt), columnHeaders}))
in
    renCols

 

Summary:

columnHeaders = Split out the column names that you want to keep from the source into a list - we'll use this later

remTopRow = Now we have our headers, we can just get rid of the first table row to keep things tidy later

replaceSpace & replaceGtLt = Use the old "<>" => "><" replace trick to get down to a singular instance of a delimiter

splitByLtGt = Split columns by our singular "<>" delimiter into as many columns as we need based on the count of columnHeaders

renCols = Overwrite your new column headers with the values in columnHeaders

 

This does assume that none of your fields have spaces in them. If they do, then you can adjust the replaceSpace step to look for a double-space "  ",  and replace with "<><>", but you'll have to trim all your column values before assigning data types to remove any straggling spaces.

 

Pete



Now accepting Kudos! If my post helped you, why not give it a thumbs-up?

Proud to be a Datanaut!




Hi!

thanks for your effort and detailed explanation 🙂

seems that working when you work with .htm documents you need to be very creative and use some tricks to gather any useful information, that set of steps is really nice. It does not solve my problem entirely, but it took me to another stage where I need to do something with data in rows to split into columns correctly

 

For some reason "replace values" (space to anything) is not working now, so I am stuck, also I found another blockers after that, so I give up on that point and will ask to send me files in other format - like .csv/ .xls /.xml etc.

 

I do not recommend anybody to work with .htm documents, it is a nightmare 😮

 

I accept that as a solution, because it solves a part of the problem and it took me futher 🙂

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors
Top Kudoed Authors