Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.
Hi Experts !
I am searching for some tips regarding extracting data from .htm document.
I get reports in .htm files from SAP system and need to extract some amounts for further analysis.
Unfortunately it seems to be very hard to format the data as all columns from .htm document loads to PQ as a single table and each column has a lot of additional spaces in random places, so I can not split to columns even if I try some tricks.
Below I attach a snip what I get after expanding html tables to the deepes level.
In row 1 I got names of columns I want to promote to headers.
Any idea what approach to take to be able to split that into columns like:
CoCd; G/L Acct; Short Text; Crcy; BusA; Balance carryforward etc. ??
I already tried to use "Replace values" and replace all single spaces to "-" (to keep the structure of text where single space should stay) and then replace all double spaces to single space, but it didn't work 😞
Would be grateful for any advice 🙂
Solved! Go to Solution.
Hi @Patryk_PL_92 ,
Here's a query that converts this:
...into this:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WckxOzi/NK1Fwzs8pzc1TgAOXxJJUDMGwxJxSuKghDnEjpVgdoLlOzgr4gJGRvqGRvpGBkRGyqKGBgQFeXUB5sPEurm54jVcwMNE3MMUwX8ECvwWWQKAUGwsA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
columnHeaders = List.RemoveNulls(List.ReplaceValue(Text.Split(Source[Column1]{0}, " "),"",null,Replacer.ReplaceValue)),
remTopRow = Table.Skip(Source,1),
replaceSpace = Table.ReplaceValue(remTopRow," ","<>",Replacer.ReplaceText,{"Column1"}),
replaceGtLt = Table.ReplaceValue(replaceSpace,"><","",Replacer.ReplaceText,{"Column1"}),
splitByLtGt = Table.SplitColumn(replaceGtLt, "Column1", Splitter.SplitTextByDelimiter("<>", QuoteStyle.Csv), List.Transform({1..List.NonNullCount(columnHeaders)}, each Text.From(_))),
renCols = Table.RenameColumns(splitByLtGt, List.Zip({Table.ColumnNames(splitByLtGt), columnHeaders}))
in
renCols
Summary:
columnHeaders = Split out the column names that you want to keep from the source into a list - we'll use this later
remTopRow = Now we have our headers, we can just get rid of the first table row to keep things tidy later
replaceSpace & replaceGtLt = Use the old "<>" => "><" replace trick to get down to a singular instance of a delimiter
splitByLtGt = Split columns by our singular "<>" delimiter into as many columns as we need based on the count of columnHeaders
renCols = Overwrite your new column headers with the values in columnHeaders
This does assume that none of your fields have spaces in them. If they do, then you can adjust the replaceSpace step to look for a double-space " ", and replace with "<><>", but you'll have to trim all your column values before assigning data types to remove any straggling spaces.
Pete
Proud to be a Datanaut!
Hi @Patryk_PL_92 ,
Here's a query that converts this:
...into this:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WckxOzi/NK1Fwzs8pzc1TgAOXxJJUDMGwxJxSuKghDnEjpVgdoLlOzgr4gJGRvqGRvpGBkRGyqKGBgQFeXUB5sPEurm54jVcwMNE3MMUwX8ECvwWWQKAUGwsA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
columnHeaders = List.RemoveNulls(List.ReplaceValue(Text.Split(Source[Column1]{0}, " "),"",null,Replacer.ReplaceValue)),
remTopRow = Table.Skip(Source,1),
replaceSpace = Table.ReplaceValue(remTopRow," ","<>",Replacer.ReplaceText,{"Column1"}),
replaceGtLt = Table.ReplaceValue(replaceSpace,"><","",Replacer.ReplaceText,{"Column1"}),
splitByLtGt = Table.SplitColumn(replaceGtLt, "Column1", Splitter.SplitTextByDelimiter("<>", QuoteStyle.Csv), List.Transform({1..List.NonNullCount(columnHeaders)}, each Text.From(_))),
renCols = Table.RenameColumns(splitByLtGt, List.Zip({Table.ColumnNames(splitByLtGt), columnHeaders}))
in
renCols
Summary:
columnHeaders = Split out the column names that you want to keep from the source into a list - we'll use this later
remTopRow = Now we have our headers, we can just get rid of the first table row to keep things tidy later
replaceSpace & replaceGtLt = Use the old "<>" => "><" replace trick to get down to a singular instance of a delimiter
splitByLtGt = Split columns by our singular "<>" delimiter into as many columns as we need based on the count of columnHeaders
renCols = Overwrite your new column headers with the values in columnHeaders
This does assume that none of your fields have spaces in them. If they do, then you can adjust the replaceSpace step to look for a double-space " ", and replace with "<><>", but you'll have to trim all your column values before assigning data types to remove any straggling spaces.
Pete
Proud to be a Datanaut!
Hi!
thanks for your effort and detailed explanation 🙂
seems that working when you work with .htm documents you need to be very creative and use some tricks to gather any useful information, that set of steps is really nice. It does not solve my problem entirely, but it took me to another stage where I need to do something with data in rows to split into columns correctly
For some reason "replace values" (space to anything) is not working now, so I am stuck, also I found another blockers after that, so I give up on that point and will ask to send me files in other format - like .csv/ .xls /.xml etc.
I do not recommend anybody to work with .htm documents, it is a nightmare 😮
I accept that as a solution, because it solves a part of the problem and it took me futher 🙂
Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City
Check out the April 2024 Power BI update to learn about new features.