Get certified in Microsoft Fabric—for free! For a limited time, the Microsoft Fabric Community team will be offering free DP-600 exam vouchers. Prepare now
Hi, I have a column of HTML data that I need to display as text. I need to get rid of all HTML tags and substitute reserved HTML characters. I was trying to recursively remove all HTML tags first. Then I plan to create a table of the most common reserved characters, the values to replace them with, and generate a list to call replace for each row in that table.
I am still new to M but I came across this to replace HTML tags. Except it only removes a single tag, which is not helpful.
But when I try to expand that to a recursive function I'm getting a stack overflow error. I plan to start again to try to learn M, but hoping somebody can help me in the meantime with a quick fix to figure out what I did wrong. Below is my function. Thanks.
let
removeOne = (input) =>
let
text = Text.From(input),
length = Text.Length(text),
position = Text.PositionOf(text, "<"),
positionEnd = Text.PositionOf(text, ">"),
range = positionEnd-position+1,
result = if position >= 0 then Text.ReplaceRange(text, position, range, "") else input
in
result,
removeAll = (input) =>
let
rmvOne = removeOne(input),
rmvAll = if Text.PositionOf(rmvOne, "l") >= 0 then @removeAll(rmvOne) else rmvOne
in
rmvAll,
// Source = get from database,
Source = "<p>hello I am text to replace, lol</p>",
// return = removeAll([columnToPassToFunction])
return = removeAll(Source)
in
return
Solved! Go to Solution.
Just change the "|" into "<"
rmvAll = if Text.PositionOf(rmvOne, "<") >= 0 then @removeAll(rmvOne) else rmvOne
Otherwise you may get unexpected results if the string contains < or > that are not part of a HTML tag.
If you are interested, I can share my code that only removes tag pairs, i.e. </...> strings preceded by the same without /: <...>
Hi, I am using this custom function to remove all HTML tags:
(TableToSearch as table, ColumnToSearch as text) =>
Table.TransformColumns(TableToSearch, {ColumnToSearch, each Table.FirstValue(Html.Table(_ ?? "", {{"text",":root"}}))})
The initial idea came from Chris Webb's blog:
https://blog.crossjoin.co.uk/2019/06/09/removing-html-tags-from-text-in-power-query-power-bi/
Coalescing Operator (??) to prevent errors on nulls was an addition courtesy of Ben Gribaudo.
Here is a different approach that uses the built-in Web.Page function.
Below custom function replace special characters such as or $amp; and additionally parses and combines text in HTML tags.
(Text as any) => let
Source = Text,
Html = Web.Page(Source),
resolve = (t as table) =>
let
Result = List.Accumulate(Table.ToRecords(t), {}, (state, current) => List.Combine({state, if current[Text] <> null then {current[Text]} else if current[Children] <> null then @resolve(current[Children]) else {null} }))
in
Result,
Tables = Html[Data]{0},
Text1 = resolve(Tables),
Text2 = List.Select(Text1, each Text.Length(Text.Trim(_)) > 0),
Text3 = Text.Combine(Text2, Character.FromNumber(10))
in
Text3
I've read through this, and still a bit confused.
I just want to add a new text column based upon an HTML-based column and remove the HTML tags as described. I'm not certain how to add a custom function.
Can this all be done in one step, within the creation of the new column ?
My new column name is: NoteText which is based upon the column NoteHTML
Thank you
Add a function just like you would a standard query.
1. Create a new blank query. Name it "HTML Cleaner".
2. In the Advanced Editor, paste the following:
(HTML as text) =>
let
Source = Text.From(HTML),
SplitAny = Text.SplitAny(Source,"<>"),
ListAlternate = List.Alternate(SplitAny,1,1,1),
ListSelect = List.Select(ListAlternate, each _<>" "),
TextCombine = Text.Combine(ListSelect, " ")
in
TextCombine
3. Close the Editor.
4. In your primary query, open the Advanced Editor.
5. Add the following step:
#"HTML Cleanup" = Table.AddColumn(#"Previous Step", "NoteText", each
if [NoteHTML] = null
then null
else #"HTML Cleaner"([NoteHTML])),
Good luck.
Thank you, but unfortunately, no luck. I get over 4000 errors.
This is the syntax I have:
let
Source = OData.Feed("https://xyzasdf1111.api.crm.dynamics.com/api/data/v9.1", null, [Implementation="2.0"]),
new_arcollectioncalls_table = Source{[Name="new_arcollectioncalls",Signature="table"]}[Data],
#"Expanded new_currentowner" = Table.ExpandRecordColumn(new_arcollectioncalls_table, "new_currentowner", {"fullname"}, {"new_currentowner.fullname"}),
#"Expanded new_BillingMgr" = Table.ExpandRecordColumn(#"Expanded new_currentowner", "new_BillingMgr", {"fullname"}, {"new_BillingMgr.fullname"}),
#"Expanded new_arcollectioncall_Annotations" = Table.ExpandTableColumn(#"Expanded new_BillingMgr", "new_arcollectioncall_Annotations", {"createdon", "_ownerid_value", "notetext"}, {"new_arcollectioncall_Annotations.createdon", "new_arcollectioncall_Annotations._ownerid_value", "new_arcollectioncall_Annotations.notetext"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded new_arcollectioncall_Annotations",{"new_arcollectioncall_Annotations._ownerid_value"}),
#"Split Column by Delimiter" = Table.SplitColumn(Table.TransformColumnTypes(#"Removed Columns", {{"new_arcollectioncall_Annotations.createdon", type text}}, "en-US"), "new_arcollectioncall_Annotations.createdon", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, false), {"new_arcollectioncall_Annotations.createdon.1", "new_arcollectioncall_Annotations.createdon.2"}),
#"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter",{{"new_arcollectioncall_Annotations.createdon.1", type date}, {"new_arcollectioncall_Annotations.createdon.2", type time}}),
#"Removed Columns1" = Table.RemoveColumns(#"Changed Type",{"new_arcollectioncall_Annotations.createdon.2"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Removed Columns1",{{"new_arcollectioncall_Annotations.createdon.1", type date}}),
#"Filtered Rows" = Table.SelectRows(#"Changed Type1", each ([new_status] = true)),
#"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"new_legal", "_new_currentowner_value", "_ownerid_value", "new_collectionnotes", "new_collections", "_createdonbehalfby_value", "new_customer_number", "createdon", "owninguser", "owningteam", "ownerid", "new_arcollectioncall_Annotations.createdon.1", "new_arcollectioncall_Annotations.notetext", "new_BillingMgr.fullname", "new_currentowner.fullname"}),
#"Renamed Columns" = Table.RenameColumns(#"Removed Other Columns",{{"new_arcollectioncall_Annotations.notetext", "NoteHTML"}}),
#"HTML Cleanup" = Table.AddColumn(#"Renamed Columns", "PlainNoteText", each
if [NoteHTML] = null
then null
else #"HTML Cleaner"([NoteHTML]))
in
#"HTML Cleanup"
And the "Blank Query" Definition:
"(HTML as text) =>#(cr)#(lf)let#(cr)#(lf)Source = Text.From(HTML),#(cr)#(lf)SplitAny = Text.SplitAny(Source,""<>""),#(cr)#(lf)ListAlternate = List.Alternate(SplitAny,1,1,1),#(cr)#(lf)ListSelect = List.Select(ListAlternate, each _<>"" ""),#(cr)#(lf)TextCombine = Text.Combine(ListSelect, "" "")#(cr)#(lf)in#(cr)#(lf)TextCombine"
Unfortunately, without seeing the errors, and the data that generates them, I won't be able to help much further. Perhaps you could create a 'dummy' table in the query, to include data rows that generate errors, and rows that don't. Use the following code:
let
Table = #table( type table
[
#"NoteHTML" = text
],
{
{"Data that throws an error"},
{"Data that throws an error"},
{null},
{"Data that doesn't"}
}
),
#"HTML Cleanup" = Table.AddColumn(Table, "PlainNoteText", each
if [NoteHTML] = null
then null
else #"HTML Cleaner"([NoteHTML]), type text)
in
#"HTML Cleanup"
Hello,
I have the same problem as you but i'm not sure where should i copy and paste that function. I have a table with multiple columns and i just need to clean the data of HTML TAGS in one of them could you please explain in more details how you did it
Kind regards,
Nicolas
Just change the "|" into "<"
rmvAll = if Text.PositionOf(rmvOne, "<") >= 0 then @removeAll(rmvOne) else rmvOne
Otherwise you may get unexpected results if the string contains < or > that are not part of a HTML tag.
If you are interested, I can share my code that only removes tag pairs, i.e. </...> strings preceded by the same without /: <...>
Shoot, sorry thanks. I was doing a lot of testing and started with a recursive function for removing 'L'. Will give the full code a shot now. Also, good point about tag pairs. If you don't mind, would love to look through your code to improve this.
Here is my code. It should be called with StartPosition 0. On each iteration, the code examines the string from that startposition for code tag pairs, removes them if found and call the next iteration with a new StartPosition, so the code is also able to handle the situation that the string includes an end tag without corresponding start tag, or </ without corresponding >.
Notice that I also included a mechanism to stop after a maximum number of iterations, which is especiallly useful during development, to prevent endless iterations (until stack overflow).
fnRHTMLT = (String as text, StartPosition as number, optional Iteration as number, optional MaxIterations as number) as text => let StringFromStartposition = Text.RemoveRange(String, 0, StartPosition), StartPositionEndTag = Text.PositionOf(StringFromStartposition, "</"), PositionsEndTag = if StartPositionEndTag = -1 then -1 else Text.PositionOf(Text.RemoveRange(StringFromStartposition, 0, StartPositionEndTag),">"), StartTag = if PositionsEndTag = -1 then null else "<" & Text.Range(StringFromStartposition, StartPositionEndTag + 2, PositionsEndTag - 1), StartPositionStartTag = if PositionsEndTag = -1 then -1 else Text.PositionOf(Text.Start(String,StartPosition + StartPositionEndTag),StartTag,Occurrence.Last), NewString = if StartPositionStartTag = -1 then String else Text.RemoveRange(Text.RemoveRange(String,StartPosition + StartPositionEndTag, PositionsEndTag + 1),StartPositionStartTag, PositionsEndTag), NextStartPosition = if PositionsEndTag = -1 then -1 else if StartPositionStartTag = -1 then StartPosition + StartPositionEndTag + 1 else StartPosition + StartPositionEndTag - PositionsEndTag, Result = if NextStartPosition = -1 then NewString else if Iteration = null then @fnRHTMLT(NewString, NextStartPosition) else if Iteration = MaxIterations then NewString else @fnRHTMLT(NewString, NextStartPosition, Iteration + 1, MaxIterations) in Result
I tried incorporating your code to clean up the HTML in only one of my columns. But I get the following error: "Expresion.Error: 1 arguments were passed to function which expects between 2 and 4.
Details:
Pattern=
Arguments=List"
This is the HTML I am trying to remove from the [Notes] column:
Before:
<div class="ExternalClass4B5FF8D025D448A5904CF90969BB5E7E">Request includes: Status indicator for project costs and job progress, Email notification to supervisors of outstanding/incomplete items.</div><div class="ExternalClass4B5FF8D025D448A5904CF90969BB5E7E">TSR has been submitted.  Waiting on TEC.</div><div class="ExternalClass4B5FF8D025D448A5904CF90969BB5E7E">SWS clean-up and data update in progress in preparation of system modifications.</div><div class="ExternalClass4B5FF8D025D448A5904CF90969BB5E7E">System cleaned up and ready to work with TEC on revamp.<br></div>
After:
Request includes: Status indicator for project costs and job progress, Email notification to supervisors of outstanding/incomplete items.TSR has been submitted.  Waiting on TEC. SWS clean-up and data update in progress in preparation of system modifications. System cleaned up and ready to work with TEC on revamp.
Here is the code:
let Source = SharePoint.Tables("https://orgname.sharepoint.com/sites/department/", [ApiVersion = 15]), #"2be78719-1e12-4827-8f88-d9edd1a7781f" = Source{[Id="2be78719-1e12-4827-8f88-d9edd1a7781f"]}[Items], #"Renamed Columns" = Table.RenameColumns(#"2be78719-1e12-4827-8f88-d9edd1a7781f",{{"ID", "ID.1"}}), #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"End Date", type date}, {"Start Date", type date}, {"Today", type date}, {"End Date (Actu", type date}, {"Created", type date}, {"Modified", type date}, {"EndDateT", type text}, {"Completion", Int64.Type}, {"Id", Int64.Type}, {"ID.1", Int64.Type}, {"EditorId", Int64.Type}}), fnRHTMLT = (String as text, StartPosition as number, optional Iteration as number, optional MaxIterations as number) as text => let StringFromStartposition = Text.RemoveRange(String, 0, StartPosition), StartPositionEndTag = Text.PositionOf(StringFromStartposition, "</"), PositionsEndTag = if StartPositionEndTag = -1 then -1 else Text.PositionOf(Text.RemoveRange(StringFromStartposition, 0, StartPositionEndTag),">"), StartTag = if PositionsEndTag = -1 then null else "<" & Text.Range(StringFromStartposition, StartPositionEndTag + 2, PositionsEndTag - 1), StartPositionStartTag = if PositionsEndTag = -1 then -1 else Text.PositionOf(Text.Start(String,StartPosition + StartPositionEndTag),StartTag,Occurrence.Last), NewString = if StartPositionStartTag = -1 then String else Text.RemoveRange(Text.RemoveRange(String,StartPosition + StartPositionEndTag, PositionsEndTag + 1),StartPositionStartTag, PositionsEndTag), NextStartPosition = if PositionsEndTag = -1 then -1 else if StartPositionStartTag = -1 then StartPosition + StartPositionEndTag + 1 else StartPosition + StartPositionEndTag - PositionsEndTag, Result = if NextStartPosition = -1 then NewString else if Iteration = null then @fnRHTMLT(NewString, NextStartPosition) else if Iteration = MaxIterations then NewString else @fnRHTMLT(NewString, NextStartPosition, Iteration + 1, MaxIterations) in Result, #"RemoveHTML" = Table.TransformColumns(#"Changed Type",{{"Notes", fnRHTMLT, type text}}), in RemoveHTML
This solution is capable of removing tags like
<p> </p>
but not something like
<div class="..."> </div>
An example of function call would be welcome. I don't understand what the meaning of "Iteration".
Is it the amount of tags you want to remove?
My solution to clean HTML tags
let TextFromHtml = (HTML as any) => let Source = if HTML = null then "" else Text.From(HTML), SplitAny = Text.SplitAny(Source,"<>"), ListAlternate = List.Alternate(SplitAny,1,1,1), ListSelect = List.Select(ListAlternate, each _<>""), TextCombine = Text.Combine(ListSelect, "") in TextCombine in TextFromHtml
For beginners: create a blank query an copy paste the code above; rename the query as "TextFromHtml"
Example of calling this function on a column called Comment:
= Table.TransformColumns(#"Previous Step",{{"Comment", TextFromHtml, type text}})
Hello. I tried your query, but its giving me error as below:
Expression.Error: We cannot convert the value "<div>The business ca..." to type Table.
Details:
Value=<div>The business case was delayed, as the proposals were not submitted by the vendors ontime</div>
Type=[Type].
This is the custom function : NoteText= Table.AddColumn(#"Changed Type4", "NoteText", each Table.TransformColumns([#"Notes/Comments"],{{[#"Notes/Comments - Copy"], TextFromHtml, type text}})) . Kindly help as my data is as below which varies with every comment :
<div class="ExternalClassAEC9E21CD9CB4D5099FAB2EBE2BC4A87"><div style="font-family:Calibri, Arial, Helvetica, sans-serif;font-size:11pt;color:rgb(0, 0, 0);">Final sign off confirmation received today from BRM. Attached the confirmation email.</div></div>
<div class="ExternalClass82C655AC166D45B68A0237698583C179"><div style="font-family:Calibri, Arial, Helvetica, sans-serif;font-size:11pt;color:rgb(0, 0, 0);">PQD provided the focal point list yesterday. Meeting to be scheduled as per the Business availability to revalidate the BRD.  </div></div>
Comments are as above for reference. I need the actual text from these comments.
Thanks, |
Thanks Grumelo, your response solved my issue!
Just tried your code, but I got the error "The name 'Iteration' wasn't recognized. Make sure it's spelled correctly." Any suggestions?
Probably you didn't copy my code correctly.
Otherwise: the code must be preceded by an additional "let", and "in fnRHTMLT" must be added at the end:
let <code block: see above> in fnRHTMLT
OK, thanks, that worked. But it doesn't strip all HTML. It cleaned <b>, </b>, <ol>, </ol>, and a few other one- and two-character strings, but it overlooked <br>, </a>, </span>, and long <span...........>, <table...........> and <img........> strings.
I don't see anything in the code that would filter some one- or two-character strings but not others. How can I adapt the code to find strings of any length?
The code doesn't look at the lengths, in other words, lengths don't matter.
The code searches for pairs of tags like <p> and </p>
In general: a start tag without slash and an end tag with slash /.
Any tags that don't come in such pairs are not removed.
You might consider a solution where anything between "<" and ">" is removed (including these delimiters), but the risk is that such characters are removed that are not part of a HTML tag.
You can find an example on Technet.
Hello,
Very new to Power BI and trying to use the solution posted above but I'm not sure which bits to replace and exactly what to replace them with? I have pasted the query into the advanced editor and have tried replacing 'string as text' with my column name but it's not working.
Can you please help by explaining?
Thank you,
Max
Check out the October 2024 Power BI update to learn about new features.
Learn from experts, get hands-on experience, and win awesome prizes.
User | Count |
---|---|
113 | |
91 | |
86 | |
76 | |
66 |
User | Count |
---|---|
149 | |
117 | |
111 | |
106 | |
95 |