Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
andris_
Resolver I
Resolver I

Power Query: remove all text between delimiters

Hey all,

 

I'm creating a report, which has a SharePoint list datasource. It has like ~5 columns, and it contains memorandums about meetings. There is a description column, which contains the memos of the meetings, like "this and this happened, these are the actions to be made etc". It has paragraphs, and it causes problems. Here's a little sample data below:

query m.PNG

So, basically, I'd like to remove all the HTML elments (paragraphs, line breaks and all other elements). They are all between the "<" and ">" delimiters, and I only need the TEXTs. I found the Text.BetweenDelimiters/AfterDelimiter/BeforeDelimiter functions in M, but I didn't know how to use them in an iterative way.

 

Anyone has any idea how to deal with it?

Regards,

Andris

1 ACCEPTED SOLUTION
Stachu
Community Champion
Community Champion

you can use recursion in PowerQuery
crate a blank query and paste this code into editor, then call this function on your HTML column

(txt as text) =>
[ 
fnRemoveFirstTag = (HTML as text)=>
    let
        OpeningTag = Text.PositionOf(HTML,"<"),
        ClosingTag = Text.PositionOf(HTML,">"),
        Output = 
            if OpeningTag = -1 
            then HTML
            else Text.RemoveRange(HTML,OpeningTag,ClosingTag-OpeningTag+1)
    in
        Output,
fnRemoveHTMLTags = (y as text)=>
    if fnRemoveFirstTag(y) = y
    then y 
    else @fnRemoveHTMLTags(fnRemoveFirstTag(y)),
Output = @fnRemoveHTMLTags(txt) 
][Output]

EDIT - typo in syntax



Did I answer your question? Mark my post as a solution!
Thank you for the kudos 🙂

View solution in original post

16 REPLIES 16
chrimaho
Frequent Visitor

There is an even simpler solution.

You can create a new function called fun_ReplaceTextBetweenDelimiters, and in it add this code 👇

let
    fun_ReplaceTextBetweenDelimiters = (Text as text, StartDelimiter as text, EndDelimiter as text, optional ReplaceDelimiters as nullable logical, optional NewText as nullable text, optional TrimResult as nullable logical, optional FixDoubleSpaces as nullable logical) as text => 
        let
            // Add Default Parameters
            Default_ReplaceDelimiters = if ReplaceDelimiters is null then true else ReplaceDelimiters,
            Default_NewText = if NewText is null then "" else NewText,
            Default_TrimResult = if TrimResult is null then true else TrimResult,
            Default_FixDoubleSpaces = if FixDoubleSpaces is null then true else FixDoubleSpaces,
            //Do work
            TextBetweenDelimiters = Text.BetweenDelimiters(Text, StartDelimiter, EndDelimiter),
            TextToReplace = if Default_ReplaceDelimiters then Text.Combine({StartDelimiter,TextBetweenDelimiters,EndDelimiter}) else TextBetweenDelimiters,
            ReplacedText = Text.Replace(Text, TextToReplace, Default_NewText),
            //Clean Result
            TrimmedText = if Default_TrimResult then Text.Trim(ReplacedText) else ReplacedText,
            FixedSpaces = if Default_FixDoubleSpaces then Text.Replace(TrimmedText, "  ", " ") else TrimmedText
        in
            FixedSpaces
in
    fun_ReplaceTextBetweenDelimiters

Then, we can test it like this:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WcnRyVtAw1FTSAbGUYnWilVzd3BU0jEAiQBZYJCIyCqhGRwEsCOQoxcYCAA==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [TestData = _t, TargetData = _t]),
    ChangeType = Table.TransformColumnTypes(Source,{{"TestData", type text}, {"TargetData", type text}}),
    RunFunction = Table.AddColumn(ChangeType, "NewText", each fun_ReplaceTextBetweenDelimiters([TestData], "(", ")", true), type text),
    TestResult = Table.AddColumn(RunFunction, "Test", each [TargetData]=[NewText], type logical)
in
    TestResult

Input:

TestData TargetData
ABC (1)ABC
EFG (2)EFG
XYZ (1, 2)XYZ

Output:

TestData TargetData NewText Test
ABC (1)ABCABCTRUE
EFG (2)EFGEFGTRUE
XYZ (1, 2)XYZXYZTRUE
vgangabharani
Frequent Visitor

New source->New blank query-> Highlight the fx (to convert it into function)
Function 1 to remove all HTML Tags

(txt as text) =>
[
fnRemoveFirstTag = (HTML as text)=>

let
TextBetween=Text.BetweenDelimiters(HTML,"<",">"),
Output=if Text.Length(TextBetween)>0 then

Text.Replace(HTML,
Text.Insert(
Text.Insert(TextBetween,0,"<"),
Text.Length(TextBetween)+1,">")
,"")
else
HTML
in
Output,
fnRemoveHTMLTags = (y as text)=>
if y<>null then
if fnRemoveFirstTag(y)<>y then
@fnRemoveHTMLTags(fnRemoveFirstTag(y))
else
y
else "",

Output =if txt <> null then @fnRemoveHTMLTags(txt) else ""
][Output]


Function 2 to remove encodings

let
Source = (txt as text) =>
[

fnRemovecodes = (y as text)=>
Text.Replace(Text.Replace(
Text.Replace(Text.Replace(Text.Replace(
Text.Replace(Text.Replace(
Text.Replace(Text.Replace(
Text.Replace(Text.Replace(
Text.Replace(
Text.Replace(y,"&apos;","'"),
"&lt;","<"),
"&gt;",">"),
"&quot;",""""),
"&amp;","&"),
"&nbsp;"," "),
"&copy;","©"),
"&reg;","®"),
"&cent;","¢"),
"&pound;","£"),
"&yen;","¥"),
"&euro;","€"),"&#39;","'"),


Output =if txt <> null then @fnRemovecodes(txt) else ""
][Output]
in
Source


To apply to the column of your table

- Go to the table, clock on the column
- click on Add column-> Invoke custom function

 

Give  aname to new column. Select the function1 and for txt select the column that you need to convert from HTML to text. Then on the new column select function 2 and for txt select the new column from previous step.

 



 

Interkoubess
Solution Sage
Solution Sage

Hi @andris_,

 

I will rather use something like Text. Start(Text.Range etc..)

 

You can make a try with this formula and if it does not work please provide example I can copy easily and make the change ( not printscreen).

Text.Range([Description],Text.PositionOf([Description],"=")+2,Text.PositionOf([Description],"<p>")-Text.PositionOf([Description],"=")-4

Hope it helps...

 

Ninter

Hi @Interkoubess,

 

Thank you for your reply!

 

I think Text.Start can be a solution as well, but my focus is on the iteration part, because as you can see in my sample dataset, there are more than one delimiters in a record. What I'd like to do is to remove all the html elements, and keep/concatenate the TEXT parts.

 

With the following formula, you can have the same result, as yours:

Text.BeforeDelimiter([Description],"<p>")

Tell me if I misunderstood You, but I can't see how can I do it in an iterative way.

 

(Here's some sample data you can use)

IDTopicCustomerDescriptionStatus
1Implementation of XYCustomer 1<div class="ExternalClass6F25214767CC48ADBB6B525F221F78B4"><p>TEXT </p><p>&#160;</p><p>TEXT</p><p>TEXT</p>TEXT<strong>TEXT</strong></li></ul></div>Opened
2Product ManagamentCustomer 2<div class="ExternalClass45BB2C5321C95J49B0F78879862B8310C"><p>TEXT</p><p>TEXT</p><strong>TEXT</strong></p></div>Closed
3Exchange MachinesCustomer 1<div class="ExternalClassEC3F6FC6C1D64D78980B1B5FF615B9860"><p>TEXT</p></div>Opened
4No.: 104085; checking statusCustomer 3<div class="ExternalClassDC1495BEBABE408B9855E7DDF1B1D6AD"><p>TEXT</p><p>TEXTOpened
5Backup machinesCustomer 4<div class="ExternalClass7EB7A021D123446BCA2D83011F9CC301C"><p>TEXT</p><p><br></p><p>TEXT</p><p>TEXT</p></div>Closed

 
Regards,
Andris

 

Hi @andris_,

 

Please retry the last formula.

 

Thank you.

Hey @Interkoubess,

 

This formula (with minor changes) is working, but it will only give you the first text. The problem is, that there are more between in a record, and every record is different (so, like in my sample data, the first record has 5 TEXT parts, the second one 3 TEXT parts etc).

 

What do you think the general solution should be?

 

Regards,

Andris

Hi @andris_,

 

Have you solved your problem?

 

If you have soloved, please always accept the replies making sense as solution to your question so that people who may have the same question can get the solution directly.

 

If you still need help, you could have a reference of this video.

 

Best Regards,

Cherry

Community Support Team _ Cherry Gao
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Stachu
Community Champion
Community Champion

you can use recursion in PowerQuery
crate a blank query and paste this code into editor, then call this function on your HTML column

(txt as text) =>
[ 
fnRemoveFirstTag = (HTML as text)=>
    let
        OpeningTag = Text.PositionOf(HTML,"<"),
        ClosingTag = Text.PositionOf(HTML,">"),
        Output = 
            if OpeningTag = -1 
            then HTML
            else Text.RemoveRange(HTML,OpeningTag,ClosingTag-OpeningTag+1)
    in
        Output,
fnRemoveHTMLTags = (y as text)=>
    if fnRemoveFirstTag(y) = y
    then y 
    else @fnRemoveHTMLTags(fnRemoveFirstTag(y)),
Output = @fnRemoveHTMLTags(txt) 
][Output]

EDIT - typo in syntax



Did I answer your question? Mark my post as a solution!
Thank you for the kudos 🙂

Hi @Stachu , I'm new to using power query to extract data from sharepoint.

 

I followed your code to remove all the HTML but there is this error msg. Appreciate your help on this.

Gill_0-1667556223128.png

 

Stachu
Community Champion
Community Champion

hi @Gill 

it seems you've put the code into the formula bar rather than an advanced editor:
https://learn.microsoft.com/en-us/power-query/power-query-ui#the-advanced-editor



Did I answer your question? Mark my post as a solution!
Thank you for the kudos 🙂

@Stachu ,

I'm new to calling functions on columns so please bear with me. 

The statement " then call this function on your HTML column" confuses me a little bit.   The query solution you provided has been created in my project and is called "RemoveHTML". 

 

The column with the HTML is called "Comments" and the table it is located in is called "Issues". 

 

Do I create a function from the Query first?  If so, the system wants me to create a parameter as it says the Query doesn't include a parameter.  

 

Not sure what my next steps are, any assistance is appreciated. 

 

Patrick

 

@pAAmckelvey in the Issues table you need to create a new column with this syntax:

 

 RemoveHTML([Comments])

 

 



Did I answer your question? Mark my post as a solution!
Thank you for the kudos 🙂

Anonymous
Not applicable

I am just getting into functions myself and have been working on a problem to remove HTML tags from within a comments column.

I tried using the Function as you outlined, but I keep getting an error that I am passing a parameter to a functions that expects 0 parameters.

(1) In my steps, I created a blank query and copied your code into it.

(2) I then created a function from the blank query.

(3) Created a Custom Column and added the syntax you outlined in your response. (Probably where I am going wrong.)

(4) It then returns the error I mention above.

Any thoughts on what I might be doing wrong?

Thanks

Joe Mays

 

Stachu
Community Champion
Community Champion

@Anonymous can you create a new post describing specific problem that you have and share the code for all the queries that you use?



Did I answer your question? Mark my post as a solution!
Thank you for the kudos 🙂

Anonymous
Not applicable

@Stachu No problem, I will create a new post and outlined the specific issue I am having. Thanks for your response.

Hey @Stachu,

 

Sorry for not replying until now, but I wasn't able to check it til today. But your solution is awesome, it's working! Thank you! 🙂

 

Best regards,

Andris

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.