Check your eligibility for this 50% exam voucher offer and join us for free live learning sessions to get prepared for Exam DP-700.
Get StartedDon't miss out! 2025 Microsoft Fabric Community Conference, March 31 - April 2, Las Vegas, Nevada. Use code MSCUST for a $150 discount. Prices go up February 11th. Register now.
Hello all,
I'm having trouble replicating some regex functionality using Power Query M.
I have a table with a single column of notes from an ODBC data source. The issue is that due to the weird way some of the data is stored, some of the rows have the data stored as plain text while other rows are stored with the RTF encoding in plain text form.
The table below shows a few example of rows in my table.
Notes |
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 Arial;}}\viewkind4\uc1\pard\f0\fs18 (25 pcs.) Lot of Distribution – Includes Mounted Bar with Aluminum\par Double Supply\par Egg-crate.\par - Items in White\par - Blade Supplied as Shown\par} |
Wall Mounted Jumpers |
{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset0 Cambria;}{\f1\fnil\fcharset0 Arial;}} {\*\generator Riched20 10.0.19041}\viewkind4\uc1 \pard\widctlpar\f0\fs22 Pieces of Distribution \f1\fs18\par } |
I have a link to download a Power BI file which contains some of the data below.
In order to show the data clearly in my report, I used regex through Python scripting to remove the extra rtf formatting elements. Below is the Python code I used to achieve this, with emphasis on my use of `re.sub(regex, "", text, 0, re.MULTILINE)`:
# 'dataset' holds the input data for this script
import re
import json
regex = r"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?"
def get_edited_notes(text):
edited_note = re.sub(regex, "", text, 0, re.MULTILINE) #remove anything that matches regex
return edited_note.lstrip()
edited_list = []
for i in range(len(dataset)):
text = dataset.iat[i,0]
edited_note = get_edited_notes(str(text))
edited_list.append(edited_note)
dataset['edited_note'] = edited_list
This is the desired result which works using Python scripting
Notes |
(25 pcs.) Lot of Distribution – Includes Mounted Bar with Aluminum Double Supply Egg-crate. - Items in White - Blade Supplied as Shown |
Wall Mounted Jumpers |
Pieces of Distribution |
Unfortunately, I discovered that Python scripting is not compatible with the On-premises data gateway (it only works for personal and my solution requires that I use a standard gateway):
(This is an idea that has been proposed but has not been implemented to my knowledge).
I'm wondering if there is any way I can perform the text extraction similarly using Power Query M. I am also open to tryingother workarounds including using DAX.
Thanks in advance!
Solved! Go to Solution.
I decided to use Power Query M's `Text.Replace` function to solve this. This is not a very efficient solution and significantly increases the time for filtering the rtf text compared to my initial Python solution.
In the code below, I create a list of the common rtf tags I found, and use M's `List.Accumulate` together with `Text.Replace` to replace all the items in the list from each row.
#"Remove RTF tags" = Table.AddColumn(
Notes,
"Edited_Note",
each
let
note = [Note],
bList = List.Transform(List.Numbers(0, 24), each Text.Combine({"\b", Text.From(_)})), //create a list of \b0 - \b24
fsList = List.Transform(List.Numbers(0, 24), each Text.Combine({"\fs", Text.From(_)})), //create a list of \fs0 - \fs24
redList = List.Transform(List.Numbers(0, 255), each Text.Combine({"\red", Text.From(_)})), //create a list of \red0 - \red255
blueList = List.Transform(List.Numbers(0, 255), each Text.Combine({"\blue", Text.From(_)})), //create a list of \blue0 - \blue255
greenList = List.Transform(
List.Numbers(0, 255),
each Text.Combine({"\green", Text.From(_)})
), //create a list of \green0 - \green255
items = {
"{\rtf1",
"\ansicpg1252",
"\ansi",
"\deff0",
"\deflang1033",
"{\fonttbl",
"{\f0",
"\fnil",
"\fcharset0",
"\viewkind4",
"\uc1",
"\pard",
"\f0",
"\par}",
"\par",
"\nouicompat",
"\froman",
"\fprq2",
" Arial;}}",
"Cambria;}",
"{\f1",
"\f1",
"\widctlpar",
"{\*\generator Riched20 10.0.19041}",
"\qj",
"\nowidctlpar8",
"\lang",
"{\*\generator Riched20",
"#(lf)}",
"\tab",
"{\colortbl ;",
"\red0",
"\green0",
"\blue0",
"\cf1",
"\cf0",
"\fbidis",
"\fswiss",
"Calibri;}",
"\ltrpar",
"Arial;}",
";}",
"\i0",
"\i ",
"\b ",
"\fcharset1 Segoe UI Symbol}"
},
combinedList = List.Combine(
{
items,
List.Reverse(fsList),
List.Reverse(redList),
List.Reverse(blueList),
List.Reverse(greenList),
List.Reverse(bList),
{"\b", "\i"}
}
)
in
List.Accumulate(combinedList, Note, (state, current) => Text.Replace(state, current, ""))
),
#"Change RDbl Quotes" = Table.TransformColumns(
#"Remove RTF tags",
{
{"Edited_Note", each List.Accumulate({"\rdblquote"},_, (state, current) => Text.Replace(state, current,""""))} //replace "\rdblquote with a "
}
),
#"Change LSngl Quotes" = Table.TransformColumns(
#"Change RDbl Quotes",
{
{"Edited_Note", each List.Accumulate({"\rquote"},_, (state, current) => Text.Replace(state, current,"'"))} //replace "\rquote with a '
}
),
#"Trimmed Text1" = Table.TransformColumns(#"Remove RTF tags",{{"Note", Text.Trim, type text}}) //removes whitespaces created at the beginning and end of the text.
I decided to use Power Query M's `Text.Replace` function to solve this. This is not a very efficient solution and significantly increases the time for filtering the rtf text compared to my initial Python solution.
In the code below, I create a list of the common rtf tags I found, and use M's `List.Accumulate` together with `Text.Replace` to replace all the items in the list from each row.
#"Remove RTF tags" = Table.AddColumn(
Notes,
"Edited_Note",
each
let
note = [Note],
bList = List.Transform(List.Numbers(0, 24), each Text.Combine({"\b", Text.From(_)})), //create a list of \b0 - \b24
fsList = List.Transform(List.Numbers(0, 24), each Text.Combine({"\fs", Text.From(_)})), //create a list of \fs0 - \fs24
redList = List.Transform(List.Numbers(0, 255), each Text.Combine({"\red", Text.From(_)})), //create a list of \red0 - \red255
blueList = List.Transform(List.Numbers(0, 255), each Text.Combine({"\blue", Text.From(_)})), //create a list of \blue0 - \blue255
greenList = List.Transform(
List.Numbers(0, 255),
each Text.Combine({"\green", Text.From(_)})
), //create a list of \green0 - \green255
items = {
"{\rtf1",
"\ansicpg1252",
"\ansi",
"\deff0",
"\deflang1033",
"{\fonttbl",
"{\f0",
"\fnil",
"\fcharset0",
"\viewkind4",
"\uc1",
"\pard",
"\f0",
"\par}",
"\par",
"\nouicompat",
"\froman",
"\fprq2",
" Arial;}}",
"Cambria;}",
"{\f1",
"\f1",
"\widctlpar",
"{\*\generator Riched20 10.0.19041}",
"\qj",
"\nowidctlpar8",
"\lang",
"{\*\generator Riched20",
"#(lf)}",
"\tab",
"{\colortbl ;",
"\red0",
"\green0",
"\blue0",
"\cf1",
"\cf0",
"\fbidis",
"\fswiss",
"Calibri;}",
"\ltrpar",
"Arial;}",
";}",
"\i0",
"\i ",
"\b ",
"\fcharset1 Segoe UI Symbol}"
},
combinedList = List.Combine(
{
items,
List.Reverse(fsList),
List.Reverse(redList),
List.Reverse(blueList),
List.Reverse(greenList),
List.Reverse(bList),
{"\b", "\i"}
}
)
in
List.Accumulate(combinedList, Note, (state, current) => Text.Replace(state, current, ""))
),
#"Change RDbl Quotes" = Table.TransformColumns(
#"Remove RTF tags",
{
{"Edited_Note", each List.Accumulate({"\rdblquote"},_, (state, current) => Text.Replace(state, current,""""))} //replace "\rdblquote with a "
}
),
#"Change LSngl Quotes" = Table.TransformColumns(
#"Change RDbl Quotes",
{
{"Edited_Note", each List.Accumulate({"\rquote"},_, (state, current) => Text.Replace(state, current,"'"))} //replace "\rquote with a '
}
),
#"Trimmed Text1" = Table.TransformColumns(#"Remove RTF tags",{{"Note", Text.Trim, type text}}) //removes whitespaces created at the beginning and end of the text.
As for me, you should use re.gsub(), not re.sub(). re.sub() only deletes first occurrence, and re.gsub() deletes "globally" all
You may use https://regex101.com/ for testing you regex
I still don't see why you can't use the Regexp inside of one of the HTML functions?
--Nate
Are you using Power BI desktop? If so you can write a custom function which makes use of the Javascript engine. See powerbi - Regexp in Power Query using JavaScript - Stack Overflow
Thanks for your response! I am working in Power BI desktop but I want to be able to publish to Power BI service and use a standard getway. As a result, using Web.Page would not be a viable option.
I guess you'll have to write something just using the built-in text functions in M. Or possibly use a different tool to convert it first.
Right. I'm relatively new to M, do you have suggestions for this?
M's text functions are similar to other languages, so I guess I'd start by developing the algorithm, and then reviewing the syntax of the various M text functions so you can code it.