Solved: Re: Replicating regex in Power BI

kdennis · ‎09-19-2023

Hello all,

I'm having trouble replicating some regex functionality using Power Query M.

I have a table with a single column of notes from an ODBC data source. The issue is that due to the weird way some of the data is stored, some of the rows have the data stored as plain text while other rows are stored with the RTF encoding in plain text form.

The table below shows a few example of rows in my table.

Notes

{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 Arial;}}\viewkind4\uc1\pard\f0\fs18 (25 pcs.) Lot of Distribution – Includes Mounted Bar with Aluminum\par

Double Supply\par

Egg-crate.\par

- Items in White\par

- Blade Supplied as Shown\par}

Wall Mounted Jumpers

{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset0 Cambria;}{\f1\fnil\fcharset0 Arial;}}

{\*\generator Riched20 10.0.19041}\viewkind4\uc1

\pard\widctlpar\f0\fs22 Pieces of Distribution \f1\fs18\par

}

I have a link to download a Power BI file which contains some of the data below.

Dropbox Link to Power BI file

In order to show the data clearly in my report, I used regex through Python scripting to remove the extra rtf formatting elements. Below is the Python code I used to achieve this, with emphasis on my use of `re.sub(regex, "", text, 0, re.MULTILINE)`:

# 'dataset' holds the input data for this script

import re
import json

regex = r"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?"

def get_edited_notes(text):
    edited_note = re.sub(regex, "", text, 0, re.MULTILINE) #remove anything that matches regex
    return edited_note.lstrip()

edited_list = []


for i in range(len(dataset)):
    text = dataset.iat[i,0]
    edited_note = get_edited_notes(str(text))
    edited_list.append(edited_note)

dataset['edited_note'] = edited_list

This is the desired result which works using Python scripting

Notes

(25 pcs.) Lot of Distribution – Includes Mounted Bar with Aluminum

Double Supply

Egg-crate.

- Items in White

- Blade Supplied as Shown

Wall Mounted Jumpers

Pieces of Distribution

Unfortunately, I discovered that Python scripting is not compatible with the On-premises data gateway (it only works for personal and my solution requires that I use a standard gateway):

(This is an idea that has been proposed but has not been implemented to my knowledge).

I'm wondering if there is any way I can perform the text extraction similarly using Power Query M. I am also open to tryingother workarounds including using DAX.

Thanks in advance!

kdennis · ‎09-28-2023

I decided to use Power Query M's `Text.Replace` function to solve this. This is not a very efficient solution and significantly increases the time for filtering the rtf text compared to my initial Python solution.

In the code below, I create a list of the common rtf tags I found, and use M's `List.Accumulate` together with `Text.Replace` to replace all the items in the list from each row.

#"Remove RTF tags" = Table.AddColumn(
        Notes, 
        "Edited_Note", 
        each 
        let
            note = [Note], 
            bList = List.Transform(List.Numbers(0, 24), each Text.Combine({"\b", Text.From(_)})),  //create a list of \b0 - \b24
            fsList = List.Transform(List.Numbers(0, 24), each Text.Combine({"\fs", Text.From(_)})),  //create a list of \fs0 - \fs24
            redList = List.Transform(List.Numbers(0, 255), each Text.Combine({"\red", Text.From(_)})),  //create a list of \red0 - \red255
            blueList = List.Transform(List.Numbers(0, 255), each Text.Combine({"\blue", Text.From(_)})),  //create a list of \blue0 - \blue255
            greenList = List.Transform(
            List.Numbers(0, 255), 
            each Text.Combine({"\green", Text.From(_)})
            ),  //create a list of \green0 - \green255
            items = {
                "{\rtf1", 
                "\ansicpg1252", 
                "\ansi", 
                "\deff0", 
                "\deflang1033", 
                "{\fonttbl", 
                "{\f0", 
                "\fnil", 
                "\fcharset0", 
                "\viewkind4", 
                "\uc1", 
                "\pard", 
                "\f0", 
                "\par}", 
                "\par", 
                "\nouicompat", 
                "\froman", 
                "\fprq2", 
                " Arial;}}", 
                "Cambria;}", 
                "{\f1", 
                "\f1", 
                "\widctlpar", 
                "{\*\generator Riched20 10.0.19041}", 
                "\qj", 
                "\nowidctlpar8", 
                "\lang", 
                "{\*\generator Riched20", 
                "#(lf)}", 
                "\tab", 
                "{\colortbl ;", 
                "\red0", 
                "\green0", 
                "\blue0",
                "\cf1", 
                "\cf0", 
                "\fbidis", 
                "\fswiss", 
                "Calibri;}", 
                "\ltrpar", 
                "Arial;}",  
                ";}", 
                "\i0", 
                "\i ",  
                "\b ",
                "\fcharset1 Segoe UI Symbol}"
            }, 
            combinedList = List.Combine(
            {
                items, 
                List.Reverse(fsList), 
                List.Reverse(redList), 
                List.Reverse(blueList), 
                List.Reverse(greenList), 
                List.Reverse(bList),
                {"\b", "\i"}
            }
            )
        in
            List.Accumulate(combinedList, Note, (state, current) => Text.Replace(state, current, ""))
    ),

    #"Change RDbl Quotes" = Table.TransformColumns(
        #"Remove RTF tags", 
        {
            {"Edited_Note", each List.Accumulate({"\rdblquote"},_, (state, current) => Text.Replace(state, current,""""))} //replace "\rdblquote with a "
        }
    ),

    #"Change LSngl Quotes" = Table.TransformColumns(
        #"Change RDbl Quotes", 
        {
            {"Edited_Note", each List.Accumulate({"\rquote"},_, (state, current) => Text.Replace(state, current,"'"))} //replace "\rquote with a '
        }
    ),
    #"Trimmed Text1" = Table.TransformColumns(#"Remove RTF tags",{{"Note", Text.Trim, type text}}) //removes whitespaces created at the beginning and end of the text.

View solution in original post

kdennis · ‎09-28-2023

I decided to use Power Query M's `Text.Replace` function to solve this. This is not a very efficient solution and significantly increases the time for filtering the rtf text compared to my initial Python solution.

In the code below, I create a list of the common rtf tags I found, and use M's `List.Accumulate` together with `Text.Replace` to replace all the items in the list from each row.

#"Remove RTF tags" = Table.AddColumn(
        Notes, 
        "Edited_Note", 
        each 
        let
            note = [Note], 
            bList = List.Transform(List.Numbers(0, 24), each Text.Combine({"\b", Text.From(_)})),  //create a list of \b0 - \b24
            fsList = List.Transform(List.Numbers(0, 24), each Text.Combine({"\fs", Text.From(_)})),  //create a list of \fs0 - \fs24
            redList = List.Transform(List.Numbers(0, 255), each Text.Combine({"\red", Text.From(_)})),  //create a list of \red0 - \red255
            blueList = List.Transform(List.Numbers(0, 255), each Text.Combine({"\blue", Text.From(_)})),  //create a list of \blue0 - \blue255
            greenList = List.Transform(
            List.Numbers(0, 255), 
            each Text.Combine({"\green", Text.From(_)})
            ),  //create a list of \green0 - \green255
            items = {
                "{\rtf1", 
                "\ansicpg1252", 
                "\ansi", 
                "\deff0", 
                "\deflang1033", 
                "{\fonttbl", 
                "{\f0", 
                "\fnil", 
                "\fcharset0", 
                "\viewkind4", 
                "\uc1", 
                "\pard", 
                "\f0", 
                "\par}", 
                "\par", 
                "\nouicompat", 
                "\froman", 
                "\fprq2", 
                " Arial;}}", 
                "Cambria;}", 
                "{\f1", 
                "\f1", 
                "\widctlpar", 
                "{\*\generator Riched20 10.0.19041}", 
                "\qj", 
                "\nowidctlpar8", 
                "\lang", 
                "{\*\generator Riched20", 
                "#(lf)}", 
                "\tab", 
                "{\colortbl ;", 
                "\red0", 
                "\green0", 
                "\blue0",
                "\cf1", 
                "\cf0", 
                "\fbidis", 
                "\fswiss", 
                "Calibri;}", 
                "\ltrpar", 
                "Arial;}",  
                ";}", 
                "\i0", 
                "\i ",  
                "\b ",
                "\fcharset1 Segoe UI Symbol}"
            }, 
            combinedList = List.Combine(
            {
                items, 
                List.Reverse(fsList), 
                List.Reverse(redList), 
                List.Reverse(blueList), 
                List.Reverse(greenList), 
                List.Reverse(bList),
                {"\b", "\i"}
            }
            )
        in
            List.Accumulate(combinedList, Note, (state, current) => Text.Replace(state, current, ""))
    ),

    #"Change RDbl Quotes" = Table.TransformColumns(
        #"Remove RTF tags", 
        {
            {"Edited_Note", each List.Accumulate({"\rdblquote"},_, (state, current) => Text.Replace(state, current,""""))} //replace "\rdblquote with a "
        }
    ),

    #"Change LSngl Quotes" = Table.TransformColumns(
        #"Change RDbl Quotes", 
        {
            {"Edited_Note", each List.Accumulate({"\rquote"},_, (state, current) => Text.Replace(state, current,"'"))} //replace "\rquote with a '
        }
    ),
    #"Trimmed Text1" = Table.TransformColumns(#"Remove RTF tags",{{"Note", Text.Trim, type text}}) //removes whitespaces created at the beginning and end of the text.

archery · ‎01-21-2024

As for me, you should use re.gsub(), not re.sub(). re.sub() only deletes first occurrence, and re.gsub() deletes "globally" all

You may use https://regex101.com/ for testing you regex

Anonymous · ‎09-19-2023

I still don't see why you can't use the Regexp inside of one of the HTML functions?

--Nate

ronrsnfld · ‎09-19-2023

Are you using Power BI desktop? If so you can write a custom function which makes use of the Javascript engine. See powerbi - Regexp in Power Query using JavaScript - Stack Overflow

kdennis · ‎09-19-2023

Thanks for your response! I am working in Power BI desktop but I want to be able to publish to Power BI service and use a standard getway. As a result, using Web.Page would not be a viable option.

ronrsnfld · ‎09-19-2023

I guess you'll have to write something just using the built-in text functions in M. Or possibly use a different tool to convert it first.

kdennis · ‎09-19-2023

Right. I'm relatively new to M, do you have suggestions for this?

ronrsnfld · ‎09-20-2023

M's text functions are similar to other languages, so I guess I'd start by developing the algorithm, and then reviewing the syntax of the various M text functions so you can code it.

Replicating regex in Power BI

Helpful resources

Join our Fabric User Panel

Power BI Monthly Update - June 2025

Fabric Community Update - June 2025

Join the #PBI10 DataViz contest

Replicating regex in Power BI

Helpful resources

Join our Fabric User Panel

Power BI Monthly Update - June 2025

Fabric Community Update - June 2025