cancel
Showing results for 
Search instead for 
Did you mean: 

Fabric is Generally Available. Browse Fabric Presentations. Work towards your Fabric certification with the Cloud Skills Challenge.

Reply
kdennis
Frequent Visitor

Replicating regex in Power BI

Hello all,

 

I'm having trouble replicating some regex functionality using Power Query M.

 

I have a table with a single column of notes from an ODBC data source. The issue is that due to the weird way some of the data is stored, some of the rows have the data stored as plain text while other rows are stored with the RTF encoding in plain text form.

 

The table below shows a few example of rows in my table.

 

Notes
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 Arial;}}\viewkind4\uc1\pard\f0\fs18 (25 pcs.) Lot of Distribution – Includes Mounted Bar with Aluminum\par

Double Supply\par

Egg-crate.\par

     - Items in White\par

     - Blade Supplied as Shown\par}
Wall Mounted Jumpers
{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset0 Cambria;}{\f1\fnil\fcharset0 Arial;}}

{\*\generator Riched20 10.0.19041}\viewkind4\uc1

\pard\widctlpar\f0\fs22 Pieces of Distribution \f1\fs18\par

}

 

I have a link to download a Power BI file which contains some of the data below.

Dropbox Link to Power BI file 

 

In order to show the data clearly in my report, I used regex through Python scripting to remove the extra rtf formatting elements. Below is the Python code I used to achieve this, with emphasis on my use of `re.sub(regex, "", text, 0, re.MULTILINE)`:

 

 

# 'dataset' holds the input data for this script

import re
import json

regex = r"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?"

def get_edited_notes(text):
    edited_note = re.sub(regex, "", text, 0, re.MULTILINE) #remove anything that matches regex
    return edited_note.lstrip()

edited_list = []


for i in range(len(dataset)):
    text = dataset.iat[i,0]
    edited_note = get_edited_notes(str(text))
    edited_list.append(edited_note)

dataset['edited_note'] = edited_list

 

 

 

This is the desired result which works using Python scripting

Notes
(25 pcs.) Lot of Distribution – Includes Mounted Bar with Aluminum

Double Supply

Egg-crate.

     - Items in White

     - Blade Supplied as Shown
Wall Mounted Jumpers
Pieces of Distribution

 

Unfortunately, I discovered that Python scripting is not compatible with the On-premises data gateway (it only works for personal and my solution requires that I use a standard gateway):

kdennis_0-1695135933695.png

 

(This is an idea that has been proposed but has not been implemented to my knowledge).

 

I'm wondering if there is any way I can perform the text extraction similarly using Power Query M. I am also open to tryingother workarounds including using DAX.

 

Thanks in advance!

1 ACCEPTED SOLUTION
kdennis
Frequent Visitor

I decided to use Power Query M's `Text.Replace` function to solve this. This is not a very efficient solution and significantly increases the time for filtering the rtf text compared to my initial Python solution.

 

In the code below, I create a list of the common rtf tags I found, and use M's `List.Accumulate` together with `Text.Replace` to replace all the items in the list from each row.

 

#"Remove RTF tags" = Table.AddColumn(
        Notes, 
        "Edited_Note", 
        each 
        let
            note = [Note], 
            bList = List.Transform(List.Numbers(0, 24), each Text.Combine({"\b", Text.From(_)})),  //create a list of \b0 - \b24
            fsList = List.Transform(List.Numbers(0, 24), each Text.Combine({"\fs", Text.From(_)})),  //create a list of \fs0 - \fs24
            redList = List.Transform(List.Numbers(0, 255), each Text.Combine({"\red", Text.From(_)})),  //create a list of \red0 - \red255
            blueList = List.Transform(List.Numbers(0, 255), each Text.Combine({"\blue", Text.From(_)})),  //create a list of \blue0 - \blue255
            greenList = List.Transform(
            List.Numbers(0, 255), 
            each Text.Combine({"\green", Text.From(_)})
            ),  //create a list of \green0 - \green255
            items = {
                "{\rtf1", 
                "\ansicpg1252", 
                "\ansi", 
                "\deff0", 
                "\deflang1033", 
                "{\fonttbl", 
                "{\f0", 
                "\fnil", 
                "\fcharset0", 
                "\viewkind4", 
                "\uc1", 
                "\pard", 
                "\f0", 
                "\par}", 
                "\par", 
                "\nouicompat", 
                "\froman", 
                "\fprq2", 
                " Arial;}}", 
                "Cambria;}", 
                "{\f1", 
                "\f1", 
                "\widctlpar", 
                "{\*\generator Riched20 10.0.19041}", 
                "\qj", 
                "\nowidctlpar8", 
                "\lang", 
                "{\*\generator Riched20", 
                "#(lf)}", 
                "\tab", 
                "{\colortbl ;", 
                "\red0", 
                "\green0", 
                "\blue0",
                "\cf1", 
                "\cf0", 
                "\fbidis", 
                "\fswiss", 
                "Calibri;}", 
                "\ltrpar", 
                "Arial;}",  
                ";}", 
                "\i0", 
                "\i ",  
                "\b ",
                "\fcharset1 Segoe UI Symbol}"
            }, 
            combinedList = List.Combine(
            {
                items, 
                List.Reverse(fsList), 
                List.Reverse(redList), 
                List.Reverse(blueList), 
                List.Reverse(greenList), 
                List.Reverse(bList),
                {"\b", "\i"}
            }
            )
        in
            List.Accumulate(combinedList, Note, (state, current) => Text.Replace(state, current, ""))
    ),

    #"Change RDbl Quotes" = Table.TransformColumns(
        #"Remove RTF tags", 
        {
            {"Edited_Note", each List.Accumulate({"\rdblquote"},_, (state, current) => Text.Replace(state, current,""""))} //replace "\rdblquote with a "
        }
    ),

    #"Change LSngl Quotes" = Table.TransformColumns(
        #"Change RDbl Quotes", 
        {
            {"Edited_Note", each List.Accumulate({"\rquote"},_, (state, current) => Text.Replace(state, current,"'"))} //replace "\rquote with a '
        }
    ),
    #"Trimmed Text1" = Table.TransformColumns(#"Remove RTF tags",{{"Note", Text.Trim, type text}}) //removes whitespaces created at the beginning and end of the text.

 

View solution in original post

7 REPLIES 7
kdennis
Frequent Visitor

I decided to use Power Query M's `Text.Replace` function to solve this. This is not a very efficient solution and significantly increases the time for filtering the rtf text compared to my initial Python solution.

 

In the code below, I create a list of the common rtf tags I found, and use M's `List.Accumulate` together with `Text.Replace` to replace all the items in the list from each row.

 

#"Remove RTF tags" = Table.AddColumn(
        Notes, 
        "Edited_Note", 
        each 
        let
            note = [Note], 
            bList = List.Transform(List.Numbers(0, 24), each Text.Combine({"\b", Text.From(_)})),  //create a list of \b0 - \b24
            fsList = List.Transform(List.Numbers(0, 24), each Text.Combine({"\fs", Text.From(_)})),  //create a list of \fs0 - \fs24
            redList = List.Transform(List.Numbers(0, 255), each Text.Combine({"\red", Text.From(_)})),  //create a list of \red0 - \red255
            blueList = List.Transform(List.Numbers(0, 255), each Text.Combine({"\blue", Text.From(_)})),  //create a list of \blue0 - \blue255
            greenList = List.Transform(
            List.Numbers(0, 255), 
            each Text.Combine({"\green", Text.From(_)})
            ),  //create a list of \green0 - \green255
            items = {
                "{\rtf1", 
                "\ansicpg1252", 
                "\ansi", 
                "\deff0", 
                "\deflang1033", 
                "{\fonttbl", 
                "{\f0", 
                "\fnil", 
                "\fcharset0", 
                "\viewkind4", 
                "\uc1", 
                "\pard", 
                "\f0", 
                "\par}", 
                "\par", 
                "\nouicompat", 
                "\froman", 
                "\fprq2", 
                " Arial;}}", 
                "Cambria;}", 
                "{\f1", 
                "\f1", 
                "\widctlpar", 
                "{\*\generator Riched20 10.0.19041}", 
                "\qj", 
                "\nowidctlpar8", 
                "\lang", 
                "{\*\generator Riched20", 
                "#(lf)}", 
                "\tab", 
                "{\colortbl ;", 
                "\red0", 
                "\green0", 
                "\blue0",
                "\cf1", 
                "\cf0", 
                "\fbidis", 
                "\fswiss", 
                "Calibri;}", 
                "\ltrpar", 
                "Arial;}",  
                ";}", 
                "\i0", 
                "\i ",  
                "\b ",
                "\fcharset1 Segoe UI Symbol}"
            }, 
            combinedList = List.Combine(
            {
                items, 
                List.Reverse(fsList), 
                List.Reverse(redList), 
                List.Reverse(blueList), 
                List.Reverse(greenList), 
                List.Reverse(bList),
                {"\b", "\i"}
            }
            )
        in
            List.Accumulate(combinedList, Note, (state, current) => Text.Replace(state, current, ""))
    ),

    #"Change RDbl Quotes" = Table.TransformColumns(
        #"Remove RTF tags", 
        {
            {"Edited_Note", each List.Accumulate({"\rdblquote"},_, (state, current) => Text.Replace(state, current,""""))} //replace "\rdblquote with a "
        }
    ),

    #"Change LSngl Quotes" = Table.TransformColumns(
        #"Change RDbl Quotes", 
        {
            {"Edited_Note", each List.Accumulate({"\rquote"},_, (state, current) => Text.Replace(state, current,"'"))} //replace "\rquote with a '
        }
    ),
    #"Trimmed Text1" = Table.TransformColumns(#"Remove RTF tags",{{"Note", Text.Trim, type text}}) //removes whitespaces created at the beginning and end of the text.

 

watkinnc
Super User
Super User

I still don't see why you can't use the Regexp inside of one of the HTML functions?

 

--Nate


I’m usually answering from my phone, which means the results are visualized only in my mind. You’ll need to use my answer to know that it works—but it will work!!
ronrsnfld
Resident Rockstar
Resident Rockstar

Are you using Power BI desktop? If so you can write a custom function which makes use of the Javascript engine. See powerbi - Regexp in Power Query using JavaScript - Stack Overflow

 

Thanks for your response! I am working in Power BI desktop but I want to be able to publish to Power BI service and use a standard getway. As a result, using Web.Page would not be a viable option.

I guess you'll have to write something just using the built-in text functions in M. Or possibly use a different tool to convert it first.

Right. I'm relatively new to M, do you have suggestions for this?

M's text functions are similar to other languages, so I guess I'd start by developing the algorithm, and then reviewing the syntax of the various M text functions so you can code it.

 

Helpful resources

Announcements
PBI November 2023 Update Carousel

Power BI Monthly Update - November 2023

Check out the November 2023 Power BI update to learn about new features.

Community News

Fabric Community News unified experience

Read the latest Fabric Community announcements, including updates on Power BI, Synapse, Data Factory and Data Activator.

Power BI Fabric Summit Carousel

The largest Power BI and Fabric virtual conference

130+ sessions, 130+ speakers, Product managers, MVPs, and experts. All about Power BI and Fabric. Attend online or watch the recordings.

Top Solution Authors