Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn the coveted Fabric Analytics Engineer certification. 100% off your exam for a limited time only!

Reply
BrandonH
Frequent Visitor

Clean Data - Multiple Criteria - Cleaning Strings of Text

Hello Bi Community, 

 

I know what I am trying to accomplish is possible and has been done before. However, I am unable to find an answer specific to my solution requirements. 

 

My data set deals with regulation language. Language regularly looks like:

  • (a) An apple a day keeps the doctor away.
  • (b)(1) Ignorance is bliss.
  • (c)(2)(A) It takes two to tango
  • (d) Please see this list of requirements:
  • (d)(1) It takes two to tango
  • (d)(2) When in doubt pinky out
  • (e) In addition to (d)(1) and (d)(2) please remember to keep hands and feet inside the vehicle at all times.
  • Have a great day.
  • (a)(1)(C) Why does this language have to nest this way.

 

Finding a way to reliably clean the start of the regulation language is important. I would also like to not maintain a transformation table as new reguations are added. I am working within Power Query and Excel at the moment. 

 

I think a situation where I look at the beginning of the regulation text and split by a delimiter or replace with nothing, if any of the conditions are met will be the best solution. I just cannot get the formula working. I go over some of the hurdles below of cleaning the data. Some delimiters I have in mind would be ") " [closing parenthesis and a space], ". " [period and a space]. I have the idea to even count the number of opening and closing parenthesis within the opening 15 characters of the text and if they both total 2 then to remove the part of the text string that is within and apart of the parenthesis.

 

These are ok ideas until all of the one off situations I list below...

 

There are also variations of the "subsection identifier" [(a), (b)(1), (c)(2)(A), etc.]

  • Sometimes there are Roman Numerals used in place of letter number idenifiers
    • (i), (ii), (iii) ... (ix), (x)
  • Sometimes there are numbers with a period used in place of the identifiers
    • 1. or 2. or 3. instead of (1), (2), (3). From "1." to 99."
  • Sometimes there are letters and a period used in splace of the identifiers:
    • a. or b. or c. instead of (a), (b) or (c) from A to Z
  • Sometimes there are upper case or lower case identifiers:
    • (a) or (A)
  • Sometimes there are spaces within the identifier:
    • (a) vs. (a ) [this is a data entry issue]
  • Sometimes there are spaces between identifiers:
    • (b)(1) vs. (b) (1)
  • Sometimes the regulation text has a space between subsection identifiers and the remaining text, sometimes not:
    • (a) An apple a day keeps the doctor away. vs. (a)An apple a day keeps the doctor away.
  • Sometimes a regulation text contains an subsection identifier later in the text. No cleaning would happen here:
    • (e) in addition to (d) please remember to keep hands and feet inside the vehicle at all times.
    • After being cleaned the language would say:  in addition to (d) please remember to keep hands and feet inside the vehicle at all times.
  • Some regulation text does not start with an identifier, no cleaning would occur here:
    • Have a great day.

As I said be before this problem is not unique to me, has probably been addressed before. However, where I understand the idea of the formulas I lack the ability to write them. 

 

I appreciate your time you took in reading this. 

 

1 ACCEPTED SOLUTION
ronrsnfld
Super User
Super User

One way to do this is with Regular Expressions. In Power BI Desktop (or Excel) you can write a custom function that uses Javascript.

 

 

// regexReplace

(text as nullable text,pattern as nullable text,replace as nullable text, optional flags as nullable text) =>
    let
        f=if flags = null or flags ="" then "" else flags,
        l1 = List.Transform({text, pattern, replace}, each Text.Replace(_, "\", "\")),
        l2 = List.Transform(l1, each Text.Replace(_, "'", "\'")),
        t = Text.Format("<script>var txt='#{0}';document.write(txt.replace(new RegExp('#{1}','#{3}'),'#{2}'));</script>", List.Combine({l2,{f}})),
        r=Web.Page(t)[Data]{0}[Children]{0}[Children],
        Output=if List.Count(r)>1 then r{1}[Text]{0} else ""
    in Output

 

 

And then use it in your main query like this:

 

let

//Change next line to reflect actual data source
    Source = Excel.CurrentWorkbook(){[Name="Table16"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),

//Javascript regular expression code
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Cleaned", each fnRegexReplace([Column1], 
    "^(?:(?:\\(([^)]*\\))\\s*)|([a-z0-9](?:[0-9])?[.]\\s*)\\s*)*", "", "img"), type text)
in
    #"Added Custom"

 

 

ronrsnfld_0-1685882202695.png

 

You could also use Python and/or R for your Regex if you are using Power BI Service. Just need to check to make sure your package is supported there.

 

View solution in original post

14 REPLIES 14
ronrsnfld
Super User
Super User

One way to do this is with Regular Expressions. In Power BI Desktop (or Excel) you can write a custom function that uses Javascript.

 

 

// regexReplace

(text as nullable text,pattern as nullable text,replace as nullable text, optional flags as nullable text) =>
    let
        f=if flags = null or flags ="" then "" else flags,
        l1 = List.Transform({text, pattern, replace}, each Text.Replace(_, "\", "\")),
        l2 = List.Transform(l1, each Text.Replace(_, "'", "\'")),
        t = Text.Format("<script>var txt='#{0}';document.write(txt.replace(new RegExp('#{1}','#{3}'),'#{2}'));</script>", List.Combine({l2,{f}})),
        r=Web.Page(t)[Data]{0}[Children]{0}[Children],
        Output=if List.Count(r)>1 then r{1}[Text]{0} else ""
    in Output

 

 

And then use it in your main query like this:

 

let

//Change next line to reflect actual data source
    Source = Excel.CurrentWorkbook(){[Name="Table16"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),

//Javascript regular expression code
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Cleaned", each fnRegexReplace([Column1], 
    "^(?:(?:\\(([^)]*\\))\\s*)|([a-z0-9](?:[0-9])?[.]\\s*)\\s*)*", "", "img"), type text)
in
    #"Added Custom"

 

 

ronrsnfld_0-1685882202695.png

 

You could also use Python and/or R for your Regex if you are using Power BI Service. Just need to check to make sure your package is supported there.

 

Hi there, finally had some time to try your function. I am unable to get it working. It is admittedly probably my lack of understadning and not having used custom fuctions much. 

 

I started with a table to mirror your screen shot above. To make sure I understood what was happening before moving to my working datasets. The function is not cleaning the same way. Would you mind giving me some more granular steps. 

Here is a picture of my output.

Capture.PNG

 

It appears you renamed the function "Cleaned" instead of "fnRegexReplace". Although allowable, I'd think that a bit confusing since that is also the name of the column being added.

 

I also see that you did not copy the code as posted. In particular, you did not include the "Pattern" argument in your function call from your main code block.  You have it as "null". If you examine the code I posted, you will see the Regular Expression pattern. If you omit it, the function will not do anything.

Hi there ronrsnfld, 

 

I have been able to test the regex clean up function on some test data as well as my working dataset. I have looked through approximately 960K rows of cleaned text and see one instance where the regex clean up is removing extra characters. 

In cases where language is something like this:

  • (b) A current edition of a dictionary is preferred.

The cleaned text is:

  • current edition of a dictionary is preferred.

Any language that starts with the "word" 'A' is being treated as a single letter and removed. The words "An", "As" etc aren't being removed.

 

This happening is effecting 300 or so records out of 960K records. 

Is there an update that can be made to the regex cleaning function? I am also looking at creating a custom cleaning rule that says if the language starts with an lower case letter then add "A " to the beginning of the language. 

Once you have a chance to weigh in on this I will be accepting your solution. 🙂 apprecaite it so much! Guess I will need to learn about custom functions more and regex!

 

Minor change in the regex. I have edited my answer to correct that in situ. The corrected regex is:

 

"^(?:(?:\\(([^)]*\\))\\s*)|([a-z0-9](?:[0-9])?[.]\\s*)\\s*)*"

 

 

Hi there, 

 

In addition to your accepted solution. I imported some new data and there are 2 stragglers that arent being cleaned. I have been trying to learn some RegEx in the interim and cannot get the groupings to properly capture the new issues.

 

I have strings of text that are preceeded by a 'tab' the string looks like this:

  • "      (a) You will be home no later than 10PM"

Secondly, I have an issue that I posted about here. I am using your function and RegEx expression to then clean a second column. In this column there is a string of text that looks like this:

  • "Why is this happening."

When in actuality posting the cell text into Notepad it looks like this:  

  • """Why is this happening."""

I created a new post as it is an issue running in parallel to my original ask here. 

 

Any input is appreciated. Take care. 


[When posting this reply and my new post I received the following error:
Your post has been changed because invalid HTML was found in the message body. The invalid HTML has been removed. Please review the message and submit the message when you are satisfied. I am wondering if there is some hidden HTML issue going on? I am out of my depth.]

With regard to your first question, following the "beginning of string character: ^ ", insert an optional group for either a whitespace character or a double quote.  So in the regex you would replace ^ with something like ^(?:[\s"])* or, in javascript,^(?:[\\s"])*

 

The question about seeing single set of double quotes in one program and the doubled sets in another has to do with how different programs interpret text input. What you see in Notepad++ is what is actually stored.

 

The "invalid html" message is specific to this forum and it's posting rules. Perhaps if you post the relevant lines as code you may be able to bypass it, but I've not played around with that "feature" enough to be certain. 

using custom functions per item (instead of for the entire table) is a great idea and makes the code a bit simpler.  

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("fZA9b8MwDET/yiGTBRQF2rFbtnbr1iHIQFsXW7AtORKdwP++9EfXAtr49I68y+VUicM5QqZpIAReFvTkVKAd4VOjKUOesryeri9G1656c/hqY8oSGyIU1EMo5Rg3rnp31dkIhUpP0zwT1J7ENu2Md/geKIUopMWYwgyKdEPmfQ6ZI6OWjz96C/xPZ5H46RgRom0814opxH5BmnVHaAI70fugIcX1/6GV6HEIpn2lNXysmVdo7QGdMWUDb6RaRAmeWzkPdqFZS1PIMEDDyKOGT3msXbaZNvNbd9df", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
    Cleaned = (t)=>
        let
            s = Text.Split(t,")"),
            s2 = List.ReplaceValue(s,null, null, (current,test,replacement)=> if Text.At(current,0)="(" then Text.Lower(current) & ")" else current & ")"),
            s3 = Text.Combine(s2,""),
            s4 = Text.TrimEnd(Text.Range(s3,Text.PositionOfAny(s3,{"A".."Z"})),")")
        in
        s4,
    #"Added Custom1" = Table.ReplaceValue(Source, each [Column1], each Cleaned([Column1]),Replacer.ReplaceValue,{"Column1"})
in
    #"Added Custom1"

I appreciate your response. From the screen shot your solution looks promising. I will work some this morning to see if I can get that custom regexReplace function piped into my query. I'll hopefully be back to accept the solution. Fingers crossed. 

Here's another version

 

 

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("fZA9b8MwDET/yiGTBRQF2rFbtnbr1iHIQFsXW7AtORKdwP++9EfXAtr49I68y+VUicM5QqZpIAReFvTkVKAd4VOjKUOesryeri9G1656c/hqY8oSGyIU1EMo5Rg3rnp31dkIhUpP0zwT1J7ENu2Md/geKIUopMWYwgyKdEPmfQ6ZI6OWjz96C/xPZ5H46RgRom0814opxH5BmnVHaAI70fugIcX1/6GV6HEIpn2lNXysmVdo7QGdMWUDb6RaRAmeWzkPdqFZS1PIMEDDyKOGT3msXbaZNvNbd9df", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
    #"Split Column by Delimiter" = Table.SplitColumn(Source, "Column1", Splitter.SplitTextByDelimiter(")", QuoteStyle.Csv), {"A".."H"}),
    #"Replaced Value1" = Table.ReplaceValue(#"Split Column by Delimiter",null, null, (current,test,replacement)=> if Text.At(current,0)="(" then Text.Lower(current) & ")" else current & ")",Table.ColumnNames(#"Split Column by Delimiter")),
    Recombined = Table.CombineColumns(#"Replaced Value1",Table.ColumnNames(#"Split Column by Delimiter"),Combiner.CombineTextByDelimiter(""),"Column1"),
    #"Added Custom" = Table.AddColumn(Recombined, "Custom", each Text.TrimEnd(Text.Range([Column1],Text.PositionOfAny([Column1],{"A".."Z"})),")"))
in
    #"Added Custom"

 

 

lbendlin
Super User
Super User

Why does

 

 

 

(e) in addition to (d)(1) and (d)(2) please remember to keep hands and feet inside the vehicle at all times.

 

not start with a capital "I" after the (e)  ? And what is your expected outcome for this string?

Hi, thanks for checking in. This was a typo on my part as I was creating examples. I will update the OP. 

 

The intended cleaned language here would be:

  • In addition to (d)(1) and (d)(2) please remember to keep hands and feet inside the vehicle at all times. 

In that case you can use the first occurrence of a capital letter as the starting mark for your cleaned string.

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("fZA9b8MwDET/yiGTBRQF2rFbtnbr1iHIQFsXW7AtORKdwP++9EfXAtr49I68y+VUicM5QqZpIAReFvTkVKAd4VOjKUOesryeri9G1656c/hqY8oSGyIU1EMo5Rg3rnp31dkIhUpP0zwT1J7ENu2Md/geKIUopMWYwgyKdEPmfQ6ZI6OWjz96C/xPZ5H46RgRom0814opxH5BmnVHaAI70fugIcX1/6GV6HEIpn2lNXysmVdo7QGdMWUDb6RaRAmeWzkPdqFZS1PIMEDDyKOGT3msXbaZNvNbd9df", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Custom", each Text.Range([Column1],Text.PositionOfAny([Column1],{"A".."Z"})))
in
    #"Added Custom"

 

That only leaves one scenario (uppercase identifier) unaccounted for.  For that you could add a step to force all identifiers to lowercase first.

There are times where there is langauge that contains a capital letter withen parenthesis. I didnt give a specific example of this so my mention of this situation is somewhat buried in the post. 

 

There are times where we have language that looks like this:

  • (a)(1)(C) Why does this language have to nest this way.

I will look further into splitting by capital letters. That is a great point. One part of the formatting that remains consistent is the upper case letter at the beginning of the language. Thanks.

Helpful resources

Announcements
April AMA free

Microsoft Fabric AMA Livestream

Join us Tuesday, April 09, 9:00 – 10:00 AM PST for a live, expert-led Q&A session on all things Microsoft Fabric!

March Fabric Community Update

Fabric Community Update - March 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors
Top Kudoed Authors