Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Get certified in Microsoft Fabric—for free! For a limited time, get a free DP-600 exam voucher to use by the end of 2024. Register now

Reply
himynameisjuan
Frequent Visitor

Data Cleansing Questions

Hello. 

When using Power Query to cleanse data from a table, before normalizing it, I wonder how to do the following please:


(a) How is it possible in Power Query for a given column, to perform an action if a cell starts with a number, and a different action if a cell starts with a character? For example when adding a custom column to label the values as "number" or "text".

 

(b) Another question please: how is it possible to split a column based on a custom separator like the following?: "(Comma)(space)(integer)(dot)(space)"?
example:
1. Option A, 2. Option B, 71. Option C
would be split into:
Column1: 1. Option A
Column2: Option B
Column3: Option C

Thank you very much, kind regards

2 ACCEPTED SOLUTIONS
dufoq3
Super User
Super User

Hi @himynameisjuan, answer to question no. 1

 

Result

dufoq3_1-1710269914856.png

let
    Source = #table({"Data"}, {{1}, {"A"}, {25}, {"B"}, {"10"}}),
    v1 = Table.AddColumn(Source, "v1", each if [Data] is number then "Number" else "Text", type text),
    v2 = Table.AddColumn(v1, "v2", each if (try Number.From([Data]) otherwise [Data]) is number then "Number" else "Text", type text)
in
    v2

Note: Check this link to learn how to use my query.
Check this link if you don't know how to provide sample data.

View solution in original post

ronrsnfld
Super User
Super User

I have modified my code to

  •  Split the strings into columns
  • Adjust for any number of spaces
  • The existing code already accomodated any number of digits
let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMtRT8C8oyczPU3DUUTCCc5x0FMwRUs5KsTrRSqYIgWQdBUMTOC8CyLOA86LAiokzF6jRwADOddFRAPIRil2VYmMB", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    
//Use regex to split    
    #"Run Python script" = Python.Execute("dataset['Split'] = dataset['Column1'].str.split(r',\s+\d+\.\s+')#(lf)",[dataset=#"Changed Type"]),

//remove unneeded columns and expand the table
    #"Removed Columns" = Table.RemoveColumns(#"Run Python script",{"Name"}),
    #"Expanded Value" = Table.ExpandTableColumn(#"Removed Columns", "Value", {"Split"}),

//Replace single quote with double quote to create a valid Json
    #"Replaced Value" = Table.ReplaceValue(#"Expanded Value","'","""",Replacer.ReplaceText,{"Split"}),

//Convert the Python array to an M List
    #"To List" = Table.TransformColumns(#"Replaced Value",{"Split", Json.Document}),

//Max number of Columns
    numCols = List.Max(List.Transform(#"To List"[Split], each List.Count(_))),
    #"Extracted Values" = Table.TransformColumns(#"To List", {"Split", each Text.Combine(List.Transform(_, Text.From), ";"), type text}),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Extracted Values", "Split", 
        Splitter.SplitTextByDelimiter(";", QuoteStyle.Csv), List.Transform({1..numCols}, each "Split." & Text.From(_)))
in
    #"Split Column by Delimiter"

Source Data

ronrsnfld_0-1710381341947.png

 

Results

ronrsnfld_1-1710381393324.png

 

 

 

View solution in original post

7 REPLIES 7
ronrsnfld
Super User
Super User

I have modified my code to

  •  Split the strings into columns
  • Adjust for any number of spaces
  • The existing code already accomodated any number of digits
let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMtRT8C8oyczPU3DUUTCCc5x0FMwRUs5KsTrRSqYIgWQdBUMTOC8CyLOA86LAiokzF6jRwADOddFRAPIRil2VYmMB", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    
//Use regex to split    
    #"Run Python script" = Python.Execute("dataset['Split'] = dataset['Column1'].str.split(r',\s+\d+\.\s+')#(lf)",[dataset=#"Changed Type"]),

//remove unneeded columns and expand the table
    #"Removed Columns" = Table.RemoveColumns(#"Run Python script",{"Name"}),
    #"Expanded Value" = Table.ExpandTableColumn(#"Removed Columns", "Value", {"Split"}),

//Replace single quote with double quote to create a valid Json
    #"Replaced Value" = Table.ReplaceValue(#"Expanded Value","'","""",Replacer.ReplaceText,{"Split"}),

//Convert the Python array to an M List
    #"To List" = Table.TransformColumns(#"Replaced Value",{"Split", Json.Document}),

//Max number of Columns
    numCols = List.Max(List.Transform(#"To List"[Split], each List.Count(_))),
    #"Extracted Values" = Table.TransformColumns(#"To List", {"Split", each Text.Combine(List.Transform(_, Text.From), ";"), type text}),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Extracted Values", "Split", 
        Splitter.SplitTextByDelimiter(";", QuoteStyle.Csv), List.Transform({1..numCols}, each "Split." & Text.From(_)))
in
    #"Split Column by Delimiter"

Source Data

ronrsnfld_0-1710381341947.png

 

Results

ronrsnfld_1-1710381393324.png

 

 

 

ronrsnfld
Super User
Super User

For splitting on your complex delimiter, you can use Regular Expressions.

I've not used Python much in Power Query, so there may be more efficient methods, but this does work on your data:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMtRT8C8oyczPU3DUUTCCc5x0FMwRUs5KsTrRSqYIgWQdBUMTOC8CyLOA86KUYmMB", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    
//Use regex to split    
    #"Run Python script" = Python.Execute("dataset['Split'] = dataset['Column1'].str.split(r',\s\d+\.\s*')#(lf)",[dataset=#"Changed Type"]),

//remove unneeded columns and expand the table
    #"Removed Columns" = Table.RemoveColumns(#"Run Python script",{"Name"}),
    #"Expanded Value" = Table.ExpandTableColumn(#"Removed Columns", "Value", {"Split"}),

//Replace single quote with double quote to create a valid Json
    #"Replaced Value" = Table.ReplaceValue(#"Expanded Value","'","""",Replacer.ReplaceText,{"Split"}),

//Convert the Python array to an M List
    #"To List" = Table.TransformColumns(#"Replaced Value",{"Split", Json.Document}),

//Expand the list to new rows (could expand to columns if preferred
    #"Expanded Split" = Table.ExpandListColumn(#"To List", "Split"),
    #"Changed Type1" = Table.TransformColumnTypes(#"Expanded Split",{{"Split", type text}})
in
    #"Changed Type1"

Source

ronrsnfld_0-1710293569660.png

 

Results

ronrsnfld_1-1710293605160.png

 

 

 

himynameisjuan
Frequent Visitor

Hello, thank you very much @dufoq3 @jgeddes for the great responses. 

For (2) It would be nice not to trust in the number of spaces (one or two spaces) or the number of integers (1 or 11 or 111). As you pointed out, the ideal approach would be one that dynamically adapts regardless of the number of spaces or numbers. 

Doesn't Power Query have a means to find and separate into columns numbers from text and vice versa?

Maybe running a Python or R script for (2) would more dynamic?

 

Thank you very much, kind regards

Hi @himynameisjuan, you're welcome.

 

It is possible to do such split also in PQ, but it is useless for sample data you provided.

 

Column1 splitted:

dufoq3_0-1710587451352.png

 

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WSkxKNjQyTklNU4rViVZydHIG8lxc3ZRiYwE=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
    Ad_Separated = Table.AddColumn(Source, "Separated", each 
        [ a = Splitter.SplitTextByCharacterTransition((x)=> not List.Contains({"0".."9"}, x), {"0".."9"})([Column1]),
          b = List.Combine(List.Transform(a, Splitter.SplitTextByCharacterTransition({"0".."9"}, (x)=> not List.Contains({"0".."9"}, x)))),
          c = Text.Combine(b, "|")
        ][c], type text),
    Ad_SplitCount = Table.AddColumn(Ad_Separated, "SplitCount", each Text.Length(Text.Select([Separated], "|")) +1, Int64.Type),
    SplitColumnByDelimiter = Table.SplitColumn(Ad_SplitCount, "Separated", Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv), {"1"..Text.From(List.Max(Ad_SplitCount[SplitCount]))} ),
    RemovedColumns = Table.RemoveColumns(SplitColumnByDelimiter,{"SplitCount"})
in
    RemovedColumns

 


Note: Check this link to learn how to use my query.
Check this link if you don't know how to provide sample data.

jgeddes
Super User
Super User

For a) you can use Table.TransformColumns and check if the character in the first position is a letter or number. The code below is an example.

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTwLyjJzM9TcFSK1YlWcoJxjcBcZxg3GcxVhHFNlGJjAQ==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
    changeDataType = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    Custom1 = Table.TransformColumns(changeDataType, {{"Column1", each if Text.PositionOfAny(_, {"0".."9"}) = 0 then "Starts With A Number" else if (Text.PositionOfAny(_, {"A".."Z", "a".."z"})) = 0 then "Starts With A Letter" else "What does it start with?"}})
in
    Custom1

 

For b) you can split the column into rows by comma delimiter and then extract the text after the second space delimiter for rows that start with a space. The code below is and example.

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMtRT8C8oyczPU3DUUTCCc5x0FMwRUs5KsbEA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Changed Type", {{"Column1", Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Extracted Text After Delimiter" = Table.TransformColumns(#"Split Column by Delimiter", {{"Column1", each if Text.StartsWith(_, " ") then Text.AfterDelimiter(_, " ", {1, RelativePosition.FromStart}) else _, type text}})
in
    #"Extracted Text After Delimiter"



Did I answer your question? Mark my post as a solution!

Proud to be a Super User!





@himynameisjuan@jgeddes found great solution for 2nd question, but it is not easy to find a pattern. It will be even harder if you add 2 more value i.e.:

1. Option A, 2. Option B, 71. Option C, 100. Option D,  102. Option E

and you would like to have outcome like this:

Column1: 1. Option A
Column2: Option B
Column3: Option C

Column4: 100. Option D

Column5: Option E

 

What I'm trying to tell is that to do this there has to be such pattern. If there is a pattern, than there are usually many ways.


Note: Check this link to learn how to use my query.
Check this link if you don't know how to provide sample data.

dufoq3
Super User
Super User

Hi @himynameisjuan, answer to question no. 1

 

Result

dufoq3_1-1710269914856.png

let
    Source = #table({"Data"}, {{1}, {"A"}, {25}, {"B"}, {"10"}}),
    v1 = Table.AddColumn(Source, "v1", each if [Data] is number then "Number" else "Text", type text),
    v2 = Table.AddColumn(v1, "v2", each if (try Number.From([Data]) otherwise [Data]) is number then "Number" else "Text", type text)
in
    v2

Note: Check this link to learn how to use my query.
Check this link if you don't know how to provide sample data.

Helpful resources

Announcements
November Carousel

Fabric Community Update - November 2024

Find out what's new and trending in the Fabric Community.

Live Sessions with Fabric DB

Be one of the first to start using Fabric Databases

Starting December 3, join live sessions with database experts and the Fabric product team to learn just how easy it is to get started.

Las Vegas 2025

Join us at the Microsoft Fabric Community Conference

March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early Bird pricing ends December 9th.

Nov PBI Update Carousel

Power BI Monthly Update - November 2024

Check out the November 2024 Power BI update to learn about new features.