Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Don't miss out! 2025 Microsoft Fabric Community Conference, March 31 - April 2, Las Vegas, Nevada. Use code MSCUST for a $150 discount. Prices go up February 11th. Register now.

ibarrau

[PowerQuery] Transform column with custom condition

Before we begin, I’d like to clarify that this article will showcase advanced Power Query techniques to be used as ETL in response to data processing needs. This doesn’t negate the fact that there are best practices. Nothing surpasses processing data in a single true source like a warehouse or lakehouse, which would be the best practice. 

 

How many times have you had to create a custom column in Power Query because there wasn’t a way in the interface to replace values with a simple if statement?

 

The question may seem a bit long, but it certainly happens that sometimes we need to clean a dataset with a numerically incorrect column, and we end up performing many “Replace Values” operations. The fact is, replacing values only changes one text string for another. This works fine for small things like traditional typing errors. However, with a more complicated condition—say, if we want a specific text to appear after finding a certain string without replacing it—it becomes more complex. For example, if we find the text “hombr,” it should be replaced with “Varón.” If it appears as homb, hombre, hombrrrre, hombre pues, hombrecito, hombreton, or something similar, it would be recognized as hombre and we’d replace it with “Varón.”

We will look at three examples of replacement and cleaning.

 

Suppose we sent a salary survey to a group of remote workers. We have a table with salaries and a description that not everyone fills out regarding payment in dollars.

ibarrau_0-1724184911979.png

The first thing we might think of is adding another cleaner column, but we could clean this one instead. What if we replace “USD” when the word “dólares” is found, and “ARS” when it is not?

Let’s look at the process. When we want to replace values based on a condition in a column, we need to use the Table.ReplaceValue function. Let’s review the theory:

Table.ReplaceValue(table as table, oldValue as any, newValue as any, replacer as function, columnsToSearch as list) as table

 

This function allows us to replace an oldValue string (in this case, the same value from the column because we want to replace each occurrence regardless of its value) with a newValue (the result of a condition we set) in a classic replacement condition Replacer, and the columnsToSearch in which it will search for the oldValue to change it to the newValue. Based on the previous table, it would look something like this: 

 

= Table.ReplaceValue(
    #"Paso Anterior",
    each [#"Pagos en dólares"] ,
    each if Text.Contains([#"Pagos en dólares"], "dólares")
        then "ARS"
        else "USD",
    Replacer.ReplaceValue,{"Pagos en dólares"}
) 

 

In this way, with each occurrence of the column's own value, it performs a replacement based on the chosen condition. This leaves us with a poor but initial result:

ibarrau_1-1724185051009.png

The logic was fulfilled, and we learned how to perform a replacement in the column. However, the cleaning was somewhat poor and did not account for the scenarios correctly. Remember that we have many nulls, and there are also cases where it says “Part of the salary in dollars,” which would not be covered by ARS and USD.

To improve our cleaning of the column, we will implement a condition between the two columns. From living in this country, I know that there is no full-time remote job (FullTime) with a salary less than 15,000 ARS in the industry, and it would also be difficult for someone to have a monthly salary of 15,000 USD. Therefore, I will use this knowledge to clean with a numeric condition as the first choice and then check for the word “part” when the salary is given in both currencies.

Let’s look at the case:

 

= Table.ReplaceValue(
    #"Renamed Columns",
    each [#"Pagos en dólares"],
    each if [Salario]< 15000
        then "USD"
        else if Text.Contains([#"Pagos en dólares"], "parte") and [#"Pagos en dólares"] <> null  
            then "Híbrido"
            else "ARS",
    Replacer.ReplaceValue,
    {"Pagos en dólares"}
) 

 

We create the first condition as a filter: if the number is less than 15,000, then it should be USD. For the second condition in the if statement, we will look for the word “part” and also ensure it’s not null because otherwise, Text.Contains ignores nulls, and those rows would remain null even though there is an else condition.

In this way, the replacement will be more complete, leaving our column with all possible options:

ibarrau_2-1724185147548.png

 

Write an "IN" SQL condition with Power Query

The previous cases are quite classic, but what happens when we have something more complex? When we need a series of massive replacements similar to SQL’s “IN” clause. For example, let’s look at the following image and say we need to replace all occurrences of text strings that refer to Hombre, Varón, Macho, and Masculino. Something like hombr, var, mach, masc. If we find anything that matches this, then we should replace it with “Varón.”

ibarrau_3-1724185219042.png

The condition similar to SQL’s IN is constructed with List.Contains. Let’s review the theory:

List.Contains(list as list, value as any, optional equationCriteria as any) as logical

 

The function requires a list of values to compare against and the value to be checked. In other words, if any string in the list matches the value, then it returns true.

To perform this complex task, we need two operations. On one hand, we need to build the list of values to replace in each value of our column since mass comparators do exact value comparisons. In other words, the replacement executed would look like:

List.Contains({”hombre”, “Hombre”, “Hombre.”, “Macho lomo plateado”, “Masculino”, “[entre otros….]”} , [#”Me identifico (Género)”])

 

We need to construct this list in such a way that for each match, we can replace it with “Varón.”

In our query editor, we will create a variable. A variable is simply another step that is not related to the “previous step” and exists in our script for us to use. The creation of the list would involve filtering the table for unique values when the text contains the desired terms and converting it to a list. Let’s see:

 

Lista_de_varones = Table.ToList(
    Table.SelectRows(
       Table.Distinct(#"Paso Origen"[[#"Me identifico (género)"]]),
       each (Text.Contains(Text.Lower([#"Me identifico (género)"]), "hombr")
            or Text.Contains(Text.Lower([#"Me identifico (género)"]), "var")
            or Text.Contains(Text.Lower([#"Me identifico (género)"]), "mach")
            or Text.Contains(Text.Lower([#"Me identifico (género)"]), "masc")
           ) and (not Text.Contains(Text.Lower([#"Me identifico (género)"]), "trans")
        )
    )
) 

 

Note that although it references a previous step, we won’t use it in the following step. We filter the table with a single distinctive gender column by keeping rows that contain the previously agreed-upon terms “hombr,” “var,” “mach,” “masc” and adding that it should not contain “trans” since that would be another gender. This gives us a list with all the results for our SQL IN clause. There are more than 50 results.

ibarrau_4-1724185439618.png

With that variable in the list, which we will call the replacement step that we were building, to apply it to our column:

 

= Table.ReplaceValue(
    #"Paso Origen",
    each [#"Me identifico (género)"],
    each if List.Contains( Lista_de_varones, [#"Me identifico (género)"] )
    then "Varón"
    else [#"Me identifico (género)"],
    Replacer.ReplaceText,
    {"Me identifico (género)"}
)

Notice how the list has been shortened. We went from almost 150 distinct gender values to about 80.

ibarrau_5-1724185482030.png

In this way, we could repeat the process for appearances related to Woman, Non-Binary, Trans Woman, Trans Man, and leave the rest as Other.

Before concluding, I’d like to emphasize that this is a very expensive process. Scanning the dataset to obtain the list that we then use for replacement can take a long time if the list takes a while to generate. The replacement is fast, but the list is not. Therefore, I strongly recommend hardcoding the list if the data source is a closed survey like this one. Perform the execution to identify the values, and once you know them, record them in another source:

ibarrau_6-1724185501109.png

NOTE: You can copy to Notepad++ and create a macro to add commas and quotation marks in less than a minute.

Now we have reached the end of the post, and we’ve learned how to replace values in a column based on custom conditions from any other column in the same row, including multiple replacements of text fragments. I hope this helps you clean up those dirty data sets that come our way.