The ultimate Fabric, Power BI, SQL, and AI community-led learning event. Save €200 with code FABCOMM.
Get registeredCompete to become Power BI Data Viz World Champion! First round ends August 18th. Get started.
Hello, experienced Power Query people!
I have a huge set of data in which there are 3 columns of data that should be standardized (and then a bunch of other columns). I also have a table that has the standard values for those 3 columns aligned correctly. I am using PQ to give me a set of data that shows where my huge set of data has either non-standard values for those columns or where the value in one of the columns isn't what it is supposed to be based on the other columns.
Acceptable values table example:
Thing Type Price
Kiwi Fruit $3
Cuke Veg $2
But sometimes in my huge data set someone made a mistake and either put the wrong type or price (or both!) with the wrong thing or put in a thing, type, or price (or any combination) that isn't in the list.
Data set table example:
Thing Type Price Buyer
Kiwi Fruit $3 Xi
Cuke Veg $2 Juan
Kiwy Fruit $3 Mary
Cuke Veg $3 Shanti
Mismatched data table example:
Thing Type Price Buyer
Kiwy Fruit $3 Mary
Cuke Veg $3 Shanti
I'm able to get the mismatched data table via PQ but I want to somehow highlight which cell is the one that is the mismatch for each row. So "Kiwy" because it is not in the list of accepted things and the second "$3" because it doesn't match up with what that value should be for a Cuke.
What I would really love is to have a way to get PQ to tell me which values didn't match and thus caused the row to appear in the merge. Is this possible?
If not...I tried making a query that just finds all the "Things" in the data that aren't in the list of accepted "Thing" values and then merging that with the query that pulled in ALL the mismatches so that I could get a column that just listed the "Thing" when it was wrong (so I could do conditional formatting in Excel based on values existing in that column).
Mismatched data table with extra column example:
Thing Type Price Buyer WrongThing
Kiwy Fruit $3 Mary Kiwy
Cuke Veg $3 Shanti
I tried an outer join where I kept all values from the mismatch table and only the matching wrong_thing_names table then expanded only the thing column in the wrong_thing_names table. There are NO duplicate rows in either table. And yet, when I do the merge I get a huge number of duplicated rows. Then I tried all the other merge options and none of them kept the 3,111 rows I had in the mismatch table. I have read lots of articles and watched a few videos where people say "why do I have duplicate rows in my merged data?" and the answer is always "there are duplicate rows in one of your tables" but I have checked that and there really aren't. What else could be going wrong???
Thank you, kind people!
Solved! Go to Solution.
Sure. I added comments to the code. The syntax of the M language is very concise, making it cryptic at times, but I hope this helps. There a re many good resources on Youtube on PowerQuery and the M language if you want to get a better understanding of M. M is awesomely powerful and if you have complex data cleaning jobs regularly, it's worth it!
let
Source = #"Data Values", // Getting the source data from the workbook
#"Added Is Type Correct" = // adding a column with a true value if the Type is correct and otherwise false
Table.AddColumn(Source, "Is Type Correct", // The Table.AddColumn starts with an input table and the name of the new column. It will loop through all records and call the function you supply in the next parameter.
each /* `each` is a shorthand for a function definition. it is equivalent to `(_) +>`,
The function will be called by Table.AddColumn with 1 parameter: The current record.
So, within the function you can refer to _ if you need to refer to the current record.
*/
List.Contains( // The List.Contains will do the actual checking
#"Acceptable Values"[Type]
/* This refers to the column [Type] in the table #"Acceptable Values" from the spreadsheet.
This is a list of all the values in the colum */
, [Type] /* And this is short for _[Type] and refers to the field [Type] in the xurrent record.
List.Contains will see if this value is in the list.*/
) // ending List.Contains
), // Table.AddColumn
#"Added Is Thing Correct" = Table.AddColumn(#"Added Is Type Correct", "Is Thing Correct", // adding another column
each List.Contains( // again, a function `(_) =>`
Table.SelectRows( /* function to get rows from a table. This function is also used when you filter using the UI
we want a list of Things matching the Type in the row we are adding the column to */
#"Acceptable Values" // The table to select from: Is coming from the excelsheet heet
, (acceptable) => /* SelectRows also needs a function with 1 paremeter: the current row
We can't use `each` here because it would conlict with the `each` in the AddColumn
I named the parameter "acceptable" */
acceptable[Type] = [Type] // select the rows from Accepatble values where the Type = the Type in the current row
) // end select rows
[Thing] // From the return selection of rows, get the column Thing as a list
, [Thing] // See if [Thing] in the current row exsts in the accepatble [Thing]s
) // End list contains
),
#"Added Is Price Correct" = // similar to the previous colum, only atching 2 columns
Table.AddColumn(#"Added Is Thing Correct", "Is Price Correct",
each List.Contains(Table.SelectRows(#"Acceptable Values",
(acceptable) =>
[Type] = acceptable[Type]
and [Thing] = acceptable[Thing]
)[Price]
, [Price]
)
),
#"Added Is Correct" = // New column returning true when all 3 new columns are true
Table.AddColumn(#"Added Is Price Correct", "Is Correct", each [Is Type Correct] and [Is Thing Correct] and [Is Price Correct]),
#"Filtered Only Errors" = // Only return the rows in the #"Data Values" table with errors
Table.SelectRows(#"Added Is Correct", each ([Is Correct] = false))
in
#"Filtered Only Errors"
Sure. I added comments to the code. The syntax of the M language is very concise, making it cryptic at times, but I hope this helps. There a re many good resources on Youtube on PowerQuery and the M language if you want to get a better understanding of M. M is awesomely powerful and if you have complex data cleaning jobs regularly, it's worth it!
let
Source = #"Data Values", // Getting the source data from the workbook
#"Added Is Type Correct" = // adding a column with a true value if the Type is correct and otherwise false
Table.AddColumn(Source, "Is Type Correct", // The Table.AddColumn starts with an input table and the name of the new column. It will loop through all records and call the function you supply in the next parameter.
each /* `each` is a shorthand for a function definition. it is equivalent to `(_) +>`,
The function will be called by Table.AddColumn with 1 parameter: The current record.
So, within the function you can refer to _ if you need to refer to the current record.
*/
List.Contains( // The List.Contains will do the actual checking
#"Acceptable Values"[Type]
/* This refers to the column [Type] in the table #"Acceptable Values" from the spreadsheet.
This is a list of all the values in the colum */
, [Type] /* And this is short for _[Type] and refers to the field [Type] in the xurrent record.
List.Contains will see if this value is in the list.*/
) // ending List.Contains
), // Table.AddColumn
#"Added Is Thing Correct" = Table.AddColumn(#"Added Is Type Correct", "Is Thing Correct", // adding another column
each List.Contains( // again, a function `(_) =>`
Table.SelectRows( /* function to get rows from a table. This function is also used when you filter using the UI
we want a list of Things matching the Type in the row we are adding the column to */
#"Acceptable Values" // The table to select from: Is coming from the excelsheet heet
, (acceptable) => /* SelectRows also needs a function with 1 paremeter: the current row
We can't use `each` here because it would conlict with the `each` in the AddColumn
I named the parameter "acceptable" */
acceptable[Type] = [Type] // select the rows from Accepatble values where the Type = the Type in the current row
) // end select rows
[Thing] // From the return selection of rows, get the column Thing as a list
, [Thing] // See if [Thing] in the current row exsts in the accepatble [Thing]s
) // End list contains
),
#"Added Is Price Correct" = // similar to the previous colum, only atching 2 columns
Table.AddColumn(#"Added Is Thing Correct", "Is Price Correct",
each List.Contains(Table.SelectRows(#"Acceptable Values",
(acceptable) =>
[Type] = acceptable[Type]
and [Thing] = acceptable[Thing]
)[Price]
, [Price]
)
),
#"Added Is Correct" = // New column returning true when all 3 new columns are true
Table.AddColumn(#"Added Is Price Correct", "Is Correct", each [Is Type Correct] and [Is Thing Correct] and [Is Price Correct]),
#"Filtered Only Errors" = // Only return the rows in the #"Data Values" table with errors
Table.SelectRows(#"Added Is Correct", each ([Is Correct] = false))
in
#"Filtered Only Errors"
Thank you, this is fantastic! I was very close to following it, as it turns out, but there were some key things that I didn't quite get. I very much appreciate your time and effort to explain. It is indeed looking like I'll have to do some more formal learning of M query. Up til now I've been able to piece together my complicated processing by just looking at other code examples and following them from basic programming knowledge but as you say, M seems to be particularly cryptic. 🙂
You're welcome. Have fun with M. It's a steep learning curve, but once you reach the top, the view is magnificant 😊
Here my solutions
let
Source = #"Data Values",
#"Added Is Type Correct" = Table.AddColumn(Source, "Is Type Correct", each List.Contains( #"Acceptable Values"[Type], [Type])),
#"Added Is Thing Correct" = Table.AddColumn(#"Added Is Type Correct", "Is Thing Correct", each List.Contains(Table.SelectRows(#"Acceptable Values", (acceptable) => [Type] = acceptable[Type])[Thing], [Thing])),
#"Added Is Price Correct" = Table.AddColumn(#"Added Is Thing Correct", "Is Price Correct", each List.Contains(Table.SelectRows(#"Acceptable Values",
(acceptable) =>
[Type] = acceptable[Type]
and [Thing] = acceptable[Thing]
)[Price]
, [Price])),
#"Added Is Correct" = Table.AddColumn(#"Added Is Price Correct", "Is Correct", each [Is Type Correct] and [Is Thing Correct] and [Is Price Correct]),
#"Filtered Only Errors" = Table.SelectRows(#"Added Is Correct", each ([Is Correct] = false))
in
#"Filtered Only Errors"
Thank you for this! Unfortunately, I am not able to follow the syntax well enough to tweak it for what I need. Can you explain the syntax of the comparing part?
I'm thinking, based on what folks have pointed out, that I need 5 check columns:
Is that right?
I think the step in your example for "Added Is Thing Correct" (and then the definition of the function (?) "acceptable") is the part that is checking the pairings but I can't follow the syntax. Would it be possible to explain what all the parts of that are doing?
I very much appreciate it!
I can't send my actual data set because it's huge and also has info I can't.
If you have a mismatch there wil definitly be multiple rows in the acceptible values table that are NOT matched.
So how do you pick the right one?
If you assume (wild assumption here!) that only 1 field is wrong, you can find out which one is wrong by looking for the ones that are right. But beware: There may be multiple "acceptable values" for any combination of 2 fields filled correctly...
For example if I have additional acceptable value:
Thing Type Price
Apple Fruit $3
In that case
Kiwy Fruit $3 Mary
Will not match 2 acceptible values rows.
And if you have a data row as follows??
Kiwy Fruit $5 Mary
So a merge join will not work...
If you can elaborate on exactly what you are looking for in these situations, I may be of help...
What should be unique in the acceptable values?
I guess
In that case