Solved: Re: Selectively remove duplicate rows?

Anonymous · ‎01-08-2018

In the query editor, there is a function which allows a developer to remove duplicate rows based on the selected columns.

I want to add an additional condition to this. That is, I want to choose which row to remove based on the value in an unrelated columns.

For example, in the below table, assume I want to remove the duplicates according to Columns "ID" and "Date". However, I want to prioritize the removal of rows where ID = B. I don't want the removal to be arbitrary.

ID	Type	Date
1	B	1/1/2018
1	A	1/1/2018
2	A	1/1/2018
3	A	1/1/2018
4	A	1/1/2018
4	A	1/1/2018
5	A	1/1/2018
5	B	1/1/2018

M Function "Table.Distinct" -

Table.Distinct(table as table, optional equationCriteria as any)

It seems like I should be able to work it into the "optional equation criteria", but I'm just sort of guessing and coming up short. Some sort of if statement...?

jthomson · ‎01-09-2018

Power Query, if I'm reading my own pbix right where I have a similar situation (have appended a new report to an old one, and if there's a matching case number I want to keep the data from the new report), removes duplicates from the bottom up - if you make a custom column that gives anything that's type A the value 1, and type B the value 2, then sort ascending on the new column, this should force it to remove the rows you want

View solution in original post

jthomson · ‎01-09-2018

Power Query, if I'm reading my own pbix right where I have a similar situation (have appended a new report to an old one, and if there's a matching case number I want to keep the data from the new report), removes duplicates from the bottom up - if you make a custom column that gives anything that's type A the value 1, and type B the value 2, then sort ascending on the new column, this should force it to remove the rows you want

Anonymous · ‎01-09-2018

Interesting workaround. That should work...

Another question - is there a way to view the rows that are removed? My data set is almost a million rows. Applying something like this without directly viewing what was changed is a pain.

v-huizhn-msft · ‎01-09-2018

Hi @Anonymous,

After research and test, there is a way to view the rows moved using Query Statement.

You create a copy table of your resource table, remove the duplicate rows in your original table. Then merge the original table( have removed the duplicate rows) to the copy table using inner join. Then you can filtered the merged table to get all removed rows.

For example, my Sample table is named 'Table1', I create a copy table 'Copy_Table1'. Remove the duplicate rows in Table1 based on [Area], then merge it to Copy_Table1. Filter Copy_Table1 based on [Copy_Table1.Record Number]=[Table1.Record Number], eventually delete the unnecessay columns, you will get the expected result. The following is my statement.

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("jdO7EoJADIXhV2G2piBZdoFSe8dCO4bCC63MqBS+vZBjg01OkSKTr/snfR8klOE0P+6XT1kcb+/pOj4LSWWhlTTLabeMaBjKPigho8lIyNpkTchkMhEym8yEbEyu62H6k3krW5MtITuTnS+1MikVQQV0rXSex9fWNluLTKKMRSiJjEUqcVrtV4pW4sQyilji1DKKWrZ7FLmkJSh6SefTiGDqBDOKYCoE/b2VEhS5NBIUtbR2yppFLk2MRS/NjF2CDV8=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type text) meta [Serialized.Text = true]) in type table [#"Record Number" = _t, #"Starting Date" = _t, Area = _t, Count = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Record Number", Int64.Type}, {"Starting Date", type date}, {"Area", type text}, {"Count", Int64.Type}}),
    #"Merged Queries" = Table.NestedJoin(#"Changed Type",{"Area"},Table1,{"Area"},"Table1",JoinKind.Inner),
    #"Expanded Table1" = Table.ExpandTableColumn(#"Merged Queries", "Table1", {"Record Number", "Starting Date", "Area", "Count"}, {"Table1.Record Number", "Table1.Starting Date", "Table1.Area", "Table1.Count"}),
    #"Added Custom" = Table.AddColumn(#"Expanded Table1", "Custom", each if [Record Number]=[Table1.Record Number] then"Yes" else "No"),
    #"Filtered Rows" = Table.SelectRows(#"Added Custom", each [Custom] = "No"),
    #"Removed Columns" = Table.RemoveColumns(#"Filtered Rows",{"Table1.Record Number", "Table1.Starting Date", "Table1.Area", "Table1.Count", "Custom"})
in
    #"Removed Columns"

You can download the .pbix file for more details.

Best Regards,
Angelia

Anonymous · ‎01-10-2018

Thank you @v-huizhn-msft

I believe I found a slightly simpler solution.

I took the original table (with sorting applied) and duplicated as you suggested. In the new table I removed the duplicates. Then I merged using Anti Join instead of Inner Join.

The Anti Join returns all rows from one table not found in the other table. Saved a few steps with this method.

MarcelBeug · ‎01-09-2018

Please note 1 important phenomenon (or bug if you like) when first sorting a table and then removing duplicates: you need to buffer the table after the sort, before removing duplicates, otherwise records may be removed without taking into account the sort order.

Sorted = Table.Sort(sort code)

Buffered = Table.Buffer(Sorted)

RemovedDuplicates = Table.Distinct(Buffered, equationCriteria)

or:

Sorted = Table.Buffer(Table.Sort(sort code))

RemovedDuplicates = Table.Distinct(Sorted, equationCriteria)

Parameter equationCritera may include field names and/or comparer functions.

With comparer functions you can ignore the case and/or use a culture code: e.g. "æ" and "ae" are considered equal in some cultures.

The parameter can take on various formats as illustrated in the examples below.

let
    Source = #table(type table[Text1 = text, Text2 = text],{{"a", "Encyclopaedia"},{"A","Encyclopædia"}}),

// ****   Format equationCriteria: comparer function
    // Result: both records ("a" and "A" are not equal):
    RemovedDuplicates1 = Table.Distinct(Source, Comparer.FromCulture("en-US")),

    // Result: 1 record (æ and ae are considered equal in "en-US"; true = ignore case))
    RemovedDuplicates2 = Table.Distinct(Source, Comparer.FromCulture("en-US", true)),

    // Result: 2 records (æ and ae are not considered equal in "da-DK")
    RemovedDuplicates3 = Table.Distinct(Source, Comparer.FromCulture("da-DK", true)),

    Indexed = Table.AddIndexColumn(RemovedDuplicates2, "Index", 0, 1),

// ****   Format equationCriteria: (list of) field name(s) (as generated from the user interface if columns were selected):
    RemovedDuplicates4 = Table.Distinct(Indexed, "Text1"),
    RemovedDuplicates5 = Table.Distinct(Indexed, {"Text1"}),
    RemovedDuplicates6 = Table.Distinct(Indexed, {"Text1","Text2"}),

// ****   Format equationCriteria: list of field name with comparer function:
    RemovedDuplicates7 = Table.Distinct(Indexed, {"Text1", Comparer.OrdinalIgnoreCase}),

// ****   Format equationCriteria: list of lists with field name and comparer function:
    RemovedDuplicates8 = Table.Distinct(Indexed, {{"Text1", Comparer.OrdinalIgnoreCase},{"Text2",Comparer.FromCulture("en-US")}})

in
    RemovedDuplicates8

Specializing in Power Query Formula Language (M)

Anonymous · ‎07-04-2019

The Table.Buffer was a lifesaver. I thought of sorting and then removing duplicates, but for a long time it seemed like it didn't work.

Anonymous · ‎01-10-2018

Thank you @MarcelBeug for the knowledge sharing. I'll include the step to buffer the table prior to removing duplicates.

Kudos.

edhans · ‎06-28-2018

IMHO the anti-join is the way to go here. I never like sorting and relying on some undocumented logic to always work or not change in the future.

Did I answer your question? Mark my post as a solution!
Did my answers help arrive at a solution? Give it a kudos by clicking the Thumbs Up!

DAX is for Analysis. Power Query is for Data Modeling

Proud to be a Super User!

MCSA: BI Reporting

Selectively remove duplicate rows?

Helpful resources

Fabric Community Update - July 2025

Power BI Monthly Update - July 2025

How to Get Your Question Answered Quickly

Party with Power BI’s own Guy in a Cube

Selectively remove duplicate rows?

Helpful resources

Fabric Community Update - July 2025

Power BI Monthly Update - July 2025

How to Get Your Question Answered Quickly