Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Get certified in Microsoft Fabric—for free! For a limited time, the Microsoft Fabric Community team will be offering free DP-600 exam vouchers. Prepare now

Reply
Anonymous
Not applicable

Selectively remove duplicate rows?

In the query editor, there is a function which allows a developer to remove duplicate rows based on the selected columns.

 

I want to add an additional condition to this.  That is, I want to choose which row to remove based on the value in an unrelated columns.

 

For example, in the below table, assume I want to remove the duplicates according to Columns "ID" and "Date".  However, I want to prioritize the removal of rows where ID = B.  I don't want the removal to be arbitrary.

 

IDTypeDate
1B1/1/2018
1A1/1/2018
2A1/1/2018
3A1/1/2018
4A1/1/2018
4A1/1/2018
5A1/1/2018
5B1/1/2018

 

M Function "Table.Distinct" - 

Table.Distinct(table as table, optional equationCriteria as any)


It seems like I should be able to work it into the "optional equation criteria", but I'm just sort of guessing and coming up short.  Some sort of if statement...?

1 ACCEPTED SOLUTION
jthomson
Solution Sage
Solution Sage

Power Query, if I'm reading my own pbix right where I have a similar situation (have appended a new report to an old one, and if there's a matching case number I want to keep the data from the new report), removes duplicates from the bottom up - if you make a custom column that gives anything that's type A the value 1, and type B the value 2, then sort ascending on the new column, this should force it to remove the rows you want

View solution in original post

8 REPLIES 8
jthomson
Solution Sage
Solution Sage

Power Query, if I'm reading my own pbix right where I have a similar situation (have appended a new report to an old one, and if there's a matching case number I want to keep the data from the new report), removes duplicates from the bottom up - if you make a custom column that gives anything that's type A the value 1, and type B the value 2, then sort ascending on the new column, this should force it to remove the rows you want

Anonymous
Not applicable

Interesting workaround.  That should work...

Another question - is there a way to view the rows that are removed?  My data set is almost a million rows.  Applying something like this without directly viewing what was changed is a pain.

Hi @Anonymous,

After research and test, there is a way to view the rows moved using Query Statement.

You create a copy table of your resource table, remove the duplicate rows in your original table. Then merge the original table( have removed the duplicate rows) to the copy table using inner join. Then you can filtered the merged table to get all removed rows.

For example, my Sample table is named 'Table1', I create a copy table 'Copy_Table1'. Remove the duplicate rows in Table1 based on [Area], then merge it to Copy_Table1. Filter Copy_Table1 based on [Copy_Table1.Record Number]=[Table1.Record Number], eventually delete the unnecessay columns, you will get the expected result. The following is my statement.

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("jdO7EoJADIXhV2G2piBZdoFSe8dCO4bCC63MqBS+vZBjg01OkSKTr/snfR8klOE0P+6XT1kcb+/pOj4LSWWhlTTLabeMaBjKPigho8lIyNpkTchkMhEym8yEbEyu62H6k3krW5MtITuTnS+1MikVQQV0rXSex9fWNluLTKKMRSiJjEUqcVrtV4pW4sQyilji1DKKWrZ7FLmkJSh6SefTiGDqBDOKYCoE/b2VEhS5NBIUtbR2yppFLk2MRS/NjF2CDV8=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type text) meta [Serialized.Text = true]) in type table [#"Record Number" = _t, #"Starting Date" = _t, Area = _t, Count = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Record Number", Int64.Type}, {"Starting Date", type date}, {"Area", type text}, {"Count", Int64.Type}}),
    #"Merged Queries" = Table.NestedJoin(#"Changed Type",{"Area"},Table1,{"Area"},"Table1",JoinKind.Inner),
    #"Expanded Table1" = Table.ExpandTableColumn(#"Merged Queries", "Table1", {"Record Number", "Starting Date", "Area", "Count"}, {"Table1.Record Number", "Table1.Starting Date", "Table1.Area", "Table1.Count"}),
    #"Added Custom" = Table.AddColumn(#"Expanded Table1", "Custom", each if [Record Number]=[Table1.Record Number] then"Yes" else "No"),
    #"Filtered Rows" = Table.SelectRows(#"Added Custom", each [Custom] = "No"),
    #"Removed Columns" = Table.RemoveColumns(#"Filtered Rows",{"Table1.Record Number", "Table1.Starting Date", "Table1.Area", "Table1.Count", "Custom"})
in
    #"Removed Columns"


You can download the .pbix file for more details.

Best Regards,
Angelia

Anonymous
Not applicable

Thank you @v-huizhn-msft

 

I believe I found a slightly simpler solution.

I took the original table (with sorting applied) and duplicated as you suggested.  In the new table I removed the duplicates.  Then I merged using Anti Join instead of Inner Join.

 

The Anti Join returns all rows from one table not found in the other table.  Saved a few steps with this method.

Please note 1 important phenomenon (or bug if you like) when first sorting a table and then removing duplicates: you need to buffer the table after the sort, before removing duplicates, otherwise records may be removed without taking into account the sort order.

 

Sorted = Table.Sort(sort code)

Buffered = Table.Buffer(Sorted)

RemovedDuplicates = Table.Distinct(Buffered, equationCriteria)

 

or:

 

Sorted = Table.Buffer(Table.Sort(sort code))

RemovedDuplicates = Table.Distinct(Sorted, equationCriteria)

 

Parameter equationCritera may include field names and/or comparer functions.

With comparer functions you can ignore the case and/or use a culture code: e.g. "æ" and "ae" are considered equal in some cultures.

The parameter can take on various formats as illustrated in the examples below.

 

let
    Source = #table(type table[Text1 = text, Text2 = text],{{"a", "Encyclopaedia"},{"A","Encyclopædia"}}),

// ****   Format equationCriteria: comparer function
    // Result: both records ("a" and "A" are not equal):
    RemovedDuplicates1 = Table.Distinct(Source, Comparer.FromCulture("en-US")),

    // Result: 1 record (æ and ae are considered equal in "en-US"; true = ignore case))
    RemovedDuplicates2 = Table.Distinct(Source, Comparer.FromCulture("en-US", true)),

    // Result: 2 records (æ and ae are not considered equal in "da-DK")
    RemovedDuplicates3 = Table.Distinct(Source, Comparer.FromCulture("da-DK", true)),

    Indexed = Table.AddIndexColumn(RemovedDuplicates2, "Index", 0, 1),

// ****   Format equationCriteria: (list of) field name(s) (as generated from the user interface if columns were selected):
    RemovedDuplicates4 = Table.Distinct(Indexed, "Text1"),
    RemovedDuplicates5 = Table.Distinct(Indexed, {"Text1"}),
    RemovedDuplicates6 = Table.Distinct(Indexed, {"Text1","Text2"}),

// ****   Format equationCriteria: list of field name with comparer function:
    RemovedDuplicates7 = Table.Distinct(Indexed, {"Text1", Comparer.OrdinalIgnoreCase}),

// ****   Format equationCriteria: list of lists with field name and comparer function:
    RemovedDuplicates8 = Table.Distinct(Indexed, {{"Text1", Comparer.OrdinalIgnoreCase},{"Text2",Comparer.FromCulture("en-US")}})

in
    RemovedDuplicates8

 

Specializing in Power Query Formula Language (M)
Anonymous
Not applicable

The Table.Buffer was a lifesaver. I thought of sorting and then removing duplicates, but for a long time it seemed like it didn't work.

Anonymous
Not applicable

Thank you @MarcelBeug for the knowledge sharing.  I'll include the step to buffer the table prior to removing duplicates.

 

Kudos.

 

IMHO the anti-join is the way to go here. I never like sorting and relying on some undocumented logic to always work or not change in the future.



Did I answer your question? Mark my post as a solution!
Did my answers help arrive at a solution? Give it a kudos by clicking the Thumbs Up!

DAX is for Analysis. Power Query is for Data Modeling


Proud to be a Super User!

MCSA: BI Reporting

Helpful resources

Announcements
OCT PBI Update Carousel

Power BI Monthly Update - October 2024

Check out the October 2024 Power BI update to learn about new features.

September Hackathon Carousel

Microsoft Fabric & AI Learning Hackathon

Learn from experts, get hands-on experience, and win awesome prizes.

October NL Carousel

Fabric Community Update - October 2024

Find out what's new and trending in the Fabric Community.