In the Query Editor, there is a function that allows a developer to remove duplicate rows based on the selected columns.
I want to add an additional condition to this. That is, I want to choose which row to remove based on the value in an unrelated column.
For example, in the table below, assume I want to remove the duplicates according to columns "ID" and "Date". However, I want to prioritize the removal of rows where Type = B. I don't want the removal to be arbitrary.
ID | Type | Date |
---|---|---|
1 | B | 1/1/2018 |
1 | A | 1/1/2018 |
2 | A | 1/1/2018 |
3 | A | 1/1/2018 |
4 | A | 1/1/2018 |
4 | A | 1/1/2018 |
5 | A | 1/1/2018 |
5 | B | 1/1/2018 |
M Function "Table.Distinct" -
Table.Distinct(table as table, optional equationCriteria as any)
It seems like I should be able to work it into the "optional equation criteria", but I'm just sort of guessing and coming up short. Some sort of if statement...?
Power Query, if I'm reading my own pbix right (I have a similar situation: I appended a new report to an old one, and where there's a matching case number I want to keep the data from the new report), removes duplicates from the bottom up. If you make a custom column that gives anything of type A the value 1 and type B the value 2, then sort ascending on the new column, that should force it to remove the rows you want; see the sketch below.
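A minimal sketch of this approach in M, using a subset of the sample data from the question. The query name Source, the Priority column name, and the Table.Buffer step (see the note on buffering further down the thread) are my additions:

let
    Source = #table(
        type table [ID = Int64.Type, Type = text, Date = date],
        {{1, "B", #date(2018, 1, 1)}, {1, "A", #date(2018, 1, 1)},
         {5, "A", #date(2018, 1, 1)}, {5, "B", #date(2018, 1, 1)}}),
    // Rank the rows: Type A gets 1, Type B gets 2, so B rows sort last
    AddedPriority = Table.AddColumn(Source, "Priority", each if [Type] = "A" then 1 else 2, Int64.Type),
    Sorted = Table.Sort(AddedPriority, {{"Priority", Order.Ascending}}),
    // Buffer so Table.Distinct respects the sort order (see the buffering note below)
    Buffered = Table.Buffer(Sorted),
    // Table.Distinct keeps the first occurrence per key, so the B rows are removed
    Deduped = Table.Distinct(Buffered, {"ID", "Date"}),
    Result = Table.RemoveColumns(Deduped, {"Priority"})
in
    Result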
Interesting workaround. That should work...
Another question - is there a way to view the rows that are removed? My data set is almost a million rows. Applying something like this without directly viewing what was changed is a pain.
Hi @Anonymous,
After some research and testing, there is a way to view the removed rows using a query statement.
Create a copy of your source table, then remove the duplicate rows in the original table. Next, merge the original table (with the duplicate rows removed) into the copy using an inner join. You can then filter the merged table to get all the removed rows.
For example, my sample table is named 'Table1' and I create a copy named 'Copy_Table1'. Remove the duplicate rows in Table1 based on [Area], then merge it into Copy_Table1. Filter Copy_Table1 on [Copy_Table1.Record Number]=[Table1.Record Number], and finally delete the unnecessary columns; you will get the expected result. The following is my statement.
let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("jdO7EoJADIXhV2G2piBZdoFSe8dCO4bCC63MqBS+vZBjg01OkSKTr/snfR8klOE0P+6XT1kcb+/pOj4LSWWhlTTLabeMaBjKPigho8lIyNpkTchkMhEym8yEbEyu62H6k3krW5MtITuTnS+1MikVQQV0rXSex9fWNluLTKKMRSiJjEUqcVrtV4pW4sQyilji1DKKWrZ7FLmkJSh6SefTiGDqBDOKYCoE/b2VEhS5NBIUtbR2yppFLk2MRS/NjF2CDV8=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type text) meta [Serialized.Text = true]) in type table [#"Record Number" = _t, #"Starting Date" = _t, Area = _t, Count = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"Record Number", Int64.Type}, {"Starting Date", type date}, {"Area", type text}, {"Count", Int64.Type}}),
    #"Merged Queries" = Table.NestedJoin(#"Changed Type", {"Area"}, Table1, {"Area"}, "Table1", JoinKind.Inner),
    #"Expanded Table1" = Table.ExpandTableColumn(#"Merged Queries", "Table1", {"Record Number", "Starting Date", "Area", "Count"}, {"Table1.Record Number", "Table1.Starting Date", "Table1.Area", "Table1.Count"}),
    #"Added Custom" = Table.AddColumn(#"Expanded Table1", "Custom", each if [Record Number] = [Table1.Record Number] then "Yes" else "No"),
    #"Filtered Rows" = Table.SelectRows(#"Added Custom", each [Custom] = "No"),
    #"Removed Columns" = Table.RemoveColumns(#"Filtered Rows", {"Table1.Record Number", "Table1.Starting Date", "Table1.Area", "Table1.Count", "Custom"})
in
    #"Removed Columns"
You can download the .pbix file for more details.
Best Regards,
Angelia
Thank you @v-huizhn-msft
I believe I found a slightly simpler solution.
I took the original table (with sorting applied) and duplicated it as you suggested. In the new table I removed the duplicates. Then I merged using an Anti Join instead of an Inner Join.
The Anti Join returns all rows from one table that are not found in the other table, which saved a few steps; see the sketch below.
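For reference, a minimal sketch of the anti-join step in M. The query names Original (the sorted source) and Deduped (the copy with duplicates removed), and the use of [Record Number] as a unique row key, are assumptions:

// Left Anti join keeps the rows in Original that have no match in Deduped,
// i.e. exactly the rows that Remove Duplicates dropped
RemovedRows = Table.NestedJoin(Original, {"Record Number"}, Deduped, {"Record Number"}, "Match", JoinKind.LeftAnti),
// The nested "Match" column is empty for anti-join results, so drop it
CleanedRemovedRows = Table.RemoveColumns(RemovedRows, {"Match"})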
Please note one important phenomenon (or bug, if you like) when first sorting a table and then removing duplicates: you need to buffer the table after the sort, before removing duplicates; otherwise records may be removed without taking the sort order into account.
Sorted = Table.Sort(Source, sortCriteria)
Buffered = Table.Buffer(Sorted)
RemovedDuplicates = Table.Distinct(Buffered, equationCriteria)
or:
Sorted = Table.Buffer(Table.Sort(Source, sortCriteria))
RemovedDuplicates = Table.Distinct(Sorted, equationCriteria)
Parameter equationCriteria may include field names and/or comparer functions.
With comparer functions you can ignore the case and/or use a culture code: e.g. "æ" and "ae" are considered equal in some cultures.
The parameter can take on various formats as illustrated in the examples below.
let
    Source = #table(type table [Text1 = text, Text2 = text], {{"a", "Encyclopaedia"}, {"A", "Encyclopædia"}}),
    // **** Format equationCriteria: comparer function
    // Result: both records ("a" and "A" are not equal):
    RemovedDuplicates1 = Table.Distinct(Source, Comparer.FromCulture("en-US")),
    // Result: 1 record (æ and ae are considered equal in "en-US"; true = ignore case):
    RemovedDuplicates2 = Table.Distinct(Source, Comparer.FromCulture("en-US", true)),
    // Result: 2 records (æ and ae are not considered equal in "da-DK"):
    RemovedDuplicates3 = Table.Distinct(Source, Comparer.FromCulture("da-DK", true)),
    Indexed = Table.AddIndexColumn(RemovedDuplicates2, "Index", 0, 1),
    // **** Format equationCriteria: (list of) field name(s) (as generated from the user interface if columns were selected):
    RemovedDuplicates4 = Table.Distinct(Indexed, "Text1"),
    RemovedDuplicates5 = Table.Distinct(Indexed, {"Text1"}),
    RemovedDuplicates6 = Table.Distinct(Indexed, {"Text1", "Text2"}),
    // **** Format equationCriteria: list of field name with comparer function:
    RemovedDuplicates7 = Table.Distinct(Indexed, {"Text1", Comparer.OrdinalIgnoreCase}),
    // **** Format equationCriteria: list of lists with field name and comparer function:
    RemovedDuplicates8 = Table.Distinct(Indexed, {{"Text1", Comparer.OrdinalIgnoreCase}, {"Text2", Comparer.FromCulture("en-US")}})
in
    RemovedDuplicates8
The Table.Buffer was a lifesaver. I thought of sorting and then removing duplicates, but for a long time it seemed like it didn't work.
Thank you @MarcelBeug for the knowledge sharing. I'll include the step to buffer the table prior to removing duplicates.
Kudos.
IMHO the anti-join is the way to go here. I never like sorting and then relying on undocumented logic that may not always work or may change in the future.
DAX is for Analysis. Power Query is for Data Modeling
Proud to be a Super User!
MCSA: BI Reporting