Join us for an expert-led overview of the tools and concepts you'll need to pass exam PL-300. The first session starts on June 11th. See you there!
Get registeredPower BI is turning 10! Let’s celebrate together with dataviz contests, interactive sessions, and giveaways. Register now.
Hi,
By default, Power Query keeps the oldest record and deletes the newest record when removing duplicates. I need the opposite: delete the oldest record and keep the newest.
The following code does the trick. The problem is that it works record by record, and when the table size is large, the process slows down dramatically.
Max Index by duplicates =
List.Max( Table.SelectRows(#Index add, (r) = r[N_Invoice] = [N_Invoice])[Index] )
I need help to be able to do it more efficiently.
Thanks!!
Link to Dataset
https://drive.google.com/drive/folders/1LanSPa_O1_qirmJYPIBh1IaJQkMJ910_?usp=sharing
Solved! Go to Solution.
Assuming the latest is the last entry for any set of duplicates, you can
let
//Read in your data
Source = Csv.Document(File.Contents("C:\Users\ron\Desktop\Dataset.csv"),[Delimiter=";", Columns=2, Encoding=1252, QuoteStyle=QuoteStyle.None]),
#"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
#"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"N_Invoice", type text}, {"Index", Int64.Type}}),
//Group by N_invoice and retain the last row
#"Grouped Rows" = Table.Group(#"Changed Type", {"N_Invoice"}, {
{"Latest", each Table.Last(_), type [N_Invoice=nullable text, Index=nullable number]}}),
//remove column and re-expand
#"Removed Columns" = Table.RemoveColumns(#"Grouped Rows",{"N_Invoice"}),
#"Expanded Latest" = Table.ExpandRecordColumn(#"Removed Columns", "Latest", {"N_Invoice", "Index"}, {"N_Invoice", "Index"})
in
#"Expanded Latest"
This will also retain the original ordering of the entries. You can see in the screenshot below, FV-084482 is in the second row of both your original data, as well as the dedup'd data, even though the "latest" invoice was found near the bottom of the table as indicated by the associated Index number.
Note that Table.Group is not guaranteed to retain your original row order, which would be important in the above algorithm. However, others have reported that the presence of an Index column ameliorates this issue. So ensure that your actual data has an index column (or add one if it does not).
If there are issues of that nature, then merely change the custom aggregation in the Table.Group function to:
{"Latest", (t)=> Table.SelectRows(t, each [Index]=List.Max(t[Index])), type table [N_Invoice=nullable text, Index=nullable number]}})
If you want a different order, please specify.
Hi @GFire ,
Thank you for reaching out to the Fabric community.
You are correct by default, Power Query keeps the first (oldest) record when removing duplicates. However, if you prefer to keep the latest (newest) record, your method of finding the max index per duplicate is accurate.
I tested your scenario with your provided dataset and implemented a more optimized approach.
This solution avoids the row by row logic and performs efficiently on larger datasets.
FYI:
Thanks for your response @ronrsnfld & @Nasif_Azam .
Hope this helps. If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Assuming the latest is the last entry for any set of duplicates, you can
let
//Read in your data
Source = Csv.Document(File.Contents("C:\Users\ron\Desktop\Dataset.csv"),[Delimiter=";", Columns=2, Encoding=1252, QuoteStyle=QuoteStyle.None]),
#"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
#"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"N_Invoice", type text}, {"Index", Int64.Type}}),
//Group by N_invoice and retain the last row
#"Grouped Rows" = Table.Group(#"Changed Type", {"N_Invoice"}, {
{"Latest", each Table.Last(_), type [N_Invoice=nullable text, Index=nullable number]}}),
//remove column and re-expand
#"Removed Columns" = Table.RemoveColumns(#"Grouped Rows",{"N_Invoice"}),
#"Expanded Latest" = Table.ExpandRecordColumn(#"Removed Columns", "Latest", {"N_Invoice", "Index"}, {"N_Invoice", "Index"})
in
#"Expanded Latest"
This will also retain the original ordering of the entries. You can see in the screenshot below, FV-084482 is in the second row of both your original data, as well as the dedup'd data, even though the "latest" invoice was found near the bottom of the table as indicated by the associated Index number.
Note that Table.Group is not guaranteed to retain your original row order, which would be important in the above algorithm. However, others have reported that the presence of an Index column ameliorates this issue. So ensure that your actual data has an index column (or add one if it does not).
If there are issues of that nature, then merely change the custom aggregation in the Table.Group function to:
{"Latest", (t)=> Table.SelectRows(t, each [Index]=List.Max(t[Index])), type table [N_Invoice=nullable text, Index=nullable number]}})
If you want a different order, please specify.
Hey @GFire ,
You can efficiently keep the most recent record for each N_Invoice in Power Query by leveraging Table.Group and List.Max or by using a stable sort followed by Table.Distinct with a twist.
Sort by Index descending (newest records first).
Remove duplicates based on N_Invoice, keeping the first row (which is now the most recent due to the sort).
let Source = ... , // your original table here SortedTable = Table.Sort(Source, {{"Index", Order.Descending}}), Deduplicated = Table.Distinct(SortedTable, {"N_Invoice"}) in Deduplicated
Sorting by Index in descending order ensures the most recent entries come first. Table.Distinct keeps the first occurrence of each duplicate group (based on N_Invoice), which in this sorted state is the most recent record.
You mentioned this code which is a slow code so avoid that code: List.Max( Table.SelectRows(#"Index add", (r) => r[N_Invoice] = [N_Invoice])[Index] )
For Detailed Information:
Avoid row context for better performance
Remove duplicates and keep the last (or first) record in Power BI
If you found this solution helpful, please consider accepting it and giving it a kudos (Like) it’s greatly appreciated and helps others find the solution more easily.
Best Regards,
Nasif Azam
Hi,
It doesn't work. For example: N_Invoice number FV-084497 is in the dataset with index numbers 5 and 59. Once the code was applied, the resulting table removed index:59 and retained index:5.
Hi @GFire ,
Thanks for your feedback. Let's try this M code with confident:
let Source = ... , // Replace with your actual source Grouped = Table.Group(Source, {"N_Invoice"}, { {"TopRow", each Table.Sort(_, {{"Index", Order.Descending}}){0}, type record} }), Result = Table.FromRecords(Grouped[TopRow]) in Result
Groups the table by N_Invoice
Sorts each group in descending order by Index
Picks the first row from each group (i.e., the most recent one)
Flattens it back into a normal table
If you found this solution helpful, please consider accepting it and giving it a kudos (Like) it’s greatly appreciated and helps others find the solution more easily.
Best Regards,
Nasif Azam