Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Join us for an expert-led overview of the tools and concepts you'll need to become a Certified Power BI Data Analyst and pass exam PL-300. Register now.

Reply
Anonymous
Not applicable

Power Query append ,removing duplicate rows

To whom it may concern:

 

I am using power query to combine several different files, all the columns are the same, so it's straight forward in that context. Each row in every file is unique on an ID number, however, file over file there may be duplicates as the row may have been updated between dumps. What I would like is to be able to remove the row from the older file.

 

The data looks something like this:

File 1 - January 1

IDApplication Received DateApplication Assessed DateFunding Amount
1December 20, 2020December 21, 2020$400
2December 22, 2020December 22, 2020$500
3December 22, 2020  
4October 2, 2020October 4, 2020$1000

 

File 2 - January 30

IDApplication Received DateApplication Assessed Date Funding Amount
3December 22, 2020December 24, 2020$300
4November 1, 2020November 4, 2020$300
5November 3, 2020NOvember 5, 2020$500
6November 9, 2020November 11, 2020$600

 

All columns can change information except the ID, so I would not be able to group on anything. In the end, I would like to keep the ID files that are common to be from the more recent file, i.e. in the example above, I would like to keep ID 1 and 2 from file 1 and thne 3, 4, 5, and 6 from the second one.

 

I think the first step is to add in the date of the file as a column, and then group on the ID and date the file was created, but I don't know what to do from there.

 

Any help would be appreciated.

 

Thanks in advance,

 

Van

1 ACCEPTED SOLUTION
Jimmy801
Community Champion
Community Champion

Hello @Anonymous 

 

what you can do is to take the older file, join it with the newer file using a LeftAnti-join (IDs from the newer file will be removed) and then combining with the newer file. If you have a lot of files this could be some challengen. Another option could be to integrate the creation date of the file or another information that gives you somehow a information of time and then group by id, then taking only the date with max value. Here an example on what I mean

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUXJJTU7NTUotUjAy0AFiIwMUMUO4mIqJAYgC8gz1DUBIKVYnWskIRbURFhMQYiqm2EwwxqFaAYrRlJsAhfyTS/LBquGKYSImCMsMDUixDSGGZIQxignGCAf45ZdBVCOCBy5ESL8psmJjhH5/qJAprvCC6jdD1m+Jab8hUpSZYRgQCwA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, #"Application Received Date" = _t, #"Application Assessed Date" = _t, #"Funding Amount" = _t, #"Creation date" = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Application Received Date", type date}, {"Application Assessed Date", type date}, {"Funding Amount", type text}, {"Creation date", type date}}),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"MaxCreationDate", each Table.Max(_,"Creation date")}}),
    #"Expanded MaxCreationDate" = Table.ExpandRecordColumn(#"Grouped Rows", "MaxCreationDate", {"Application Received Date", "Application Assessed Date", "Funding Amount", "Creation date"}, {"Application Received Date", "Application Assessed Date", "Funding Amount", "Creation date"})
in
    #"Expanded MaxCreationDate"

transforms this

Jimmy801_0-1613639105800.png

into this

Jimmy801_1-1613639155130.png

Copy paste this code to the advanced editor in a new blank query to see how the solution works.

If this post helps or solves your problem, please mark it as solution (to help other users find useful content and to acknowledge the work of users that helped you)
Kudoes are nice too

Have fun

Jimmy

 

View solution in original post

4 REPLIES 4
Sergii24
Super User
Super User

Hi @Anonymous ,


What you're looking for is perfectly explained in this article https://radacad.com/grouping-in-power-query-getting-the-last-item-in-each-group. There is no much to add 🙂


Good luck!

P.S. I've used it by myself just few days ago and it worked exactly as expected! Radacad articles are always easy to follow.

Jimmy801
Community Champion
Community Champion

Hello @Anonymous 

 

what you can do is to take the older file, join it with the newer file using a LeftAnti-join (IDs from the newer file will be removed) and then combining with the newer file. If you have a lot of files this could be some challengen. Another option could be to integrate the creation date of the file or another information that gives you somehow a information of time and then group by id, then taking only the date with max value. Here an example on what I mean

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUXJJTU7NTUotUjAy0AFiIwMUMUO4mIqJAYgC8gz1DUBIKVYnWskIRbURFhMQYiqm2EwwxqFaAYrRlJsAhfyTS/LBquGKYSImCMsMDUixDSGGZIQxignGCAf45ZdBVCOCBy5ESL8psmJjhH5/qJAprvCC6jdD1m+Jab8hUpSZYRgQCwA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, #"Application Received Date" = _t, #"Application Assessed Date" = _t, #"Funding Amount" = _t, #"Creation date" = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Application Received Date", type date}, {"Application Assessed Date", type date}, {"Funding Amount", type text}, {"Creation date", type date}}),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"ID"}, {{"MaxCreationDate", each Table.Max(_,"Creation date")}}),
    #"Expanded MaxCreationDate" = Table.ExpandRecordColumn(#"Grouped Rows", "MaxCreationDate", {"Application Received Date", "Application Assessed Date", "Funding Amount", "Creation date"}, {"Application Received Date", "Application Assessed Date", "Funding Amount", "Creation date"})
in
    #"Expanded MaxCreationDate"

transforms this

Jimmy801_0-1613639105800.png

into this

Jimmy801_1-1613639155130.png

Copy paste this code to the advanced editor in a new blank query to see how the solution works.

If this post helps or solves your problem, please mark it as solution (to help other users find useful content and to acknowledge the work of users that helped you)
Kudoes are nice too

Have fun

Jimmy

 

MattAllington
Community Champion
Community Champion

Are the IDs incrementing over time, always?  If so, you could do a group by all other columns (the ones the same) and then do a max of the ID column. 



* Matt is an 8 times Microsoft MVP (Power BI) and author of the Power BI Book Supercharge Power BI.
I will not give you bad advice, even if you unknowingly ask for it.
CNENFRNL
Community Champion
Community Champion

@Anonymous , of coz grouping all combined records by ID is a most straightforward and efficient solution to your issue.

 

First, combine all files in a CHRONOLOGICAL order (important!);

Secondly, index the combined records;

Thirdly, group the records by ID and keep the record with the max index.

let
    File1 = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUXJJTU7NTUotUjAy0AFiIwMUMUO4mIqJgYFSrE60khGKAiMsmhBiKqZQTcY4FCiAMUiFCZDln1ySD1YAl4eJmCCMNDQAmRkLAA==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, #"Application Received Date" = _t, #"Application Assessed Date" = _t, #"Funding Amount" = _t]),
    File2 = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlbSUXJJTU7NTUotUjAy0lEwMjAyQBEzgYupGBsYKMXqRCuZADl++WUQBYZwebgQFi2myPLGCC3+UCFThBZTqBYzZC2WmLYYImxWMQPpiQUA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, #"Application Received Date" = _t, #"Application Assessed Date" = _t, #"Funding Amount" = _t]),
    #"Combined Files" = File1 & File2,
    Columns = Table.ColumnNames(#"Combined Files"),
    #"Added Index" = Table.AddIndexColumn(#"Combined Files", "Index", 1, 1, Int64.Type),
    #"Grouped Rows" = Table.Group(#"Added Index", {"ID"}, {{"ar", each _, type table [ID=nullable text, Application Received Date=nullable text, Application Assessed Date=nullable text, Funding Amount=nullable text, Index=number]}}),
    #"Removed Columns" = Table.TransformColumns(Table.RemoveColumns(#"Grouped Rows",{"ID"}), {{"ar", each Table.Last(Table.Sort(_, {"Index", Order.Ascending}))}}),
    #"Expanded ar" = Table.ExpandRecordColumn(#"Removed Columns", "ar", Columns, Columns)
in
    #"Expanded ar"

Screenshot 2021-02-18 095603.png


Thanks to the great efforts by MS engineers to simplify syntax of DAX! Most beginners are SUCCESSFULLY MISLED to think that they could easily master DAX; but it turns out that the intricacy of the most frequently used RANKX() is still way beyond their comprehension!

DAX is simple, but NOT EASY!

Helpful resources

Announcements
Join our Fabric User Panel

Join our Fabric User Panel

This is your chance to engage directly with the engineering team behind Fabric and Power BI. Share your experiences and shape the future.

June 2025 Power BI Update Carousel

Power BI Monthly Update - June 2025

Check out the June 2025 Power BI update to learn about new features.

June 2025 community update carousel

Fabric Community Update - June 2025

Find out what's new and trending in the Fabric community.