Solved: Conditionnaly remove duplicates

Cado_one · ‎03-04-2022

Hello !

I face an issue with one table, I will try to reproduce this issue as a fictive exercise below.

Here is a table :

ID	NAME	SOURCE
1	PLANT A	MYSQL
2	PLANT B1	MYSQL
2	PLANT B2	MYSQL
3	PLANT C	MYSQL
4	PLANT D	WEB API
2	PLANT B3	WEB API
5	PLANT E	WEB API

As you can see there are duplicate IDs for the plants with ID = 2. The exercise consists in removing the ID duplicates from the second source WEB API only.

In this case, the PLANT B3 would be removed but the PLANT B1 and PLANT B2 would remain in the table.

How would you solve this case ?

One rule : I don't want to filter on column NAME to remove directly the PLANT B3, I need it to be automatic.

Thanks in advance.

Best regards,

CADO

AlexisOlson · ‎03-04-2022

Split the table into WEB API and not WEB API rows. In the former, remove rows with IDs that exist in the latter and then append them together.

Try pasting this into the Advanced Editor of a new blank query to see the steps:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUQrwcfQLUXAEsnwjgwN9lGJ1opWM4OJOhrgkjFAkjOESzijiJnBxFyAr3NVJwTHAE90oYzQpU7iUK7JMLAA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, NAME = _t, SOURCE = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"NAME", type text}, {"SOURCE", type text}}),
    #"Filtered NonWEB" = Table.SelectRows(#"Changed Type", each ([SOURCE] <> "WEB API")),
    #"Filtered WEBAPI" = Table.SelectRows(#"Changed Type", each ([SOURCE] = "WEB API")),
    #"Removed Duplicates" = Table.SelectRows(#"Filtered WEBAPI", each not List.Contains(#"Filtered NonWEB"[ID], [ID])),
    #"Appended Query" = Table.Combine({#"Filtered NonWEB", #"Removed Duplicates"})
in
    #"Appended Query"

For large datasets, it's probably more efficient to use an anti-join rather than List.Contains but the idea is the same.

View solution in original post

Cado_one · ‎03-07-2022

Hi @AlexisOlson

Thank you for the trick it's working !

A little bit slow with 200 records so I will try the anti-join option.

Have a good day !

Cado

AlexisOlson · ‎03-04-2022

Split the table into WEB API and not WEB API rows. In the former, remove rows with IDs that exist in the latter and then append them together.

Try pasting this into the Advanced Editor of a new blank query to see the steps:

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUQrwcfQLUXAEsnwjgwN9lGJ1opWM4OJOhrgkjFAkjOESzijiJnBxFyAr3NVJwTHAE90oYzQpU7iUK7JMLAA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, NAME = _t, SOURCE = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"NAME", type text}, {"SOURCE", type text}}),
    #"Filtered NonWEB" = Table.SelectRows(#"Changed Type", each ([SOURCE] <> "WEB API")),
    #"Filtered WEBAPI" = Table.SelectRows(#"Changed Type", each ([SOURCE] = "WEB API")),
    #"Removed Duplicates" = Table.SelectRows(#"Filtered WEBAPI", each not List.Contains(#"Filtered NonWEB"[ID], [ID])),
    #"Appended Query" = Table.Combine({#"Filtered NonWEB", #"Removed Duplicates"})
in
    #"Appended Query"

For large datasets, it's probably more efficient to use an anti-join rather than List.Contains but the idea is the same.