I am building a dataset/model from an Excel workbook prepared by others and split into multiple worksheets. The worksheets are lists of providers and their counts of (1) established visits and (2) new visits, plus the same lists restricted to providers with fewer than 26 (3) established visits or (4) new visits. I am also pulling the provider IDs from yet another Excel workbook.
Anywho, in combining the queries so that I would have all Established visits in one query and all New visits in another query, each with the provider ID (i.e., NPI), the fact that there are two providers with the same name has caused a problem: I have duplicate values in the NPI column, but the corresponding rows are not duplicates. See image:
So, I tried setting up a custom column to create duplicates that I could then remove, like so:
each if ([NPI]="1760810293" and [taxonomy_key]="14" and [group_key]="2") then "1871035881"
else if ([NPI]="1871035881" and [taxonomy_key]="54" and [group_key]="95") then "1760810293"
else [NPI]
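The full step, added via Table.AddColumn, looks roughly like this ("NPI_fixed" and the preceding step name #"Changed Type" are just placeholders for whatever my query actually uses):

#"Added Custom" = Table.AddColumn(#"Changed Type", "NPI_fixed", each
    if ([NPI]="1760810293" and [taxonomy_key]="14" and [group_key]="2") then "1871035881"
    else if ([NPI]="1871035881" and [taxonomy_key]="54" and [group_key]="95") then "1760810293"
    else [NPI], type text)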
As you can see in the image above, that didn't work.
Essentially, I need to get rid of rows 1 and 3. Note that the above image is filtered to show the two NPIs of interest. There are 697 rows in total; there should be 695. Help! What should I do?
@cathoms
I am not sure whether the order and pattern of your actual records match what you have shown in the sample. I merged all the columns except NPI and applied "Remove Duplicates", which keeps the first occurrence of each value and removes the rest. Check the following code after pasting it into a blank query in the Advanced Editor.
let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMjCxNFbSUTI0ARJGQGxibAYkLZRidaKVTC0sDAnImYLkLE1BqgwtQErAklBDsUjGAgA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [A = _t, B = _t, C = _t, D = _t, E = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"A", Int64.Type}, {"B", Int64.Type}, {"C", Int64.Type}, {"D", Int64.Type}, {"E", Int64.Type}}),
    #"Inserted Merged Column" = Table.AddColumn(#"Changed Type", "Merged", each Text.Combine({Text.From([B], "en-US"), Text.From([C], "en-US"), Text.From([D], "en-US"), Text.From([E], "en-US")}, "|"), type text),
    #"Removed Duplicates" = Table.Distinct(#"Inserted Merged Column", {"Merged"})
in
    #"Removed Duplicates"
So basically you are creating a new column based on all the other columns, then removing duplicates based on that new column. What you are showing gets rid of the wrong rows, though. However, if Remove Duplicates keeps the first occurrence, would it work if I sort descending first?
Also, could I just remove duplicates based on all the other columns directly, without needing to create a new merged column?
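For example, would something like this work? I am only guessing at the syntax, and the step and column names below are placeholders for my real ones:

// sort so the row I want to keep comes first; Table.Buffer is my attempt to make sure
// the sort order actually sticks before Remove Duplicates runs
#"Sorted Rows" = Table.Buffer(Table.Sort(#"Previous Step", {{"NPI", Order.Descending}})),
// remove duplicates directly on every column except NPI, with no merged helper column
#"Removed Duplicates" = Table.Distinct(#"Sorted Rows", {"provider_name", "taxonomy_key", "group_key", "visit_count"})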