Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
08152023
Frequent Visitor

Power Query: Identifying and removing duplicates based on column criteria

Hello! 

I am trying to figure out how to identify and remove duplicate rows in Power Query based on column criteria. The report displays Admissions data based on funnel stage (Prospect, Inquiry, App Start, Completed App, Accepted, Committed, Confirmed, Enrolled). In the "Contact Name" column exists many duplicate applicants who started multiple applications. I would like to keep the row that contains an "Application Status" column status of "Submitted" and remove all other rows. 

 

Any assistance would be greatly appreciated!

1 ACCEPTED SOLUTION

Hi @08152023 

You can create a blank query and put the following code to advanced editor in power query as an sample.

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45W8srPyFMIzs0syVDSUQouSSwqSU0Bsgx13VKTlGJ1cCowwq7AOT+3ICcVosQYhxmlSUAWRIkJXElwYk5OpYJXfl5qMQ6H4FKBcEkyDp3JeNRjdzC6DNSdsQA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Applicant Name" = _t, #"Application Status" = _t, Date = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Applicant Name", type text}, {"Application Status", type text}, {"Date", type date}}),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"Applicant Name", "Application Status"}, {{"Count", each Table.AddIndexColumn(_,"Index",1,1),type table}}),
    #"Expanded Count" = Table.ExpandTableColumn(#"Grouped Rows", "Count", {"Date", "Index"}, {"Date", "Index"}),
    #"Added Custom" = Table.AddColumn(#"Expanded Count", "Custom", each let a=List.Max(Table.SelectRows(#"Expanded Count",(x)=>x[Applicant Name]=[Applicant Name])[Date]),
b=Table.RowCount(Table.SelectRows(#"Expanded Count",(x)=>x[Application Status]="Completed" and x[Applicant Name]=[Applicant Name])),
c=if (b>0 and [Date]=a) or (b=0 and [Index]=1) then 1 else 0
in c),
    #"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Custom] = 1)),
    #"Removed Columns" = Table.RemoveColumns(#"Filtered Rows",{"Index", "Custom"})
in
    #"Removed Columns"

Output

vxinruzhumsft_0-1709883295021.png

Best Regards!

Yolo Zhu

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

 

View solution in original post

6 REPLIES 6
08152023
Frequent Visitor

@v-xinruzhu-msft @j_ocean I have attached sample data below. 

 

To give further context, let's say I have one applicant, John Smith, who started multiple applications in the month of February. The data I am pulling in from our CRM would show multiple John Smiths in the contact column. What I want Power Query to do is analyze all the applications that John Smith started, and then filter and keep the application and contact based on the latest application status (started, compled, submitted). So if John started 3 applications and only submitted one of those applications, I need PQ to filter and keep the contact source with the application status "Submitted". 

 

Furthermore, if an applicant has 2 "Started" apps, I need to keep the applicant data with the earliest date (see below) 

 

 

Screen Shot 2024-03-07 at 11.19.11 AM.png

Group by name, status, and min date.


Then pivot status to get columns started, completed, submitted with dates.

 

Then add a conditional overall status column: if submitted <> null then "Submitted" and so on.

 

(I assume "Keep or Remove" is you manually typing it for our benefit.)

Hi @08152023 

You can create a blank query and put the following code to advanced editor in power query as an sample.

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45W8srPyFMIzs0syVDSUQouSSwqSU0Bsgx13VKTlGJ1cCowwq7AOT+3ICcVosQYhxmlSUAWRIkJXElwYk5OpYJXfl5qMQ6H4FKBcEkyDp3JeNRjdzC6DNSdsQA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Applicant Name" = _t, #"Application Status" = _t, Date = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Applicant Name", type text}, {"Application Status", type text}, {"Date", type date}}),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"Applicant Name", "Application Status"}, {{"Count", each Table.AddIndexColumn(_,"Index",1,1),type table}}),
    #"Expanded Count" = Table.ExpandTableColumn(#"Grouped Rows", "Count", {"Date", "Index"}, {"Date", "Index"}),
    #"Added Custom" = Table.AddColumn(#"Expanded Count", "Custom", each let a=List.Max(Table.SelectRows(#"Expanded Count",(x)=>x[Applicant Name]=[Applicant Name])[Date]),
b=Table.RowCount(Table.SelectRows(#"Expanded Count",(x)=>x[Application Status]="Completed" and x[Applicant Name]=[Applicant Name])),
c=if (b>0 and [Date]=a) or (b=0 and [Index]=1) then 1 else 0
in c),
    #"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Custom] = 1)),
    #"Removed Columns" = Table.RemoveColumns(#"Filtered Rows",{"Index", "Custom"})
in
    #"Removed Columns"

Output

vxinruzhumsft_0-1709883295021.png

Best Regards!

Yolo Zhu

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

 

j_ocean
Helper V
Helper V

Another option would be to pivot on the funnel stage with a "count" summarization, then your table should have name as a unique key and a set of columns with the number of applications in each stage for each person (assuming you don't have any other application-specific columns hanging on that hose it up). You can then do a simple conditional column for "highest funnel stage completed" or something.

v-xinruzhu-msft
Community Support
Community Support

Hi @08152023 

Can you provide some sample data , the information you have offered is little , I don't know the structure of your data so that  I can't reproduce the problem you're experiencing.

 

Best Regards!

Yolo Zhu

 
 
j_ocean
Helper V
Helper V

  1. Reference-off a new query
  2. Filter the new query for rows with "Submitted"
  3. Merge this back into the original table, based on name (or id or whatever), expand out name and status
  4. Conditional column if the new status column (which should now be null or Submitted, nothing else) = the old status column, then "Match", else if old name = new name then "drop", else "in process"
  5. Filter out "drop"
  6. Remiove the temporary columns you just made

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors
Top Kudoed Authors