Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.
Hello!
I am trying to figure out how to identify and remove duplicate rows in Power Query based on column criteria. The report displays Admissions data based on funnel stage (Prospect, Inquiry, App Start, Completed App, Accepted, Committed, Confirmed, Enrolled). In the "Contact Name" column exists many duplicate applicants who started multiple applications. I would like to keep the row that contains an "Application Status" column status of "Submitted" and remove all other rows.
Any assistance would be greatly appreciated!
Solved! Go to Solution.
Hi @08152023
You can create a blank query and put the following code to advanced editor in power query as an sample.
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45W8srPyFMIzs0syVDSUQouSSwqSU0Bsgx13VKTlGJ1cCowwq7AOT+3ICcVosQYhxmlSUAWRIkJXElwYk5OpYJXfl5qMQ6H4FKBcEkyDp3JeNRjdzC6DNSdsQA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Applicant Name" = _t, #"Application Status" = _t, Date = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Applicant Name", type text}, {"Application Status", type text}, {"Date", type date}}),
#"Grouped Rows" = Table.Group(#"Changed Type", {"Applicant Name", "Application Status"}, {{"Count", each Table.AddIndexColumn(_,"Index",1,1),type table}}),
#"Expanded Count" = Table.ExpandTableColumn(#"Grouped Rows", "Count", {"Date", "Index"}, {"Date", "Index"}),
#"Added Custom" = Table.AddColumn(#"Expanded Count", "Custom", each let a=List.Max(Table.SelectRows(#"Expanded Count",(x)=>x[Applicant Name]=[Applicant Name])[Date]),
b=Table.RowCount(Table.SelectRows(#"Expanded Count",(x)=>x[Application Status]="Completed" and x[Applicant Name]=[Applicant Name])),
c=if (b>0 and [Date]=a) or (b=0 and [Index]=1) then 1 else 0
in c),
#"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Custom] = 1)),
#"Removed Columns" = Table.RemoveColumns(#"Filtered Rows",{"Index", "Custom"})
in
#"Removed Columns"
Output
Best Regards!
Yolo Zhu
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
@v-xinruzhu-msft @j_ocean I have attached sample data below.
To give further context, let's say I have one applicant, John Smith, who started multiple applications in the month of February. The data I am pulling in from our CRM would show multiple John Smiths in the contact column. What I want Power Query to do is analyze all the applications that John Smith started, and then filter and keep the application and contact based on the latest application status (started, compled, submitted). So if John started 3 applications and only submitted one of those applications, I need PQ to filter and keep the contact source with the application status "Submitted".
Furthermore, if an applicant has 2 "Started" apps, I need to keep the applicant data with the earliest date (see below)
Group by name, status, and min date.
Then pivot status to get columns started, completed, submitted with dates.
Then add a conditional overall status column: if submitted <> null then "Submitted" and so on.
(I assume "Keep or Remove" is you manually typing it for our benefit.)
Hi @08152023
You can create a blank query and put the following code to advanced editor in power query as an sample.
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45W8srPyFMIzs0syVDSUQouSSwqSU0Bsgx13VKTlGJ1cCowwq7AOT+3ICcVosQYhxmlSUAWRIkJXElwYk5OpYJXfl5qMQ6H4FKBcEkyDp3JeNRjdzC6DNSdsQA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Applicant Name" = _t, #"Application Status" = _t, Date = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Applicant Name", type text}, {"Application Status", type text}, {"Date", type date}}),
#"Grouped Rows" = Table.Group(#"Changed Type", {"Applicant Name", "Application Status"}, {{"Count", each Table.AddIndexColumn(_,"Index",1,1),type table}}),
#"Expanded Count" = Table.ExpandTableColumn(#"Grouped Rows", "Count", {"Date", "Index"}, {"Date", "Index"}),
#"Added Custom" = Table.AddColumn(#"Expanded Count", "Custom", each let a=List.Max(Table.SelectRows(#"Expanded Count",(x)=>x[Applicant Name]=[Applicant Name])[Date]),
b=Table.RowCount(Table.SelectRows(#"Expanded Count",(x)=>x[Application Status]="Completed" and x[Applicant Name]=[Applicant Name])),
c=if (b>0 and [Date]=a) or (b=0 and [Index]=1) then 1 else 0
in c),
#"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Custom] = 1)),
#"Removed Columns" = Table.RemoveColumns(#"Filtered Rows",{"Index", "Custom"})
in
#"Removed Columns"
Output
Best Regards!
Yolo Zhu
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Another option would be to pivot on the funnel stage with a "count" summarization, then your table should have name as a unique key and a set of columns with the number of applications in each stage for each person (assuming you don't have any other application-specific columns hanging on that hose it up). You can then do a simple conditional column for "highest funnel stage completed" or something.
Hi @08152023
Can you provide some sample data , the information you have offered is little , I don't know the structure of your data so that I can't reproduce the problem you're experiencing.
Best Regards!
Yolo Zhu
Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City
Check out the April 2024 Power BI update to learn about new features.