Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Get Fabric Certified for FREE during Fabric Data Days. Don't miss your chance! Request now

Reply
Phantomiceman
Frequent Visitor

Scoring Duplicates in table

Hi I have been tasked to look for duplication on our  company database and come up with dashboard that will present it so we can use it for tracking and correction.

 

I am currently looking at companies in our database and finding what companies may have been duplicated, i have come up with an intial solution of taking a partial part of the name of the company and concenating the postcode together to form a match string and using the EARLIER and FILTER functions finding the records that return multiple rows that share the same match string. On the whole this appears to work however the results returned contain both postives and false postives too.

 

I want to therefore make this more intutiative in that the duplicates are scored based on their similarties so the more they match the higher they score. Depnding how high they score will determine the  type of correction required.

 

The sample data would look something like this

 

Company IDCompany NamePostcodeCreation Date
1ABC LTD AB11 7AH01/01/2023
2ABC LTD AB11 7AH02/01/2023
3John AND Smith LTDKT6 5PT01/01/2023
5John & Smith LTDKT6 5PT02/01/2023
6123 LTDBB1 8AL01/01/2023
7123 LTDM26 7HT02/01/2023
8XYZ and SonsBB1 8AL01/01/2023
9XYZ & SonsM26 7HT02/01/2023
10Joe Blogs Limited M1 8AB01/01/2023

 

I want these to to have different tiers of scoring so something like this

 

100%  = Company Name and Postcode Match exactly

75% =   Postcodes match but company names somewhat match

50% = Company Name matche eaxctly but Postcodes are different

25% = Company names somewhat match but poscodes are differrent

0% = Neither matches and the record is unique

 

I also want  to state which record is the orginal and which is the duplicate. Ive done this currently using rankx function finding records which are share the same match string orderd by the creation date so the ones created first are marked as orginal and those after are marked as duplicates.This allows us to track how many duplicates are being created during a time period.Im not  sure how this could be factored into the solution

 

The end result should be something like this

 

Company IDCompany NamePostcodeCreation DateMatch PercentageRecord Type
1ABC LTD AB11 7AH01/01/2023100Orginal
2ABC LTD AB11 7AH02/01/2023100Duplicate
3John AND Smith LTDKT6 5PT01/01/202375Orginal
5John & Smith LTDKT6 5PT02/01/202375Duplicate
6123 LTDBB1 8AL01/01/202350Orginal
7123 LTDM26 7HT02/01/202350Duplicate
8XYZ and SonsBB1 8AL01/01/202325Orginal
9XYZ & SonsM26 7HT02/01/202325Duplicate
10Joe Blogs Limited L1 8AB01/01/20230Orignal

 

Im hoping to create visuals and filters based on the scores so you can select records based on how well they match

 

Apologies but im not very familar with string matching using various string matching  techniques so you might need to simplify this for me

0 REPLIES 0

Helpful resources

Announcements
November Power BI Update Carousel

Power BI Monthly Update - November 2025

Check out the November 2025 Power BI update to learn about new features.

Fabric Data Days Carousel

Fabric Data Days

Advance your Data & AI career with 50 days of live learning, contests, hands-on challenges, study groups & certifications and more!

FabCon Atlanta 2026 carousel

FabCon Atlanta 2026

Join us at FabCon Atlanta, March 16-20, for the ultimate Fabric, Power BI, AI and SQL community-led event. Save $200 with code FABCOMM.

Top Solution Authors
Top Kudoed Authors