Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Join us at FabCon Atlanta from March 16 - 20, 2026, for the ultimate Fabric, Power BI, AI and SQL community-led event. Save $200 with code FABCOMM. Register now.

Reply
Anonymous
Not applicable

Measuring similarity between strings

I have on the order of one million strings where I want to find similar strings. I would like to use the Levinsthein Distance as the measure of similarity.

My first try was to do a left join of the table with itself using "fuzzy logic" but even if i set the threshold to 0 i don't get all possible combinations in the result!

Also the documentation on the fyzzy logic contains no info on how the similarity score is calculated! After som testing I verified that it is not even symetric!  comparing string 1 to string 2 does not always give the same similarity as comparing string 2 to string 1!

My next atempt was using power query. I implemented a recursive function that calculates the Levinsthein Distance. That works but the performance is so low that it is not viable for milions of records.

Finally I tried DAX. The problem here is that I can't implement the comparison in a recurive way since DAX does not support recursion. I tried an iterative implementation but i can't come up with a way to refere to values calculated in the previous iterations of the iterator function ADDCOLUMNS.

Any suggestions?

What methode does fuzzy matching and clustering use to calculate the distance? Why is it not mentioned in the documentation? It's rather relevant.
Why will a fuzzy left join not return all records when the threshold is set to 0?

What's the best tool for calculating the Levinsthein distance between all possible (10^12) pairs of a milion strings?





1 ACCEPTED SOLUTION


@Anonymous wrote:

I still have no information on what measure is used to measure the string similarity.

As I mentioned above, the measure used is Jaccard similarity.

 

Calculating the similarity between all pairs of 2 million strings means 4 trillion comparisons. This will take a lot of computing time and Power Query likely isn't the best tool for this particular job. You'll probably need some optimization techniques to make this more tractable. See this approach for example,

https://hpccsystems.com/blog/edit-distances-and-optimized-prefix-trees-big-data

View solution in original post

6 REPLIES 6
AlexisOlson
Super User
Super User

Generating a trillion pairs is challenging no matter what tool you use. What are you trying to do that requires all possible pairs?

 

Microsoft documentation says that "Power Query uses the Jaccard similarity algorithm to measure the similarity between pairs of instances." Wiki: https://en.wikipedia.org/wiki/Jaccard_index

Anonymous
Not applicable

I have a system with more than two 2 000 000 vendors that needs cleenup. A lot of these vendors are duplicates and I wish to find candidate duplicates by comparing their names and adresses to give them a "duplicate score". The top scoring vendors could be removed automatically while lesser scoring vedors should be checked manually.

As long as Power Query is the tool you're using, I think clustering is likely your best option. See also:

https://radacad.com/fuzzy-clustering-in-power-bi-using-power-query-finding-similar-values

Anonymous
Not applicable

I have tried clustering but I still have no information on what measure is used to measure the string similarity. Also I initially don't want to reduce the data. I want to add a similarity score for each combination of all of the origial strings.

Right now I'm working on using Azure Data Bricks. Data bricks has a function for measuring the Levinsthein Distance.


@Anonymous wrote:

I still have no information on what measure is used to measure the string similarity.

As I mentioned above, the measure used is Jaccard similarity.

 

Calculating the similarity between all pairs of 2 million strings means 4 trillion comparisons. This will take a lot of computing time and Power Query likely isn't the best tool for this particular job. You'll probably need some optimization techniques to make this more tractable. See this approach for example,

https://hpccsystems.com/blog/edit-distances-and-optimized-prefix-trees-big-data

Anonymous
Not applicable

I have marked this topic as resolved but I still have one question left:

When I make a left join of a table with itself using "fuzzy logic" and set the threshold to 0 I still don't get all possible combinations in the result? How can that be explained?

When I set the threshold to 0 I would expect all records to match and that the result effectivly should be a crossjoin. What am I missing?

Helpful resources

Announcements
FabCon Global Hackathon Carousel

FabCon Global Hackathon

Join the Fabric FabCon Global Hackathon—running virtually through Nov 3. Open to all skill levels. $10,000 in prizes!

October Power BI Update Carousel

Power BI Monthly Update - October 2025

Check out the October 2025 Power BI update to learn about new features.

FabCon Atlanta 2026 carousel

FabCon Atlanta 2026

Join us at FabCon Atlanta, March 16-20, for the ultimate Fabric, Power BI, AI and SQL community-led event. Save $200 with code FABCOMM.

Top Kudoed Authors