mabl4367
Frequent Visitor

Measuring similarity between strings

I have on the order of one million strings and I want to find similar strings among them. I would like to use the Levenshtein distance as the measure of similarity.

My first try was to do a left join of the table with itself using fuzzy matching, but even if I set the threshold to 0 I don't get all possible combinations in the result!

Also, the documentation on fuzzy matching contains no info on how the similarity score is calculated! After some testing I verified that it is not even symmetric: comparing string 1 to string 2 does not always give the same similarity as comparing string 2 to string 1!

My next attempt was using Power Query. I implemented a recursive function that calculates the Levenshtein distance. It works, but the performance is so low that it is not viable for millions of records.

Finally I tried DAX. The problem here is that I can't implement the comparison recursively, since DAX does not support recursion. I tried an iterative implementation, but I can't come up with a way to refer to values calculated in previous iterations of the iterator function ADDCOLUMNS (the standard iterative formulation is sketched at the end of this post for reference).

Any suggestions?

What method do fuzzy matching and clustering use to calculate the distance? Why is it not mentioned in the documentation? It's rather relevant.
Why does a fuzzy left join not return all records when the threshold is set to 0?

What's the best tool for calculating the Levenshtein distance between all possible (~10^12) pairs of a million strings?
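
For reference, the Levenshtein distance can be computed without recursion using the standard two-row dynamic-programming formulation. A minimal Python sketch of that textbook algorithm (illustration only, not Power Query or DAX code):

```python
# Standard iterative (two-row dynamic-programming) Levenshtein distance.
# Shown in Python purely to illustrate the non-recursive formulation.
def levenshtein(a: str, b: str) -> int:
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```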






6 REPLIES
AlexisOlson
Super User

Generating a trillion pairs is challenging no matter what tool you use. What are you trying to do that requires all possible pairs?

 

Microsoft documentation says that "Power Query uses the Jaccard similarity algorithm to measure the similarity between pairs of instances." Wiki: https://en.wikipedia.org/wiki/Jaccard_index
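
The documentation doesn't say how the strings are tokenized, but the Jaccard index itself is just intersection size over union size of two sets. A rough Python sketch, assuming (my assumption, not documented behavior) that each string is treated as a set of character trigrams:

```python
# Jaccard similarity between two strings, treating each string as a set of
# character trigrams. The trigram tokenization is an assumption made for
# illustration; the documentation doesn't specify Power Query's tokenization.
def trigrams(s: str) -> set:
    s = s.lower()
    return {s[i:i + 3] for i in range(max(len(s) - 2, 1))}

def jaccard(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

print(jaccard("Contoso Ltd", "Contoso Limited"))
```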

I have a system with more than 2,000,000 vendors that needs cleanup. A lot of these vendors are duplicates, and I wish to find candidate duplicates by comparing their names and addresses to give them a "duplicate score". The top-scoring candidates could be removed automatically, while lower-scoring vendors should be checked manually.

As long as Power Query is the tool you're using, I think clustering is likely your best option. See also:

https://radacad.com/fuzzy-clustering-in-power-bi-using-power-query-finding-similar-values

I have tried clustering, but I still have no information on what measure is used for the string similarity. Also, I initially don't want to reduce the data; I want to add a similarity score for each combination of all of the original strings.

Right now I'm working on using Azure Databricks. Databricks has a function for measuring the Levenshtein distance.
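
For what it's worth, a rough sketch of that approach with PySpark's built-in levenshtein function (the table and column names are placeholders, and the self-join below still produces roughly all pairs unless a blocking condition is added):

```python
# Sketch of the Databricks/PySpark approach. Table name "vendors" and
# columns "id"/"name" are placeholders, not a real schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
vendors = spark.table("vendors").select("id", "name")

a, b = vendors.alias("a"), vendors.alias("b")

# id < id keeps one ordering of each pair and drops self-comparisons;
# without an extra blocking condition this is still ~n^2/2 rows.
pairs = (
    a.join(b, F.col("a.id") < F.col("b.id"))
     .select(
         F.col("a.name").alias("name_a"),
         F.col("b.name").alias("name_b"),
         F.levenshtein(F.col("a.name"), F.col("b.name")).alias("distance"),
     )
)

pairs.filter(F.col("distance") <= 2).show()
```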


@mabl4367 wrote:

I still have no information on what measure is used for the string similarity.

As I mentioned above, the measure used is Jaccard similarity.

 

Calculating the similarity between all pairs of 2 million strings means 4 trillion comparisons. That will take a lot of computing time, and Power Query likely isn't the best tool for this particular job. You'll probably need some optimization techniques to make it more tractable (a simple blocking sketch follows the link). See this approach, for example:

https://hpccsystems.com/blog/edit-distances-and-optimized-prefix-trees-big-data
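
As an illustration of the general idea (not the prefix-tree method from the linked article), a common trick is "blocking": only compare strings that share a cheap key, such as the same first character and a coarse length bucket, so most of the 10^12 pairs are never generated. A minimal Python sketch; the key choice here is a made-up example and trades recall for speed:

```python
# Generic blocking sketch: only strings that share a cheap key are compared.
# The key below (first character + coarse length bucket) is just an example;
# pairs that don't share a key are never compared, so recall is traded for speed.
from collections import defaultdict
from itertools import combinations

def block_key(s: str) -> tuple:
    s = s.strip().lower()
    return (s[:1], len(s) // 8)

def candidate_pairs(strings):
    blocks = defaultdict(list)
    for s in strings:
        blocks[block_key(s)].append(s)
    for group in blocks.values():          # compare only within each block
        yield from combinations(group, 2)

names = ["Contoso Ltd", "Contoso Limited", "Contosso Ltd", "Fabrikam Inc"]
for a, b in candidate_pairs(names):
    print(a, "<->", b)
```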

I have marked this topic as resolved, but I still have one question left:

When I make a left join of a table with itself using fuzzy matching and set the threshold to 0, I still don't get all possible combinations in the result. How can that be explained?

When I set the threshold to 0, I would expect all records to match and the result to effectively be a cross join. What am I missing?
