mabl4367
Frequent Visitor

Measuring similarity between strings

I have on the order of one million strings and I want to find similar strings among them. I would like to use the Levenshtein distance as the measure of similarity.

My first try was to do a left join of the table with itself using fuzzy matching, but even if I set the threshold to 0 I don't get all possible combinations in the result!

Also, the documentation on fuzzy matching contains no info on how the similarity score is calculated! After some testing I verified that it is not even symmetric: comparing string 1 to string 2 does not always give the same similarity as comparing string 2 to string 1!

My next attempt was using Power Query. I implemented a recursive function that calculates the Levenshtein distance. It works, but the performance is so low that it is not viable for millions of records.

Finally I tried DAX. The problem here is that I can't implement the comparison recursively, since DAX does not support recursion. I tried an iterative implementation, but I can't come up with a way to refer to values calculated in previous iterations of the iterator function ADDCOLUMNS (the standard iterative formulation is sketched at the end of this post for reference).

Any suggestions?

What method do fuzzy matching and clustering use to calculate the distance? Why is it not mentioned in the documentation? It's rather relevant.
Why does a fuzzy left join not return all records when the threshold is set to 0?

What's the best tool for calculating the Levenshtein distance between all possible (~10^12) pairs of a million strings?
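
For reference, the Levenshtein distance can be computed without recursion using the standard two-row dynamic-programming formulation. A minimal Python sketch of that textbook algorithm (illustration only, not Power Query or DAX code):

```python
# Standard iterative (two-row dynamic-programming) Levenshtein distance.
# Shown in Python purely to illustrate the non-recursive formulation.
def levenshtein(a: str, b: str) -> int:
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```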






6 REPLIES
AlexisOlson
Super User

Generating a trillion pairs is challenging no matter what tool you use. What are you trying to do that requires all possible pairs?

 

Microsoft documentation says that "Power Query uses the Jaccard similarity algorithm to measure the similarity between pairs of instances." Wiki: https://en.wikipedia.org/wiki/Jaccard_index
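
The documentation doesn't say how the strings are tokenized, but the Jaccard index itself is just intersection size over union size of two sets. A rough Python sketch, assuming (my assumption, not documented behavior) that each string is treated as a set of character trigrams:

```python
# Jaccard similarity between two strings, treating each string as a set of
# character trigrams. The trigram tokenization is an assumption made for
# illustration; the documentation doesn't specify Power Query's tokenization.
def trigrams(s: str) -> set:
    s = s.lower()
    return {s[i:i + 3] for i in range(max(len(s) - 2, 1))}

def jaccard(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

print(jaccard("Contoso Ltd", "Contoso Limited"))
```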

I have a system with more than 2,000,000 vendors that needs cleanup. A lot of these vendors are duplicates, and I wish to find candidate duplicates by comparing their names and addresses to give them a "duplicate score". The top-scoring candidates could be removed automatically, while lower-scoring vendors should be checked manually.

As long as Power Query is the tool you're using, I think clustering is likely your best option. See also:

https://radacad.com/fuzzy-clustering-in-power-bi-using-power-query-finding-similar-values

I have tried clustering, but I still have no information on what measure is used for the string similarity. Also, I initially don't want to reduce the data; I want to add a similarity score for each combination of all of the original strings.

Right now I'm working on using Azure Databricks. Databricks has a function for measuring the Levenshtein distance.
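
For what it's worth, a rough sketch of that approach with PySpark's built-in levenshtein function (the table and column names are placeholders, and the self-join below still produces roughly all pairs unless a blocking condition is added):

```python
# Sketch of the Databricks/PySpark approach. Table name "vendors" and
# columns "id"/"name" are placeholders, not a real schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
vendors = spark.table("vendors").select("id", "name")

a, b = vendors.alias("a"), vendors.alias("b")

# id < id keeps one ordering of each pair and drops self-comparisons;
# without an extra blocking condition this is still ~n^2/2 rows.
pairs = (
    a.join(b, F.col("a.id") < F.col("b.id"))
     .select(
         F.col("a.name").alias("name_a"),
         F.col("b.name").alias("name_b"),
         F.levenshtein(F.col("a.name"), F.col("b.name")).alias("distance"),
     )
)

pairs.filter(F.col("distance") <= 2).show()
```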


@mabl4367 wrote:

I still have no information on what measure is used for the string similarity.

As I mentioned above, the measure used is Jaccard similarity.

 

Calculating the similarity between all pairs of 2 million strings means 4 trillion comparisons. That will take a lot of computing time, and Power Query likely isn't the best tool for this particular job. You'll probably need some optimization techniques to make it more tractable (a simple blocking sketch follows the link). See this approach, for example:

https://hpccsystems.com/blog/edit-distances-and-optimized-prefix-trees-big-data
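
As an illustration of the general idea (not the prefix-tree method from the linked article), a common trick is "blocking": only compare strings that share a cheap key, such as the same first character and a coarse length bucket, so most of the 10^12 pairs are never generated. A minimal Python sketch; the key choice here is a made-up example and trades recall for speed:

```python
# Generic blocking sketch: only strings that share a cheap key are compared.
# The key below (first character + coarse length bucket) is just an example;
# pairs that don't share a key are never compared, so recall is traded for speed.
from collections import defaultdict
from itertools import combinations

def block_key(s: str) -> tuple:
    s = s.strip().lower()
    return (s[:1], len(s) // 8)

def candidate_pairs(strings):
    blocks = defaultdict(list)
    for s in strings:
        blocks[block_key(s)].append(s)
    for group in blocks.values():          # compare only within each block
        yield from combinations(group, 2)

names = ["Contoso Ltd", "Contoso Limited", "Contosso Ltd", "Fabrikam Inc"]
for a, b in candidate_pairs(names):
    print(a, "<->", b)
```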

I have marked this topic as resolved, but I still have one question left:

When I make a left join of a table with itself using fuzzy matching and set the threshold to 0, I still don't get all possible combinations in the result. How can that be explained?

When I set the threshold to 0, I would expect all records to match and the result to effectively be a cross join. What am I missing?
