
Frequent Visitor

## Measuring similarity between strings

I have on the order of one million strings, and I want to find similar strings among them. I would like to use the Levenshtein distance as the measure of similarity.

My first try was to do a left join of the table with itself using "fuzzy logic", but even if I set the threshold to 0, I don't get all possible combinations in the result!

Also, the documentation on the fuzzy logic contains no info on how the similarity score is calculated! After some testing I verified that it is not even symmetric: comparing string 1 to string 2 does not always give the same similarity as comparing string 2 to string 1!

My next attempt was using Power Query. I implemented a recursive function that calculates the Levenshtein distance. That works, but the performance is so low that it is not viable for millions of records.

Finally I tried DAX. The problem here is that I can't implement the comparison recursively, since DAX does not support recursion. I tried an iterative implementation, but I can't come up with a way to refer to values calculated in previous iterations of the iterator function ADDCOLUMNS.
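For reference, the recursion isn't essential: the Levenshtein distance has a standard iterative dynamic-programming formulation that keeps only two rows of the edit matrix. A minimal Python sketch of that algorithm (illustrative only; it doesn't solve the ADDCOLUMNS problem in DAX, it just shows the loop structure the recursion unrolls into):

```python
def levenshtein(a: str, b: str) -> int:
    """Iterative two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as b (shorter rows)
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```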

Any suggestions?

What method does fuzzy matching and clustering use to calculate the distance? Why is it not mentioned in the documentation? It's rather relevant.
Why does a fuzzy left join not return all records when the threshold is set to 0?

What's the best tool for calculating the Levenshtein distance between all possible (10^12) pairs of a million strings?

1 ACCEPTED SOLUTION
Super User

@mabl4367 wrote:

I still have no information on what measure is used to measure the string similarity.

As I mentioned above, the measure used is Jaccard similarity.

Calculating the similarity between all pairs of 2 million strings means 4 trillion comparisons. This will take a lot of computing time and Power Query likely isn't the best tool for this particular job. You'll probably need some optimization techniques to make this more tractable. See this approach, for example:

https://hpccsystems.com/blog/edit-distances-and-optimized-prefix-trees-big-data
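To make the idea concrete, one standard family of such optimizations is blocking: partition the strings by a cheap key and compare only within each block, so most of the N*(N-1)/2 pairs are never generated. A Python sketch (the lowercased-prefix key is an illustrative assumption, not what the linked prefix-tree approach does exactly):

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(strings, prefix_len=3):
    """Yield only pairs of strings that share a cheap blocking key
    (here, a lowercased prefix), instead of all N*(N-1)/2 pairs."""
    blocks = defaultdict(list)
    for s in strings:
        blocks[s[:prefix_len].lower()].append(s)
    for group in blocks.values():
        yield from combinations(group, 2)

names = ["Contoso Ltd", "Contoso Limited", "Fabrikam Inc", "contoso ltd."]
# Only the three "con"-prefixed names are paired with each other;
# "Fabrikam Inc" is never compared against them.
print(list(candidate_pairs(names)))
```

Blocking trades recall for speed: a typo in the first characters moves a string to a different block, which is why the linked article uses the more robust prefix-tree structure.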

6 REPLIES
Super User

Generating a trillion pairs is challenging no matter what tool you use. What are you trying to do that requires all possible pairs?

Microsoft documentation says that "Power Query uses the Jaccard similarity algorithm to measure the similarity between pairs of instances." Wiki: https://en.wikipedia.org/wiki/Jaccard_index
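The Jaccard index itself is simple: the size of the intersection of two token sets divided by the size of their union. The documentation doesn't say how Power Query tokenizes strings, so this Python sketch assumes character trigrams:

```python
def trigrams(s: str) -> set:
    """Character 3-grams of a lowercased string (an assumed tokenization;
    the docs don't specify how Power Query tokenizes)."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)} or {s}

def jaccard(s1: str, s2: str) -> float:
    """Jaccard similarity |A intersect B| / |A union B|,
    symmetric by construction."""
    a, b = trigrams(s1), trigrams(s2)
    return len(a & b) / len(a | b)

# Symmetric, unlike the scores the poster observed from the fuzzy join:
print(jaccard("contoso ltd", "contoso limited"))
print(jaccard("contoso limited", "contoso ltd"))  # same value
```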

Frequent Visitor

I have a system with more than 2,000,000 vendors that needs cleanup. A lot of these vendors are duplicates, and I wish to find candidate duplicates by comparing their names and addresses to give them a "duplicate score". The top-scoring vendors could be removed automatically, while lower-scoring vendors should be checked manually.
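As an illustration of the scoring idea (not Power Query code), a weighted blend of name and address similarity using Python's standard-library difflib; the 0.6/0.4 weights and the SequenceMatcher ratio are assumptions, not the Levenshtein distance itself:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means identical after lowercasing."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def duplicate_score(v1: dict, v2: dict, name_weight: float = 0.6) -> float:
    """Weighted blend of name and address similarity (weights are assumed)."""
    return (name_weight * similarity(v1["name"], v2["name"])
            + (1 - name_weight) * similarity(v1["address"], v2["address"]))

a = {"name": "Contoso Ltd", "address": "1 Main St"}
b = {"name": "Contoso Limited", "address": "1 Main Street"}
print(round(duplicate_score(a, b), 2))
```

Pairs above some high threshold would be merged automatically; a middle band would go to manual review.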

Super User

As long as Power Query is the tool you're using, I think clustering is likely your best option. See also:

Frequent Visitor

I have tried clustering, but I still have no information on what measure is used to measure the string similarity. Also, I initially don't want to reduce the data; I want to add a similarity score for each combination of all of the original strings.

Right now I'm working on using Azure Databricks. Databricks has a function for measuring the Levenshtein distance.
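For the record, this is the built-in `levenshtein` function in Spark SQL (also exposed as `pyspark.sql.functions.levenshtein`); a minimal fragment:

```sql
-- Edit distance between two literals; in practice the arguments would be
-- name/address columns from a pre-filtered pairs table.
SELECT levenshtein('kitten', 'sitting') AS edit_distance;  -- 3
```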


Frequent Visitor

I have marked this topic as resolved, but I still have one question left:

When I make a left join of a table with itself using "fuzzy logic" and set the threshold to 0, I still don't get all possible combinations in the result. How can that be explained?

When I set the threshold to 0, I would expect all records to match, so that the result would effectively be a cross join. What am I missing?
