March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early bird discount ends December 31.
Register NowBe one of the first to start using Fabric Databases. View on-demand sessions with database experts and the Microsoft product team to learn just how easy it is to get started. Watch now
I have on the order of one million strings where I want to find similar strings. I would like to use the Levinsthein Distance as the measure of similarity.
My first try was to do a left join of the table with itself using "fuzzy logic" but even if i set the threshold to 0 i don't get all possible combinations in the result!
Also the documentation on the fyzzy logic contains no info on how the similarity score is calculated! After som testing I verified that it is not even symetric! comparing string 1 to string 2 does not always give the same similarity as comparing string 2 to string 1!
My next atempt was using power query. I implemented a recursive function that calculates the Levinsthein Distance. That works but the performance is so low that it is not viable for milions of records.
Finally I tried DAX. The problem here is that I can't implement the comparison in a recurive way since DAX does not support recursion. I tried an iterative implementation but i can't come up with a way to refere to values calculated in the previous iterations of the iterator function ADDCOLUMNS.
Any suggestions?
What methode does fuzzy matching and clustering use to calculate the distance? Why is it not mentioned in the documentation? It's rather relevant.
Why will a fuzzy left join not return all records when the threshold is set to 0?
What's the best tool for calculating the Levinsthein distance between all possible (10^12) pairs of a milion strings?
Solved! Go to Solution.
@mabl4367 wrote:
I still have no information on what measure is used to measure the string similarity.
As I mentioned above, the measure used is Jaccard similarity.
Calculating the similarity between all pairs of 2 million strings means 4 trillion comparisons. This will take a lot of computing time and Power Query likely isn't the best tool for this particular job. You'll probably need some optimization techniques to make this more tractable. See this approach for example,
https://hpccsystems.com/blog/edit-distances-and-optimized-prefix-trees-big-data
Generating a trillion pairs is challenging no matter what tool you use. What are you trying to do that requires all possible pairs?
Microsoft documentation says that "Power Query uses the Jaccard similarity algorithm to measure the similarity between pairs of instances." Wiki: https://en.wikipedia.org/wiki/Jaccard_index
I have a system with more than two 2 000 000 vendors that needs cleenup. A lot of these vendors are duplicates and I wish to find candidate duplicates by comparing their names and adresses to give them a "duplicate score". The top scoring vendors could be removed automatically while lesser scoring vedors should be checked manually.
As long as Power Query is the tool you're using, I think clustering is likely your best option. See also:
https://radacad.com/fuzzy-clustering-in-power-bi-using-power-query-finding-similar-values
I have tried clustering but I still have no information on what measure is used to measure the string similarity. Also I initially don't want to reduce the data. I want to add a similarity score for each combination of all of the origial strings.
Right now I'm working on using Azure Data Bricks. Data bricks has a function for measuring the Levinsthein Distance.
@mabl4367 wrote:
I still have no information on what measure is used to measure the string similarity.
As I mentioned above, the measure used is Jaccard similarity.
Calculating the similarity between all pairs of 2 million strings means 4 trillion comparisons. This will take a lot of computing time and Power Query likely isn't the best tool for this particular job. You'll probably need some optimization techniques to make this more tractable. See this approach for example,
https://hpccsystems.com/blog/edit-distances-and-optimized-prefix-trees-big-data
I have marked this topic as resolved but I still have one question left:
When I make a left join of a table with itself using "fuzzy logic" and set the threshold to 0 I still don't get all possible combinations in the result? How can that be explained?
When I set the threshold to 0 I would expect all records to match and that the result effectivly should be a crossjoin. What am I missing?
March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!
Your insights matter. That’s why we created a quick survey to learn about your experience finding answers to technical questions.
Arun Ulag shares exciting details about the Microsoft Fabric Conference 2025, which will be held in Las Vegas, NV.
User | Count |
---|---|
21 | |
16 | |
13 | |
12 | |
9 |
User | Count |
---|---|
36 | |
31 | |
20 | |
19 | |
17 |