March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early bird discount ends December 31.
Register NowBe one of the first to start using Fabric Databases. View on-demand sessions with database experts and the Microsoft product team to learn just how easy it is to get started. Watch now
I have a rather big table with customer first and last names and need to check if they are similar, so I can then check if they are actually different people or just entered with typos.
If I do left() and right() with low numbers on the first and last name columns - as there is no LIKE operator in PBI and so I can hopefully catch most of the entries I want to filter for - the query takes way too long to complete because it has to go through so many entries and variations to evaluate.
Any ideas?
Ok, I think I found a way to work around this, in a way, with the levenshtein likeness calculation.
I managed to implement it as a custom function in the query editor as
---
= (s1 as nullable text, s2 as nullable text) =>
if s1 = null or s2 = null then "" else
let
l1 = Text.Length(s1),
l2 = Text.Length(s2),
d = List.Generate(
() => [i=0, j=0, cost=0],
each [i] <= l1,
each if [j] = l2 then [i]+1 else null,
each
if [i] = 0 then
[i=1, j=1, cost=if s1{0} = s2{0} then 0 else 1]
else if s1{[i]-1} = s2{[j]-1} then
[i=[i]+1, j=[j]+1, cost=[cost]]
else
let
ins = [i=[i], j=[j]+1, cost=[cost]+1],
del = [i=[i]+1, j=[j], cost=[cost]+1],
sub = [i=[i]+1, j=[j]+1, cost=[cost]+1]
in
List.Min({ins, del, sub}, each [i]<=l1 and [j]<=l2)
)
in
List.Last(d)[cost]
---
but I am unsure how to actually use it to add a new calculated column, as I need to compare the value in [name] to the earlier entry(entries) in the same column, filtered by a shared value in a different column.
I tried sorting the whole thing by the filtering column and then use
= Table.AddColumn(#"Hinzugefügter Index", "Name_likeness", each LSTEIN([name], List.Range([name],[Index]-1,1)))
but that got me an error message.
You could also try out using Fuzzy Merge or Cluster values in Power Query, see https://learn.microsoft.com/en-us/power-query/fuzzy-matching
I've had mixed success for that myself, but if using Excel isn't too far out of scope for you I always get reliable fuzzy matching results using the free Fuzzy Lookup add in for Excel https://www.microsoft.com/en-nz/download/details.aspx?id=15011
I don't see the cluster option in my menu.
Oh sorry, I haven't used cluster values myself, just saw the cluster values feature is available only for Power Query Online.
The fuzzy merge option is available in vanilla Power Query when merging queries though. I haven't found it very reliable compared to the excel add in I mentioned, but your mileage may vary.
The Power Query fuzzy merge is probably ok for names matching, but isn't very good at identifying similarity for e.g where one string is a keyword and the other string is a longer text to search. The excel add in seems to perform well for both scenarios.
https://numidiabi.wordpress.com/2020/08/24/power-query-table-addfuzzyclustercolumn/
gave me some ideas, but without a "correct" column to match the names against this doesn't work I think.
You can use the standard merge data interface and select the same table for both sides of the join using the fuzzy match option.
Good idea, but the resulting merge query (I have 717k lines in the table I need to check) wasn't done even after letting it run for ten minutes.
Ouch, yep, that will take a while.
One suggestion, if your customer names are in a fact table, try duplicating the query, remove columns other than the customer name (and any other customer ID if you have one e.g email) and deduplicate. I don't know how many times a customer is repeated throughout your data, but that could considerably reduce the amount of records power query has to compare.
The de-duplicate did reduce the lines a lot, but still far from a manageable amount sadly.
I'm trying to manually add in the column - as that seems to work according to another post - according to https://learn.microsoft.com/en-us/powerquery-m/table-addfuzzyclustercolumn , but I didn't have much success yet.
My data source is mysql that gets directly imported into Power BI, so Excel isn't really an option sadly.
Hi @zahlenschubser,
You can take a look at the following blog about string comparison from SQL to Dax if help with you scenarios:
From SQL to DAX: String Comparison
Regards,
Xiaoxin Sheng
I tried to use the ideas from that one, but the necessary granularity makes the resulting query take way too long sadly.
Like this for example. The first and last two are pretty sure the same person, but last and first name(s) are written differently. It's also often a combination of typos, missing letters, or swapped letters, like v <> w.
March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!
Arun Ulag shares exciting details about the Microsoft Fabric Conference 2025, which will be held in Las Vegas, NV.
User | Count |
---|---|
114 | |
76 | |
57 | |
52 | |
44 |
User | Count |
---|---|
165 | |
116 | |
63 | |
57 | |
50 |