Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Be one of the first to start using Fabric Databases. View on-demand sessions with database experts and the Microsoft product team to learn just how easy it is to get started. Watch now

Reply
zahlenschubser
Helper III
Helper III

LIKE operator & customer name analysis

I have a rather big table with customer first and last names and need to check if they are similar, so I can then check if they are actually different people or just entered with typos.

 

If I do left() and right() with low numbers on the first and last name columns - as there is no LIKE operator in PBI and so I can hopefully catch most of the entries I want to filter for - the query takes way too long to complete because it has to go through so many entries and variations to evaluate.

 

Any ideas?

14 REPLIES 14
zahlenschubser
Helper III
Helper III

Ok, I think I found a way to work around this, in a way, with the levenshtein likeness calculation.

I managed to implement it as a custom function in the query editor as
---
= (s1 as nullable text, s2 as nullable text) =>
if s1 = null or s2 = null then "" else
let
l1 = Text.Length(s1),
l2 = Text.Length(s2),
d = List.Generate(
() => [i=0, j=0, cost=0],
each [i] <= l1,
each if [j] = l2 then [i]+1 else null,
each
if [i] = 0 then
[i=1, j=1, cost=if s1{0} = s2{0} then 0 else 1]
else if s1{[i]-1} = s2{[j]-1} then
[i=[i]+1, j=[j]+1, cost=[cost]]
else
let
ins = [i=[i], j=[j]+1, cost=[cost]+1],
del = [i=[i]+1, j=[j], cost=[cost]+1],
sub = [i=[i]+1, j=[j]+1, cost=[cost]+1]
in
List.Min({ins, del, sub}, each [i]<=l1 and [j]<=l2)
)
in
List.Last(d)[cost]
---

but I am unsure how to actually use it to add a new calculated column, as I need to compare the value in [name] to the earlier entry(entries) in the same column, filtered by a shared value in a different column.

I tried sorting the whole thing by the filtering column and then use
= Table.AddColumn(#"Hinzugefügter Index", "Name_likeness", each LSTEIN([name], List.Range([name],[Index]-1,1)))
but that got me an error message.

halfglassdarkly
Responsive Resident
Responsive Resident

You could also try out using Fuzzy Merge or Cluster values in Power Query, see https://learn.microsoft.com/en-us/power-query/fuzzy-matching 

 

I've had mixed success for that myself, but if using Excel isn't too far out of scope for you I always get reliable fuzzy matching results using the free Fuzzy Lookup add in for Excel https://www.microsoft.com/en-nz/download/details.aspx?id=15011 

I don't see the cluster option in my menu.

zahlenschubser_0-1679988594047.png

 

Oh sorry, I haven't used cluster values myself, just saw the cluster values feature is available only for Power Query Online.

 

The fuzzy merge option is available in vanilla Power Query when  merging queries though. I haven't found it very reliable compared to the excel add in I mentioned, but your mileage may vary.

 

The Power Query fuzzy merge is probably ok for names matching, but isn't very good at identifying similarity for e.g where one string is a keyword and the other string is a longer text to search. The excel add in seems to perform well for both scenarios.

https://numidiabi.wordpress.com/2020/08/24/power-query-table-addfuzzyclustercolumn/

gave me some ideas, but without a "correct" column to match the names against this doesn't work I think.

You can use the standard merge data interface and select the same table for both sides of the join using the fuzzy match option.

Good idea, but the resulting merge query (I have 717k lines in the table I need to check) wasn't done even after letting it run for ten minutes.

Ouch, yep, that will take a while.

One suggestion, if your customer names are in a fact table, try duplicating the query, remove columns other than the customer name (and any other customer ID if you have one e.g email) and deduplicate. I don't know how many times a customer is repeated throughout your data, but that could considerably reduce the amount of records power query has to compare.

The de-duplicate did reduce the lines a lot, but still far from a manageable amount sadly.

I'm trying to manually add in the column - as that seems to work according to another post - according to https://learn.microsoft.com/en-us/powerquery-m/table-addfuzzyclustercolumn , but I didn't have much success yet.

 

My data source is mysql that gets directly imported into Power BI, so Excel isn't really an option sadly.

v-shex-msft
Community Support
Community Support

Hi @zahlenschubser,

You can take a look at the following blog about string comparison from SQL to Dax if help with you scenarios:

From SQL to DAX: String Comparison 

Regards,

Xiaoxin Sheng

Community Support Team _ Xiaoxin
If this post helps, please consider accept as solution to help other members find it more quickly.

I tried to use the ideas from that one, but the necessary granularity makes the resulting query take way too long sadly.

rajulshah
Memorable Member
Memorable Member

@zahlenschubser Can you please provide any sample data?

Like this for example. The first and last two are pretty sure the same person, but last and first name(s) are written differently. It's also often a combination of typos, missing letters, or swapped letters, like v <> w.

zahlenschubser_1-1679565553371.png

 

Helpful resources

Announcements
Las Vegas 2025

Join us at the Microsoft Fabric Community Conference

March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!

ArunFabCon

Microsoft Fabric Community Conference 2025

Arun Ulag shares exciting details about the Microsoft Fabric Conference 2025, which will be held in Las Vegas, NV.

December 2024

A Year in Review - December 2024

Find out what content was popular in the Fabric community during 2024.