Power BI is turning 10! Tune in for a special live episode on July 24 with behind-the-scenes stories, product evolution highlights, and a sneak peek at what’s in store for the future.
Save the dateEnhance your career with this limited time 50% discount on Fabric and Power BI exams. Ends August 31st. Request your voucher.
Hi, I am trying to calculate the Levenshtein distance between two text values. There's no built in way to do this that I know of, so I rolled my own function but it is very slow. It requires iterating through a list (really a matrix) and changing the elements one at a time while referencing previous elements at each step.
Is there a way I can make this more efficient? The code I'm using is below, with both List.Generate and List.Accumulate versions (the latter is the version commented out).
let
levendist = (a as text, b as text) as number =>
let
lena = Text.Length(a),
lenb = Text.Length(b),
olist = List.Buffer(List.Generate(
() => [i=0,j=0,v=0],
each ((lena+1)*[j] + [i]) < ((lena+1) * (lenb+1)),
each [ i = if [i]+1 = (lena+1) then 0 else [i]+1,
j = if [i]+1 = (lena+1) then [j]+1 else [j],
v = if [j] = 0 and [i]+1 <= lena then [i]+1
else if [i] = lena then [j]+1
else 0 ])),
ulist = List.Generate(
() => [i=0, l=olist],
each [i] <= List.Count(olist),
(x) =>
let
curlist = x[l],
current = curlist{x[i]},
next = List.ReplaceRange(
curlist,
x[i],
1,
{
if current[i] = 0 or current[j] = 0 then current[v]
else List.Min({
curlist{current[i] - 1 + (current[j] * (lena+1))} + 1,
curlist{current[i] + ((current[j] - 1) * (lena+1))} + 1,
if Text.Upper(Text.At(a,current[i] - 1)) = Text.Upper(Text.At(b,current[j] - 1))
then curlist{current[i] - 1 + ((current[j] - 1) * (lena+1))}
else curlist{current[i] - 1 + ((current[j] - 1) * (lena+1))} + 1
})
}
)
in
[i=x[i]+1, l=next],
each [l]
)
/*ulist = List.Accumulate(olist, {},
(state, current) =>
List.Combine({
state,
{
if current[i] = 0 or current[j] = 0 then current[v]
else List.Min({
state{current[i] - 1 + (current[j] * (lena+1))} + 1,
state{current[i] + ((current[j] - 1) * (lena+1))} + 1,
if Text.Upper(Text.At(a,current[i] - 1)) = Text.Upper(Text.At(b,current[j] - 1))
then state{current[i] - 1 + ((current[j] - 1) * (lena+1))}
else state{current[i] - 1 + ((current[j] - 1) * (lena+1))} + 1
})
}
})
)*/
in
List.Last(List.Last(ulist))
// List.Last(ulist)
in
levendist
(Is there a way to paste code here without the alignment having to be manually fixed?)
Edited to add: here are some results of the function
Text A | Text B | Expected Result |
Kitten | Sitting | 3 |
Saturday | Sunday | 3 |
Moop | Meep | 2 |
Hello | Goodbye | 7 |
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45W8s4sKUnNU9JRCgYyMvPSgSxjpVidaKXgxJLSopTESpBUaR6EAZHxzc8vAHJ8U1NBlBFYzCM1JycfyHPPz09JqkwFssyVYmMB", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Text A" = _t, #"Text B" = _t, #"Expected Result" = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Text A", type text}, {"Text B", type text}, {"Expected Result", Int64.Type}})
in
#"Changed Type"
Solved! Go to Solution.
I've found this answer here:
"I would strongly recommend you to not use PowerQuery or VBA for this. There are much much much better libraries in both R and Python for implementing this methodology."
Text A | Text B | Result |
Kitten | Sitting | 3 |
Saturday | Sunday | 3 |
Moop | Meep | 2 |
Hello | Goodbye | 7 |
Results of the current version are fine - it's just speed that I'm not happy with.
I'm a little confused - the table I provided is the sample data (plus the result of Table.AddColumn(Source, "Result", each LevenshteinDist([Text A],[Text B])) ).
Edit:
If you prefer, there's this:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45W8s4sKUnNU9JRCgYyMvPSlWJ1opWCE0tKi1ISK0HCpXkgBkjUNz+/ACjim5paAOZ7pObk5AMF3PPzU5IqU5ViYwE=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Text A" = _t, #"Text B" = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Text A", type text}, {"Text B", type text}}),
#"Added Custom" = Table.AddColumn(#"Changed Type", "Result", each LevenshteinDist([Text A],[Text B]))
in
#"Added Custom"
My data consists of rows by company, reference, date and group. Some of the rows are missing the group, and I'm trying to populate this by finding earlier rows (by date) with a similar reference within the same company.
To do this I first do a nested self join on company, then I filter each nested table to remove dates after the current row's date, and rows missing the group. Then, for each row missing the group I calculate the Levenshtein distance between its reference and the reference of each row of the associated nested table. Finally I decide which row has the lowest distance, and keep that row (taking the first in case of ties).
I can't do this using a fuzzy merge because of the condition of matching by company and removing later dates. I also found clustering didn't help because the similarity algorithm that it uses doesn't seem to be granular enough (same would go for merging I guess).
I am blocked from uploading data to file sharing sites, but the following script covers the entire process and includes sample data. LevenshteinDist is the custom function from the original post.
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("jVNNc4IwEP0rGc/WfKAix7anTnvRHnpwPGTiFlMlMBC09td3IYRGcZwe2EmWfW/f2yTr9ehxNB4tl3E04aSAssoNsWAqnRvMC8ZnlEU04s2mC5uxA63ehZhEJKsrrUgmjUyh7DFTGjHc8NgFj3l75YxPYqLN50FabEJUnmXaWgAHjSmbUyFwM41d8NArjTrdWQdJKIudwkt9PSAvDnBhSnDKuBN4z5SWXbmgTFCx+JcfCzJzqKixwjpdnZEnXOqsKPMjZGAssflpMGceueDrV1DU1pEX8tzAglG1JpL28/UvAb/S9txX/00pYA+qK5QTIBa963t6+jOIenYn5bnrywV5ICe9TcHiLdn7OxJTLhxiNnMhAC3YvAdVuKqLrbSwJUd5qKFX17m/hW+bVuhM7bGtKutqByWmDJwIZre1sp7Gy4iFCwOaD6e9OVjcSZPbhkvtpEnhmmSgJLmtBHOY+gLVuNIGHwH4Qd7RkwxmGcymxFhdkwxnIxjjgaJmut/nH9n9a9/F7e7Ie2kl95jAfIPY/AI=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Company = _t, Reference = _t, Date = _t, Group = _t, #"Correct group" = _t]),
#"Replaced Value" = Table.ReplaceValue(Source,"",null,Replacer.ReplaceValue,{"Group"}),
#"Changed Type" = Table.TransformColumnTypes(#"Replaced Value",{{"Company", type text}, {"Reference", type text}, {"Date", type date}, {"Group", type nullable text}, {"Correct group", type nullable text}}),
#"Filter for rows missing group" = Table.SelectRows(#"Changed Type", each ([Group] = null)),
#"Self join" = Table.NestedJoin(#"Filter for rows missing group", "Company", #"Changed Type", "Company", "AllData"),
#"Filter for usable rows" = Table.AddColumn(#"Self join", "AllData filtered", each Table.SelectRows(_[AllData], (x) => (x[Date] < _[Date]) and (x[Group] <> null))),
#"Push reference to nested table" = Table.AddColumn(#"Filter for usable rows", "AllData with ref", each Table.AddColumn(_[AllData filtered], "Outer ref", (x) => _[Reference])),
#"Get Levenshtein Distance" = Table.TransformColumns(#"Push reference to nested table", {{"AllData with ref", each Table.AddColumn(_, "LD", (x) => LevenshteinDist(x[Reference], x[Outer ref]))}}),
#"Get best group" = Table.TransformColumns(#"Get Levenshtein Distance", {{"AllData with ref", each
let
m = List.Min(_[LD]),
filter = Table.SelectRows(_, (x) => x[LD] = m),
v = filter{0}[Group]
in
v}})
in
#"Get best group"
I've found this answer here:
"I would strongly recommend you to not use PowerQuery or VBA for this. There are much much much better libraries in both R and Python for implementing this methodology."
The sample data is a CSV or .XLSX that is called upon as the Source in your work in progress .PBIX
Please provide your work-in-progress Power BI Desktop file (with sensitive information removed) that covers your issue or question completely in a usable format (not as a screenshot). You can upload the PBIX file to a cloud storage service such as OneDrive, Google Drive, Dropbox, or to a Github repository, and then share a file’s URL.
https://community.fabric.microsoft.com/t5/Community-Blog/How-to-provide-sample-data-in-the-Power-BI-...
Please show the expected outcome based on the sample data you provided.
https://community.fabric.microsoft.com/t5/Desktop/How-to-Get-Your-Question-Answered-Quickly/m-p/1447...
This allows members of the Forum to assess the state of the model, report layer, relationships, and any DAX applied.
This data covers it:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45W8s4sKUnNU9JRCgYyMvPSgSxjpVidaKXgxJLSopTESpBUaR6EAZHxzc8vAHJ8U1NBlBFYzCM1JycfyHPPz09JqkwFssyVYmMB", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Text A" = _t, #"Text B" = _t, #"Expected Result" = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Text A", type text}, {"Text B", type text}, {"Expected Result", Int64.Type}})
in
#"Changed Type"
Check out the July 2025 Power BI update to learn about new features.