Solved: Efficiently iterate over and modify a list

MBZA · ‎07-08-2024

Hi, I am trying to calculate the Levenshtein distance between two text values. There's no built in way to do this that I know of, so I rolled my own function but it is very slow. It requires iterating through a list (really a matrix) and changing the elements one at a time while referencing previous elements at each step.

Is there a way I can make this more efficient? The code I'm using is below, with both List.Generate and List.Accumulate versions (the latter is the version commented out).

let
    levendist = (a as text, b as text) as number =>
    let 
        lena = Text.Length(a),
        lenb = Text.Length(b),
        olist = List.Buffer(List.Generate(
            () => [i=0,j=0,v=0], 
            each ((lena+1)*[j] + [i]) < ((lena+1) * (lenb+1)),
            each [  i = if [i]+1 = (lena+1) then 0 else [i]+1,
                    j = if [i]+1 = (lena+1) then [j]+1 else [j],
                    v = if [j] = 0 and [i]+1 <= lena then [i]+1
                        else if [i] = lena then [j]+1 
                        else 0 ])),
        ulist = List.Generate(
            () => [i=0, l=olist],
            each [i] <= List.Count(olist),
            (x) => 
                let 
                    curlist = x[l],
                    current = curlist{x[i]},
                    next = List.ReplaceRange(
                        curlist,
                        x[i],
                        1,
                        {
                            if current[i] = 0 or current[j] = 0 then current[v]
                            else List.Min({
                                    curlist{current[i] - 1 + (current[j] * (lena+1))} + 1,
                                    curlist{current[i] + ((current[j] - 1) * (lena+1))} + 1,
                                    if Text.Upper(Text.At(a,current[i] - 1)) = Text.Upper(Text.At(b,current[j] - 1))
                                        then curlist{current[i] - 1 + ((current[j] - 1) * (lena+1))} 
                                        else curlist{current[i] - 1 + ((current[j] - 1) * (lena+1))} + 1
                            })
                        }
                    )
                in 
                    [i=x[i]+1, l=next],
            each [l]                    
        )
        /*ulist = List.Accumulate(olist, {}, 
            (state, current) => 
                List.Combine({
                    state, 
                    {
                        if current[i] = 0 or current[j] = 0 then current[v]
                        else List.Min({
                                state{current[i] - 1 + (current[j] * (lena+1))} + 1,
                                state{current[i] + ((current[j] - 1) * (lena+1))} + 1,
                                if Text.Upper(Text.At(a,current[i] - 1)) = Text.Upper(Text.At(b,current[j] - 1))
                                    then state{current[i] - 1 + ((current[j] - 1) * (lena+1))} 
                                    else state{current[i] - 1 + ((current[j] - 1) * (lena+1))} + 1
                            })
                    }   
                })
            )*/
    in 
        List.Last(List.Last(ulist))
		// List.Last(ulist)
in
    levendist

(Is there a way to paste code here without the alignment having to be manually fixed?)

Edited to add: here are some results of the function

Text A	Text B	Expected Result
Kitten	Sitting	3
Saturday	Sunday	3
Moop	Meep	2
Hello	Goodbye	7

let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45W8s4sKUnNU9JRCgYyMvPSgSxjpVidaKXgxJLSopTESpBUaR6EAZHxzc8vAHJ8U1NBlBFYzCM1JycfyHPPz09JqkwFssyVYmMB", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Text A" = _t, #"Text B" = _t, #"Expected Result" = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Text A", type text}, {"Text B", type text}, {"Expected Result", Int64.Type}})
in
#"Changed Type"

dufoq3 · ‎07-08-2024

I've found this answer here:

"I would strongly recommend you to not use PowerQuery or VBA for this. There are much much much better libraries in both R and Python for implementing this methodology."

Note: Check this link to learn how to use my query.
Check this link if you don't know how to provide sample data.

View solution in original post

dufoq3 · ‎07-08-2024

Hi @MBZA, could you provide sample data and expected result please?

Note: Check this link to learn how to use my query.
Check this link if you don't know how to provide sample data.

MBZA · ‎07-08-2024

Text A	Text B	Result
Kitten	Sitting	3
Saturday	Sunday	3
Moop	Meep	2
Hello	Goodbye	7

Results of the current version are fine - it's just speed that I'm not happy with.

dufoq3 · ‎07-08-2024

What about sample data? 🙂

Note: Check this link to learn how to use my query.
Check this link if you don't know how to provide sample data.

MBZA · ‎07-08-2024

I'm a little confused - the table I provided is the sample data (plus the result of Table.AddColumn(Source, "Result", each LevenshteinDist([Text A],[Text B])) ).

Edit:

If you prefer, there's this:

let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45W8s4sKUnNU9JRCgYyMvPSlWJ1opWCE0tKi1ISK0HCpXkgBkjUNz+/ACjim5paAOZ7pObk5AMF3PPzU5IqU5ViYwE=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Text A" = _t, #"Text B" = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Text A", type text}, {"Text B", type text}}),
#"Added Custom" = Table.AddColumn(#"Changed Type", "Result", each LevenshteinDist([Text A],[Text B]))
in
#"Added Custom"

dufoq3 · ‎07-08-2024

OK, I see, but you should explain what exactly you want to achieve (and how). I see the result but maybe there is another way to achieve same result more efficiently.

Note: Check this link to learn how to use my query.
Check this link if you don't know how to provide sample data.

MBZA · ‎07-08-2024

My data consists of rows by company, reference, date and group. Some of the rows are missing the group, and I'm trying to populate this by finding earlier rows (by date) with a similar reference within the same company.

To do this I first do a nested self join on company, then I filter each nested table to remove dates after the current row's date, and rows missing the group. Then, for each row missing the group I calculate the Levenshtein distance between its reference and the reference of each row of the associated nested table. Finally I decide which row has the lowest distance, and keep that row (taking the first in case of ties).

I can't do this using a fuzzy merge because of the condition of matching by company and removing later dates. I also found clustering didn't help because the similarity algorithm that it uses doesn't seem to be granular enough (same would go for merging I guess).

I am blocked from uploading data to file sharing sites, but the following script covers the entire process and includes sample data. LevenshteinDist is the custom function from the original post.

let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("jVNNc4IwEP0rGc/WfKAix7anTnvRHnpwPGTiFlMlMBC09td3IYRGcZwe2EmWfW/f2yTr9ehxNB4tl3E04aSAssoNsWAqnRvMC8ZnlEU04s2mC5uxA63ehZhEJKsrrUgmjUyh7DFTGjHc8NgFj3l75YxPYqLN50FabEJUnmXaWgAHjSmbUyFwM41d8NArjTrdWQdJKIudwkt9PSAvDnBhSnDKuBN4z5SWXbmgTFCx+JcfCzJzqKixwjpdnZEnXOqsKPMjZGAssflpMGceueDrV1DU1pEX8tzAglG1JpL28/UvAb/S9txX/00pYA+qK5QTIBa963t6+jOIenYn5bnrywV5ICe9TcHiLdn7OxJTLhxiNnMhAC3YvAdVuKqLrbSwJUd5qKFX17m/hW+bVuhM7bGtKutqByWmDJwIZre1sp7Gy4iFCwOaD6e9OVjcSZPbhkvtpEnhmmSgJLmtBHOY+gLVuNIGHwH4Qd7RkwxmGcymxFhdkwxnIxjjgaJmut/nH9n9a9/F7e7Ie2kl95jAfIPY/AI=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Company = _t, Reference = _t, Date = _t, Group = _t, #"Correct group" = _t]),
#"Replaced Value" = Table.ReplaceValue(Source,"",null,Replacer.ReplaceValue,{"Group"}),
#"Changed Type" = Table.TransformColumnTypes(#"Replaced Value",{{"Company", type text}, {"Reference", type text}, {"Date", type date}, {"Group", type nullable text}, {"Correct group", type nullable text}}),
#"Filter for rows missing group" = Table.SelectRows(#"Changed Type", each ([Group] = null)),
#"Self join" = Table.NestedJoin(#"Filter for rows missing group", "Company", #"Changed Type", "Company", "AllData"),
#"Filter for usable rows" = Table.AddColumn(#"Self join", "AllData filtered", each Table.SelectRows(_[AllData], (x) => (x[Date] < _[Date]) and (x[Group] <> null))),
#"Push reference to nested table" = Table.AddColumn(#"Filter for usable rows", "AllData with ref", each Table.AddColumn(_[AllData filtered], "Outer ref", (x) => _[Reference])),
#"Get Levenshtein Distance" = Table.TransformColumns(#"Push reference to nested table", {{"AllData with ref", each Table.AddColumn(_, "LD", (x) => LevenshteinDist(x[Reference], x[Outer ref]))}}),
#"Get best group" = Table.TransformColumns(#"Get Levenshtein Distance", {{"AllData with ref", each 
let 
m = List.Min(_[LD]),
filter = Table.SelectRows(_, (x) => x[LD] = m),
v = filter{0}[Group]
in 
v}})

in
#"Get best group"

dufoq3 · ‎07-08-2024

I've found this answer here:

"I would strongly recommend you to not use PowerQuery or VBA for this. There are much much much better libraries in both R and Python for implementing this methodology."

Note: Check this link to learn how to use my query.
Check this link if you don't know how to provide sample data.

foodd · ‎07-08-2024

The sample data is a CSV or .XLSX that is called upon as the Source in your work in progress .PBIX

Please provide your work-in-progress Power BI Desktop file (with sensitive information removed) that covers your issue or question completely in a usable format (not as a screenshot). You can upload the PBIX file to a cloud storage service such as OneDrive, Google Drive, Dropbox, or to a Github repository, and then share a file’s URL.

https://community.fabric.microsoft.com/t5/Community-Blog/How-to-provide-sample-data-in-the-Power-BI-...

Please show the expected outcome based on the sample data you provided.

https://community.fabric.microsoft.com/t5/Desktop/How-to-Get-Your-Question-Answered-Quickly/m-p/1447...

This allows members of the Forum to assess the state of the model, report layer, relationships, and any DAX applied.

MBZA · ‎07-08-2024

This data covers it:

let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45W8s4sKUnNU9JRCgYyMvPSgSxjpVidaKXgxJLSopTESpBUaR6EAZHxzc8vAHJ8U1NBlBFYzCM1JycfyHPPz09JqkwFssyVYmMB", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Text A" = _t, #"Text B" = _t, #"Expected Result" = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Text A", type text}, {"Text B", type text}, {"Expected Result", Int64.Type}})
in
#"Changed Type"

Efficiently iterate over and modify a list

Helpful resources

Fabric Community Update - July 2025

Power BI Monthly Update - July 2025

Join our Fabric User Panel

Party with Power BI’s own Guy in a Cube

Efficiently iterate over and modify a list

Helpful resources

Fabric Community Update - July 2025

Power BI Monthly Update - July 2025

Join our Fabric User Panel