Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

The Power BI Data Visualization World Championships is back! Get ahead of the game and start preparing now! Learn more

Reply
bigk
Helper II
Helper II

Row level similarity rate on multiple columns

Hello

 

I have a table with ID and 3 name columns. I would like to calculate a similarity rate for each pair of columns, e.g. 1-2, 1-3, 2-3. I've looked-up various videos, forums and GPT and almost found the solution but it does not work how i expected it to work. 

It seems that it compares column values not row values. 

 

bigk_1-1738135515571.png

I've found a post with similar problem where the solution is to pre-filter data before applying fuzzy match.

Calculate row similarity / fuzzy matching - Microsoft Fabric Community

 

I'm not sure if this would help. Or maybe the whole logic should be changed.

Here is the query.

 

let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUfLKz8hTcEJmOCmAmEqxOtFKRkCub2JRpYKbHpDlpqcA5GSmQgRBoqV6YGVo5kBpf5AhsQA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Name1 = _t, Name2 = _t, Name3 = _t]),
#"Changed Type4" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Name1", type text}, {"Name2", type text}, {"Name3", type text}}),

// Duplicate the table for fuzzy merging and compare
DuplicateTable = #"Changed Type4",
// Merge ALLIN and MD using fuzzy matching
#"Fuzzy1-2" = Table.FuzzyNestedJoin(#"Changed Type4", {"Name1"}, DuplicateTable, {"Name2"}, "FuzzyMatch_1_2", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, NumberOfMatches=1, SimilarityColumnName = "Similarity 1-2", Threshold=0.2]),
#"Expanded FuzzyMatch_1_2" = Table.ExpandTableColumn(#"Fuzzy1-2", "FuzzyMatch_1_2", {"Similarity 1-2"}, {"Similarity 1-2"}),
// Merge ALLIN and Portfolio
#"Fuzzy1-3" = Table.FuzzyNestedJoin(#"Expanded FuzzyMatch_1_2", {"Name1"}, DuplicateTable, {"Name3"}, "FuzzyMatch_1_3", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, NumberOfMatches=1, SimilarityColumnName = "Similarity 1-3", Threshold=0.2]),
#"Expanded FuzzyMatch_1_3" = Table.ExpandTableColumn(#"Fuzzy1-3", "FuzzyMatch_1_3", {"Similarity 1-3"}, {"Similarity 1-3"}),
// Merge Masterdata and portfolio
#"Fuzzy2-3" = Table.FuzzyNestedJoin(#"Expanded FuzzyMatch_1_3", {"Name2"}, DuplicateTable, {"Name3"}, "FuzzyMatch_2_3", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, NumberOfMatches=1, SimilarityColumnName = "Similarity 2-3", Threshold=0.2]),
#"Expanded FuzzyMatch_2_3" = Table.ExpandTableColumn(#"Fuzzy2-3", "FuzzyMatch_2_3", {"Similarity 2-3"}, {"Similarity 2-3"})

in
#"Expanded FuzzyMatch_2_3"

 

Thanks for help.

 

 

 

1 ACCEPTED SOLUTION

@bigk 

Jaccard Similarity only considers character sets, so it ignores letter order. That’s why you’re seeing a similarity of 1 when words contain the same letters but in different positions.

Since letter position matters, try to use the bigram similarity:

 

let
Source = Table.FromRows(
Json.Document(
Binary.Decompress(
Binary.FromText("i45WMlTSUfLKz8hTcEJmOCmAmEqxOtFKRkCub2JRpYKbHpDlpqcA5GSmQgRBoqV6YGVo5kBpf5AhsQA=", BinaryEncoding.Base64),
Compression.Deflate
)
),
let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Name1 = _t, Name2 = _t, Name3 = _t]
),
#"Changed Type" = Table.TransformColumnTypes(Source, {{"ID", Int64.Type}, {"Name1", type text}, {"Name2", type text}, {"Name3", type text}}),

// Function to generate bigrams (2-letter sequences)
GenerateBigrams = (text as nullable text) as list =>
let
cleanText = Text.Lower(Text.Trim(if text = null then "" else text)),
chars = Text.ToList(cleanText),
bigrams = List.Transform({0..List.Count(chars)-2}, each Text.Combine(List.FirstN(List.Skip(chars, _), 2)))
in
if List.Count(chars) < 2 then {cleanText} else bigrams,

// Function to compute bigram similarity
BigramSimilarity = (text1 as nullable text, text2 as nullable text) as number =>
let
bigrams1 = GenerateBigrams(text1),
bigrams2 = GenerateBigrams(text2),
intersection = List.Intersect({bigrams1, bigrams2}),
union = List.Distinct(List.Combine({bigrams1, bigrams2})),
similarity = if List.Count(union) = 0 then 0 else Number.Round(List.Count(intersection) / List.Count(union), 2)
in
similarity,

// Compute similarity for each pair of columns
#"Added Similarity 1-2" = Table.AddColumn(#"Changed Type", "Similarity 1-2", each BigramSimilarity([Name1], [Name2]), type number),
#"Added Similarity 1-3" = Table.AddColumn(#"Added Similarity 1-2", "Similarity 1-3", each BigramSimilarity([Name1], [Name3]), type number),
#"Added Similarity 2-3" = Table.AddColumn(#"Added Similarity 1-3", "Similarity 2-3", each BigramSimilarity([Name2], [Name3]), type number)
in
#"Added Similarity 2-3"

 

BBF

View solution in original post

15 REPLIES 15
bigk
Helper II
Helper II

Solution accepted. I just wanted to do that at the very end. Here is my full statement:

let
Source = #"Key info - Basic info",
#"Removed Other Columns" = Table.SelectColumns(Source,{"EntityId"}),


//Append Portfolio Excel
#"Load_Portfolio_Excel has All In ID" = #"Project portfolio 20-24 excel",
SelectedColumns1 = Table.SelectColumns(#"Load_Portfolio_Excel has All In ID", {"All in ID"}),
#"Changed Type1" = Table.TransformColumnTypes(SelectedColumns1,{{"All in ID", type text}}),
#"Renamed Columns1" = Table.RenameColumns(#"Changed Type1",{{"All in ID", "EntityId"}}),
#"Appended Query1" = Table.Combine({#"Removed Other Columns", #"Renamed Columns1"}),
#"Removed Duplicates1" = Table.Distinct(#"Appended Query1"),


// Merged with OneFico costs table
#"Load OneFico Costs" = #"ICT costs OneFico",
SelectedColumns2 = Table.SelectColumns(#"Load OneFico Costs", {"All in ID"}),
#"Changed Type2" = Table.TransformColumnTypes(SelectedColumns2,{{"All in ID", type text}}),
#"Renamed Columns2" = Table.RenameColumns(#"Changed Type2",{{"All in ID", "EntityId"}}),
#"Removed Duplicates5" = Table.Distinct(#"Renamed Columns2"),
#"Appended Query2" = Table.Combine({#"Removed Duplicates1", #"Removed Duplicates5"}),
#"Removed Duplicates2" = Table.Distinct(#"Appended Query2"),
// Merged with Master file Main sheet
#"Load main master sheet" = #"Main master sheet",
SelectedColumns3 = Table.SelectColumns(#"Load main master sheet", {"All In"}),
#"Changed Type3" = Table.TransformColumnTypes(SelectedColumns3,{{"All In", type text}}),
#"Renamed Columns3" = Table.RenameColumns(#"Changed Type3",{{"All In", "EntityId"}}),
#"Removed Duplicates" = Table.Distinct(#"Renamed Columns3"),
#"Appended Query3" = Table.Combine({#"Removed Duplicates2", #"Removed Duplicates"}),
#"Removed Duplicates3" = Table.Distinct(#"Appended Query3"),

//Merge all tables returning names
//Merge with Portfolio excel
#"Merged Portfolio xl" = Table.NestedJoin(#"Removed Duplicates3", {"EntityId"}, #"Project portfolio 20-24 excel", {"All in ID"}, "Project portfolio 20-24 excel", JoinKind.LeftOuter),
#"Expanded Project portfolio 20-24 excel" = Table.ExpandTableColumn(#"Merged Portfolio xl", "Project portfolio 20-24 excel", {"Project name"}, {"PortfolioXL.Name"}),
#"Remove duplicates" = Table.Distinct(#"Expanded Project portfolio 20-24 excel"),
//Merge with Main master sheet
MergedMasterSheet = Table.NestedJoin(#"Remove duplicates", {"EntityId"}, #"Main master sheet", {"All In"}, "MasterdataSheet", JoinKind.LeftOuter),
#"Expanded MasterdataSheet" = Table.ExpandTableColumn(MergedMasterSheet, "MasterdataSheet", {"Name in SAP"}, {"MasterdataSheet.Name"}),

//Merge with All In names
MergedAllin = Table.NestedJoin(#"Expanded MasterdataSheet", {"EntityId"}, #"Key info - Basic info", {"EntityId"}, "Key info - Basic info", JoinKind.LeftOuter),
#"Expanded Key info - Basic info" = Table.ExpandTableColumn(MergedAllin, "Key info - Basic info", {"Project name"}, {"All In.Name"}),
#"Uppercased Text" = Table.TransformColumns(#"Expanded Key info - Basic info",{{"PortfolioXL.Name", Text.Upper, type text}, {"MasterdataSheet.Name", Text.Upper, type text}, {"All In.Name", Text.Upper, type text}}),
#"Removed Duplicates4" = Table.Distinct(#"Uppercased Text"),
#"Changed Type4" = Table.TransformColumnTypes(#"Removed Duplicates4",{{"PortfolioXL.Name", type text}, {"MasterdataSheet.Name", type text}, {"All In.Name", type text}})
in
#"Changed Type4"

EntityId = ID
All In.Name = Name1

MasterdataSheet.Name = Name2
PortfolioXL.Name = Name3

 

thanks

@bigk it should be:

 

let
Source = #"Key info - Basic info",
#"Removed Other Columns" = Table.SelectColumns(Source,{"EntityId"}),

// Append Portfolio Excel
#"Load_Portfolio_Excel has All In ID" = #"Project portfolio 20-24 excel",
SelectedColumns1 = Table.SelectColumns(#"Load_Portfolio_Excel has All In ID", {"All in ID"}),
#"Changed Type1" = Table.TransformColumnTypes(SelectedColumns1,{{"All in ID", type text}}),
#"Renamed Columns1" = Table.RenameColumns(#"Changed Type1",{{"All in ID", "EntityId"}}),
#"Appended Query1" = Table.Combine({#"Removed Other Columns", #"Renamed Columns1"}),
#"Removed Duplicates1" = Table.Distinct(#"Appended Query1"),

// Merged with OneFico costs table
#"Load OneFico Costs" = #"ICT costs OneFico",
SelectedColumns2 = Table.SelectColumns(#"Load OneFico Costs", {"All in ID"}),
#"Changed Type2" = Table.TransformColumnTypes(SelectedColumns2,{{"All in ID", type text}}),
#"Renamed Columns2" = Table.RenameColumns(#"Changed Type2",{{"All in ID", "EntityId"}}),
#"Removed Duplicates5" = Table.Distinct(#"Renamed Columns2"),
#"Appended Query2" = Table.Combine({#"Removed Duplicates1", #"Removed Duplicates5"}),
#"Removed Duplicates2" = Table.Distinct(#"Appended Query2"),

// Merged with Master file Main sheet
#"Load main master sheet" = #"Main master sheet",
SelectedColumns3 = Table.SelectColumns(#"Load main master sheet", {"All In"}),
#"Changed Type3" = Table.TransformColumnTypes(SelectedColumns3,{{"All In", type text}}),
#"Renamed Columns3" = Table.RenameColumns(#"Changed Type3",{{"All In", "EntityId"}}),
#"Removed Duplicates" = Table.Distinct(#"Renamed Columns3"),
#"Appended Query3" = Table.Combine({#"Removed Duplicates2", #"Removed Duplicates"}),
#"Removed Duplicates3" = Table.Distinct(#"Appended Query3"),

// Merge all tables returning names
#"Merged Portfolio xl" = Table.NestedJoin(#"Removed Duplicates3", {"EntityId"}, #"Project portfolio 20-24 excel", {"All in ID"}, "Project portfolio 20-24 excel", JoinKind.LeftOuter),
#"Expanded Project portfolio 20-24 excel" = Table.ExpandTableColumn(#"Merged Portfolio xl", "Project portfolio 20-24 excel", {"Project name"}, {"PortfolioXL.Name"}),
#"Remove duplicates" = Table.Distinct(#"Expanded Project portfolio 20-24 excel"),

// Merge with Main master sheet
MergedMasterSheet = Table.NestedJoin(#"Remove duplicates", {"EntityId"}, #"Main master sheet", {"All In"}, "MasterdataSheet", JoinKind.LeftOuter),
#"Expanded MasterdataSheet" = Table.ExpandTableColumn(MergedMasterSheet, "MasterdataSheet", {"Name in SAP"}, {"MasterdataSheet.Name"}),

// Merge with All In names
MergedAllin = Table.NestedJoin(#"Expanded MasterdataSheet", {"EntityId"}, #"Key info - Basic info", {"EntityId"}, "Key info - Basic info", JoinKind.LeftOuter),
#"Expanded Key info - Basic info" = Table.ExpandTableColumn(MergedAllin, "Key info - Basic info", {"Project name"}, {"All In.Name"}),

// Uppercase all name columns
#"Uppercased Text" = Table.TransformColumns(#"Expanded Key info - Basic info",{{"PortfolioXL.Name", Text.Upper, type text}, {"MasterdataSheet.Name", Text.Upper, type text}, {"All In.Name", Text.Upper, type text}}),
#"Removed Duplicates4" = Table.Distinct(#"Uppercased Text"),
#"Changed Type4" = Table.TransformColumnTypes(#"Removed Duplicates4",{{"PortfolioXL.Name", type text}, {"MasterdataSheet.Name", type text}, {"All In.Name", type text}}),

// Function to generate bigrams (2-letter sequences)
GenerateBigrams = (text as nullable text) as list =>
let
cleanText = Text.Lower(Text.Trim(if text = null then "" else text)),
chars = Text.ToList(cleanText),
bigrams = List.Transform({0..List.Count(chars)-2}, each Text.Combine(List.FirstN(List.Skip(chars, _), 2)))
in
if List.Count(chars) < 2 then {cleanText} else bigrams,

// Function to compute bigram similarity
BigramSimilarity = (text1 as nullable text, text2 as nullable text) as number =>
let
bigrams1 = GenerateBigrams(text1),
bigrams2 = GenerateBigrams(text2),
intersection = List.Intersect({bigrams1, bigrams2}),
union = List.Distinct(List.Combine({bigrams1, bigrams2})),
similarity = if List.Count(union) = 0 then 0 else Number.Round(List.Count(intersection) / List.Count(union), 2)
in
similarity,

// Combine all three columns (Name1, Name2, Name3) into a single string
CombineNames = Table.AddColumn(#"Changed Type4", "CombinedNames", each Text.Combine({[All In.Name], [MasterdataSheet.Name], [PortfolioXL.Name]}, " "), type text),

// Compute similarity for the combined string (compares the combined names within the same row)
#"Added Combined Similarity" = Table.AddColumn(CombineNames, "CombinedSimilarity", each BigramSimilarity([CombinedNames], [CombinedNames]), type number),

// Optionally, remove the "CombinedNames" column if you don't need it
Result = Table.RemoveColumns(#"Added Combined Similarity", {"CombinedNames"})
in
Result

 

BBF

Thanks for this too. Can you please tell me how the combined similarity result work? Is the 3 max value?
Also can you please add query with the second option with bigram only for pairs of columns. Not combining them.

@bigk Here the code for similarity in pairs:

let
Source = #"Key info - Basic info",
#"Removed Other Columns" = Table.SelectColumns(Source,{"EntityId"}),

// Append Portfolio Excel
#"Load_Portfolio_Excel has All In ID" = #"Project portfolio 20-24 excel",
SelectedColumns1 = Table.SelectColumns(#"Load_Portfolio_Excel has All In ID", {"All in ID"}),
#"Changed Type1" = Table.TransformColumnTypes(SelectedColumns1,{{"All in ID", type text}}),
#"Renamed Columns1" = Table.RenameColumns(#"Changed Type1",{{"All in ID", "EntityId"}}),
#"Appended Query1" = Table.Combine({#"Removed Other Columns", #"Renamed Columns1"}),
#"Removed Duplicates1" = Table.Distinct(#"Appended Query1"),

// Merged with OneFico costs table
#"Load OneFico Costs" = #"ICT costs OneFico",
SelectedColumns2 = Table.SelectColumns(#"Load OneFico Costs", {"All in ID"}),
#"Changed Type2" = Table.TransformColumnTypes(SelectedColumns2,{{"All in ID", type text}}),
#"Renamed Columns2" = Table.RenameColumns(#"Changed Type2",{{"All in ID", "EntityId"}}),
#"Removed Duplicates5" = Table.Distinct(#"Renamed Columns2"),
#"Appended Query2" = Table.Combine({#"Removed Duplicates1", #"Removed Duplicates5"}),
#"Removed Duplicates2" = Table.Distinct(#"Appended Query2"),

// Merged with Master file Main sheet
#"Load main master sheet" = #"Main master sheet",
SelectedColumns3 = Table.SelectColumns(#"Load main master sheet", {"All In"}),
#"Changed Type3" = Table.TransformColumnTypes(SelectedColumns3,{{"All In", type text}}),
#"Renamed Columns3" = Table.RenameColumns(#"Changed Type3",{{"All In", "EntityId"}}),
#"Removed Duplicates" = Table.Distinct(#"Renamed Columns3"),
#"Appended Query3" = Table.Combine({#"Removed Duplicates2", #"Removed Duplicates"}),
#"Removed Duplicates3" = Table.Distinct(#"Appended Query3"),

// Merge all tables returning names
#"Merged Portfolio xl" = Table.NestedJoin(#"Removed Duplicates3", {"EntityId"}, #"Project portfolio 20-24 excel", {"All in ID"}, "Project portfolio 20-24 excel", JoinKind.LeftOuter),
#"Expanded Project portfolio 20-24 excel" = Table.ExpandTableColumn(#"Merged Portfolio xl", "Project portfolio 20-24 excel", {"Project name"}, {"PortfolioXL.Name"}),
#"Remove duplicates" = Table.Distinct(#"Expanded Project portfolio 20-24 excel"),

// Merge with Main master sheet
MergedMasterSheet = Table.NestedJoin(#"Remove duplicates", {"EntityId"}, #"Main master sheet", {"All In"}, "MasterdataSheet", JoinKind.LeftOuter),
#"Expanded MasterdataSheet" = Table.ExpandTableColumn(MergedMasterSheet, "MasterdataSheet", {"Name in SAP"}, {"MasterdataSheet.Name"}),

// Merge with All In names
MergedAllin = Table.NestedJoin(#"Expanded MasterdataSheet", {"EntityId"}, #"Key info - Basic info", {"EntityId"}, "Key info - Basic info", JoinKind.LeftOuter),
#"Expanded Key info - Basic info" = Table.ExpandTableColumn(MergedAllin, "Key info - Basic info", {"Project name"}, {"All In.Name"}),

// Uppercase all name columns
#"Uppercased Text" = Table.TransformColumns(#"Expanded Key info - Basic info",{{"PortfolioXL.Name", Text.Upper, type text}, {"MasterdataSheet.Name", Text.Upper, type text}, {"All In.Name", Text.Upper, type text}}),
#"Removed Duplicates4" = Table.Distinct(#"Uppercased Text"),
#"Changed Type4" = Table.TransformColumnTypes(#"Removed Duplicates4",{{"PortfolioXL.Name", type text}, {"MasterdataSheet.Name", type text}, {"All In.Name", type text}}),

// Function to generate bigrams (2-letter sequences)
GenerateBigrams = (text as nullable text) as list =>
let
cleanText = Text.Lower(Text.Trim(if text = null then "" else text)),
chars = Text.ToList(cleanText),
bigrams = List.Transform({0..List.Count(chars)-2}, each Text.Combine(List.FirstN(List.Skip(chars, _), 2)))
in
if List.Count(chars) < 2 then {cleanText} else bigrams,

// Function to compute bigram similarity
BigramSimilarity = (text1 as nullable text, text2 as nullable text) as number =>
let
bigrams1 = GenerateBigrams(text1),
bigrams2 = GenerateBigrams(text2),
intersection = List.Intersect({bigrams1, bigrams2}),
union = List.Distinct(List.Combine({bigrams1, bigrams2})),
similarity = if List.Count(union) = 0 then 0 else Number.Round(List.Count(intersection) / List.Count(union), 2)
in
similarity,

// Compute similarity for each pair of columns
#"Added Similarity 1-2" = Table.AddColumn(#"Changed Type4", "Similarity 1-2", each BigramSimilarity([All In.Name], [MasterdataSheet.Name]), type number),
#"Added Similarity 1-3" = Table.AddColumn(#"Added Similarity 1-2", "Similarity 1-3", each BigramSimilarity([All In.Name], [PortfolioXL.Name]), type number),
#"Added Similarity 2-3" = Table.AddColumn(#"Added Similarity 1-3", "Similarity 2-3", each BigramSimilarity([MasterdataSheet.Name], [PortfolioXL.Name]), type number)
in
#"Added Similarity 2-3"

the total similarity was calculated using a union-based Jaccard similarity over all three columns combined.

 

BBF

The similarity generates number above 1 which should not be the case. for example. I would expect 100% match between 2 and 3 instead of 150%

bigk_0-1738155205137.png

 

bigk
Helper II
Helper II

and for my real data source, where should i place the let _t function?  I have many steps after reaching the desired table to check for similarity. The last step is:

#"Reordered Columns" = Table.ReorderColumns(#"Changed Type4", {"EntityId", "Name1", "Name2", "Name3"}),

And what if i have a spaces in column names? Should i add quotes for column names in the let_t row?

@bigk First of all, if the solution provided is correct, accept the prev answer as solution.

 

Then, if you give me your M code of the advanced editor, i can try to adapt it with the similarity calculation.

 

BBF

bigk
Helper II
Helper II

Perfect!! Seems to work. Is there a way to compare 3 columns in one go and not by pairs?

@bigk Here the overall similarity:

let
Source = Table.FromRows(
Json.Document(
Binary.Decompress(
Binary.FromText("i45WMlTSUfLKz8hTcEJmOCmAmEqxOtFKRkCub2JRpYKbHpDlpqcA5GSmQgRBoqV6YGVo5kBpf5AhsQA=", BinaryEncoding.Base64),
Compression.Deflate
)
),
let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Name1 = _t, Name2 = _t, Name3 = _t]
),
#"Changed Type" = Table.TransformColumnTypes(Source, {{"ID", Int64.Type}, {"Name1", type text}, {"Name2", type text}, {"Name3", type text}}),

// Function to generate bigrams (2-letter sequences)
GenerateBigrams = (text as nullable text) as list =>
let
cleanText = Text.Lower(Text.Trim(if text = null then "" else text)),
chars = Text.ToList(cleanText),
bigrams = List.Transform({0..List.Count(chars)-2}, each Text.Combine(List.FirstN(List.Skip(chars, _), 2)))
in
if List.Count(chars) < 2 then {cleanText} else bigrams,

// Function to compute bigram similarity
BigramSimilarity = (text1 as nullable text, text2 as nullable text) as number =>
let
bigrams1 = GenerateBigrams(text1),
bigrams2 = GenerateBigrams(text2),
intersection = List.Intersect({bigrams1, bigrams2}),
union = List.Distinct(List.Combine({bigrams1, bigrams2})),
similarity = if List.Count(union) = 0 then 0 else Number.Round(List.Count(intersection) / List.Count(union), 2)
in
similarity,

// Combine all three columns (Name1, Name2, Name3) into a single string
CombineNames = Table.AddColumn(#"Changed Type", "CombinedNames", each Text.Combine({[Name1], [Name2], [Name3]}, " "), type text),

// Compute similarity for the combined string (compares the combined names within the same row)
#"Added Combined Similarity" = Table.AddColumn(CombineNames, "CombinedSimilarity", each BigramSimilarity([CombinedNames], [CombinedNames]), type number),

// Optionally, remove the "CombinedNames" column if you don't need it
Result = Table.RemoveColumns(#"Added Combined Similarity", {"CombinedNames"})
in
Result

 

Please accept my answers as solutions!

 

BBF

Combined similarity gives values above 1. What is the max value then? Is it 3 because of 3 columns?

@bigk Yes, so try to make /3 to obtain the similarity overall.

 

BBF

bigk
Helper II
Helper II

Wait.. i think i know why. That is because the Jaccard similarity does not consider location of the letter. So John B = B John. Correct?

@bigk 

Jaccard Similarity only considers character sets, so it ignores letter order. That’s why you’re seeing a similarity of 1 when words contain the same letters but in different positions.

Since letter position matters, try to use the bigram similarity:

 

let
Source = Table.FromRows(
Json.Document(
Binary.Decompress(
Binary.FromText("i45WMlTSUfLKz8hTcEJmOCmAmEqxOtFKRkCub2JRpYKbHpDlpqcA5GSmQgRBoqV6YGVo5kBpf5AhsQA=", BinaryEncoding.Base64),
Compression.Deflate
)
),
let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Name1 = _t, Name2 = _t, Name3 = _t]
),
#"Changed Type" = Table.TransformColumnTypes(Source, {{"ID", Int64.Type}, {"Name1", type text}, {"Name2", type text}, {"Name3", type text}}),

// Function to generate bigrams (2-letter sequences)
GenerateBigrams = (text as nullable text) as list =>
let
cleanText = Text.Lower(Text.Trim(if text = null then "" else text)),
chars = Text.ToList(cleanText),
bigrams = List.Transform({0..List.Count(chars)-2}, each Text.Combine(List.FirstN(List.Skip(chars, _), 2)))
in
if List.Count(chars) < 2 then {cleanText} else bigrams,

// Function to compute bigram similarity
BigramSimilarity = (text1 as nullable text, text2 as nullable text) as number =>
let
bigrams1 = GenerateBigrams(text1),
bigrams2 = GenerateBigrams(text2),
intersection = List.Intersect({bigrams1, bigrams2}),
union = List.Distinct(List.Combine({bigrams1, bigrams2})),
similarity = if List.Count(union) = 0 then 0 else Number.Round(List.Count(intersection) / List.Count(union), 2)
in
similarity,

// Compute similarity for each pair of columns
#"Added Similarity 1-2" = Table.AddColumn(#"Changed Type", "Similarity 1-2", each BigramSimilarity([Name1], [Name2]), type number),
#"Added Similarity 1-3" = Table.AddColumn(#"Added Similarity 1-2", "Similarity 1-3", each BigramSimilarity([Name1], [Name3]), type number),
#"Added Similarity 2-3" = Table.AddColumn(#"Added Similarity 1-3", "Similarity 2-3", each BigramSimilarity([Name2], [Name3]), type number)
in
#"Added Similarity 2-3"

 

BBF

bigk
Helper II
Helper II

Thanks @BeaBF 

Numbers seems to becoming more realistic. However the first row still have 1 1 1 similarity which is not true. Any clue why this happens?

BeaBF
Super User
Super User

@bigk Hi! Try with the Jaccard Similarity.

let
Source = Table.FromRows(
Json.Document(
Binary.Decompress(
Binary.FromText("i45WMlTSUfLKz8hTcEJmOCmAmEqxOtFKRkCub2JRpYKbHpDlpqcA5GSmQgRBoqV6YGVo5kBpf5AhsQA=", BinaryEncoding.Base64),
Compression.Deflate
)
),
let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Name1 = _t, Name2 = _t, Name3 = _t]
),
#"Changed Type" = Table.TransformColumnTypes(Source, {{"ID", Int64.Type}, {"Name1", type text}, {"Name2", type text}, {"Name3", type text}}),

// Function to compute Jaccard Similarity
JaccardSimilarity = (text1 as nullable text, text2 as nullable text) as number =>
let
s1 = if text1 = null then "" else Text.Lower(Text.Trim(text1)),
s2 = if text2 = null then "" else Text.Lower(Text.Trim(text2)),
set1 = List.Distinct(Text.ToList(s1)),
set2 = List.Distinct(Text.ToList(s2)),
intersection = List.Intersect({set1, set2}),
union = List.Distinct(List.Combine({set1, set2})),
similarity = if List.Count(union) = 0 then 0 else Number.Round(List.Count(intersection) / List.Count(union), 2)
in
similarity,

// Compute similarity for each pair of name columns
#"Added Similarity 1-2" = Table.AddColumn(#"Changed Type", "Similarity 1-2", each JaccardSimilarity([Name1], [Name2]), type number),
#"Added Similarity 1-3" = Table.AddColumn(#"Added Similarity 1-2", "Similarity 1-3", each JaccardSimilarity([Name1], [Name3]), type number),
#"Added Similarity 2-3" = Table.AddColumn(#"Added Similarity 1-3", "Similarity 2-3", each JaccardSimilarity([Name2], [Name3]), type number)
in
#"Added Similarity 2-3"

 

BBF

Helpful resources

Announcements
Power BI DataViz World Championships

Power BI Dataviz World Championships

The Power BI Data Visualization World Championships is back! Get ahead of the game and start preparing now!

December 2025 Power BI Update Carousel

Power BI Monthly Update - December 2025

Check out the December 2025 Power BI Holiday Recap!

FabCon Atlanta 2026 carousel

FabCon Atlanta 2026

Join us at FabCon Atlanta, March 16-20, for the ultimate Fabric, Power BI, AI and SQL community-led event. Save $200 with code FABCOMM.