Check your eligibility for this 50% exam voucher offer and join us for free live learning sessions to get prepared for Exam DP-700.
Get StartedDon't miss out! 2025 Microsoft Fabric Community Conference, March 31 - April 2, Las Vegas, Nevada. Use code MSCUST for a $150 discount. Prices go up February 11th. Register now.
Hello
I have a table with ID and 3 name columns. I would like to calculate a similarity rate for each pair of columns, e.g. 1-2, 1-3, 2-3. I've looked-up various videos, forums and GPT and almost found the solution but it does not work how i expected it to work.
It seems that it compares column values not row values.
I've found a post with similar problem where the solution is to pre-filter data before applying fuzzy match.
Calculate row similarity / fuzzy matching - Microsoft Fabric Community
I'm not sure if this would help. Or maybe the whole logic should be changed.
Here is the query.
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUfLKz8hTcEJmOCmAmEqxOtFKRkCub2JRpYKbHpDlpqcA5GSmQgRBoqV6YGVo5kBpf5AhsQA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Name1 = _t, Name2 = _t, Name3 = _t]),
#"Changed Type4" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"Name1", type text}, {"Name2", type text}, {"Name3", type text}}),
// Duplicate the table for fuzzy merging and compare
DuplicateTable = #"Changed Type4",
// Merge ALLIN and MD using fuzzy matching
#"Fuzzy1-2" = Table.FuzzyNestedJoin(#"Changed Type4", {"Name1"}, DuplicateTable, {"Name2"}, "FuzzyMatch_1_2", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, NumberOfMatches=1, SimilarityColumnName = "Similarity 1-2", Threshold=0.2]),
#"Expanded FuzzyMatch_1_2" = Table.ExpandTableColumn(#"Fuzzy1-2", "FuzzyMatch_1_2", {"Similarity 1-2"}, {"Similarity 1-2"}),
// Merge ALLIN and Portfolio
#"Fuzzy1-3" = Table.FuzzyNestedJoin(#"Expanded FuzzyMatch_1_2", {"Name1"}, DuplicateTable, {"Name3"}, "FuzzyMatch_1_3", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, NumberOfMatches=1, SimilarityColumnName = "Similarity 1-3", Threshold=0.2]),
#"Expanded FuzzyMatch_1_3" = Table.ExpandTableColumn(#"Fuzzy1-3", "FuzzyMatch_1_3", {"Similarity 1-3"}, {"Similarity 1-3"}),
// Merge Masterdata and portfolio
#"Fuzzy2-3" = Table.FuzzyNestedJoin(#"Expanded FuzzyMatch_1_3", {"Name2"}, DuplicateTable, {"Name3"}, "FuzzyMatch_2_3", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, NumberOfMatches=1, SimilarityColumnName = "Similarity 2-3", Threshold=0.2]),
#"Expanded FuzzyMatch_2_3" = Table.ExpandTableColumn(#"Fuzzy2-3", "FuzzyMatch_2_3", {"Similarity 2-3"}, {"Similarity 2-3"})
in
#"Expanded FuzzyMatch_2_3"
Thanks for help.
Solved! Go to Solution.
Jaccard Similarity only considers character sets, so it ignores letter order. That’s why you’re seeing a similarity of 1 when words contain the same letters but in different positions.
Since letter position matters, try to use the bigram similarity:
let
Source = Table.FromRows(
Json.Document(
Binary.Decompress(
Binary.FromText("i45WMlTSUfLKz8hTcEJmOCmAmEqxOtFKRkCub2JRpYKbHpDlpqcA5GSmQgRBoqV6YGVo5kBpf5AhsQA=", BinaryEncoding.Base64),
Compression.Deflate
)
),
let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Name1 = _t, Name2 = _t, Name3 = _t]
),
#"Changed Type" = Table.TransformColumnTypes(Source, {{"ID", Int64.Type}, {"Name1", type text}, {"Name2", type text}, {"Name3", type text}}),
// Function to generate bigrams (2-letter sequences)
GenerateBigrams = (text as nullable text) as list =>
let
cleanText = Text.Lower(Text.Trim(if text = null then "" else text)),
chars = Text.ToList(cleanText),
bigrams = List.Transform({0..List.Count(chars)-2}, each Text.Combine(List.FirstN(List.Skip(chars, _), 2)))
in
if List.Count(chars) < 2 then {cleanText} else bigrams,
// Function to compute bigram similarity
BigramSimilarity = (text1 as nullable text, text2 as nullable text) as number =>
let
bigrams1 = GenerateBigrams(text1),
bigrams2 = GenerateBigrams(text2),
intersection = List.Intersect({bigrams1, bigrams2}),
union = List.Distinct(List.Combine({bigrams1, bigrams2})),
similarity = if List.Count(union) = 0 then 0 else Number.Round(List.Count(intersection) / List.Count(union), 2)
in
similarity,
// Compute similarity for each pair of columns
#"Added Similarity 1-2" = Table.AddColumn(#"Changed Type", "Similarity 1-2", each BigramSimilarity([Name1], [Name2]), type number),
#"Added Similarity 1-3" = Table.AddColumn(#"Added Similarity 1-2", "Similarity 1-3", each BigramSimilarity([Name1], [Name3]), type number),
#"Added Similarity 2-3" = Table.AddColumn(#"Added Similarity 1-3", "Similarity 2-3", each BigramSimilarity([Name2], [Name3]), type number)
in
#"Added Similarity 2-3"
BBF
Solution accepted. I just wanted to do that at the very end. Here is my full statement:
let
Source = #"Key info - Basic info",
#"Removed Other Columns" = Table.SelectColumns(Source,{"EntityId"}),
//Append Portfolio Excel
#"Load_Portfolio_Excel has All In ID" = #"Project portfolio 20-24 excel",
SelectedColumns1 = Table.SelectColumns(#"Load_Portfolio_Excel has All In ID", {"All in ID"}),
#"Changed Type1" = Table.TransformColumnTypes(SelectedColumns1,{{"All in ID", type text}}),
#"Renamed Columns1" = Table.RenameColumns(#"Changed Type1",{{"All in ID", "EntityId"}}),
#"Appended Query1" = Table.Combine({#"Removed Other Columns", #"Renamed Columns1"}),
#"Removed Duplicates1" = Table.Distinct(#"Appended Query1"),
// Merged with OneFico costs table
#"Load OneFico Costs" = #"ICT costs OneFico",
SelectedColumns2 = Table.SelectColumns(#"Load OneFico Costs", {"All in ID"}),
#"Changed Type2" = Table.TransformColumnTypes(SelectedColumns2,{{"All in ID", type text}}),
#"Renamed Columns2" = Table.RenameColumns(#"Changed Type2",{{"All in ID", "EntityId"}}),
#"Removed Duplicates5" = Table.Distinct(#"Renamed Columns2"),
#"Appended Query2" = Table.Combine({#"Removed Duplicates1", #"Removed Duplicates5"}),
#"Removed Duplicates2" = Table.Distinct(#"Appended Query2"),
// Merged with Master file Main sheet
#"Load main master sheet" = #"Main master sheet",
SelectedColumns3 = Table.SelectColumns(#"Load main master sheet", {"All In"}),
#"Changed Type3" = Table.TransformColumnTypes(SelectedColumns3,{{"All In", type text}}),
#"Renamed Columns3" = Table.RenameColumns(#"Changed Type3",{{"All In", "EntityId"}}),
#"Removed Duplicates" = Table.Distinct(#"Renamed Columns3"),
#"Appended Query3" = Table.Combine({#"Removed Duplicates2", #"Removed Duplicates"}),
#"Removed Duplicates3" = Table.Distinct(#"Appended Query3"),
//Merge all tables returning names
//Merge with Portfolio excel
#"Merged Portfolio xl" = Table.NestedJoin(#"Removed Duplicates3", {"EntityId"}, #"Project portfolio 20-24 excel", {"All in ID"}, "Project portfolio 20-24 excel", JoinKind.LeftOuter),
#"Expanded Project portfolio 20-24 excel" = Table.ExpandTableColumn(#"Merged Portfolio xl", "Project portfolio 20-24 excel", {"Project name"}, {"PortfolioXL.Name"}),
#"Remove duplicates" = Table.Distinct(#"Expanded Project portfolio 20-24 excel"),
//Merge with Main master sheet
MergedMasterSheet = Table.NestedJoin(#"Remove duplicates", {"EntityId"}, #"Main master sheet", {"All In"}, "MasterdataSheet", JoinKind.LeftOuter),
#"Expanded MasterdataSheet" = Table.ExpandTableColumn(MergedMasterSheet, "MasterdataSheet", {"Name in SAP"}, {"MasterdataSheet.Name"}),
//Merge with All In names
MergedAllin = Table.NestedJoin(#"Expanded MasterdataSheet", {"EntityId"}, #"Key info - Basic info", {"EntityId"}, "Key info - Basic info", JoinKind.LeftOuter),
#"Expanded Key info - Basic info" = Table.ExpandTableColumn(MergedAllin, "Key info - Basic info", {"Project name"}, {"All In.Name"}),
#"Uppercased Text" = Table.TransformColumns(#"Expanded Key info - Basic info",{{"PortfolioXL.Name", Text.Upper, type text}, {"MasterdataSheet.Name", Text.Upper, type text}, {"All In.Name", Text.Upper, type text}}),
#"Removed Duplicates4" = Table.Distinct(#"Uppercased Text"),
#"Changed Type4" = Table.TransformColumnTypes(#"Removed Duplicates4",{{"PortfolioXL.Name", type text}, {"MasterdataSheet.Name", type text}, {"All In.Name", type text}})
in
#"Changed Type4"
EntityId = ID
All In.Name = Name1
MasterdataSheet.Name = Name2
PortfolioXL.Name = Name3
thanks
@bigk it should be:
let
Source = #"Key info - Basic info",
#"Removed Other Columns" = Table.SelectColumns(Source,{"EntityId"}),
// Append Portfolio Excel
#"Load_Portfolio_Excel has All In ID" = #"Project portfolio 20-24 excel",
SelectedColumns1 = Table.SelectColumns(#"Load_Portfolio_Excel has All In ID", {"All in ID"}),
#"Changed Type1" = Table.TransformColumnTypes(SelectedColumns1,{{"All in ID", type text}}),
#"Renamed Columns1" = Table.RenameColumns(#"Changed Type1",{{"All in ID", "EntityId"}}),
#"Appended Query1" = Table.Combine({#"Removed Other Columns", #"Renamed Columns1"}),
#"Removed Duplicates1" = Table.Distinct(#"Appended Query1"),
// Merged with OneFico costs table
#"Load OneFico Costs" = #"ICT costs OneFico",
SelectedColumns2 = Table.SelectColumns(#"Load OneFico Costs", {"All in ID"}),
#"Changed Type2" = Table.TransformColumnTypes(SelectedColumns2,{{"All in ID", type text}}),
#"Renamed Columns2" = Table.RenameColumns(#"Changed Type2",{{"All in ID", "EntityId"}}),
#"Removed Duplicates5" = Table.Distinct(#"Renamed Columns2"),
#"Appended Query2" = Table.Combine({#"Removed Duplicates1", #"Removed Duplicates5"}),
#"Removed Duplicates2" = Table.Distinct(#"Appended Query2"),
// Merged with Master file Main sheet
#"Load main master sheet" = #"Main master sheet",
SelectedColumns3 = Table.SelectColumns(#"Load main master sheet", {"All In"}),
#"Changed Type3" = Table.TransformColumnTypes(SelectedColumns3,{{"All In", type text}}),
#"Renamed Columns3" = Table.RenameColumns(#"Changed Type3",{{"All In", "EntityId"}}),
#"Removed Duplicates" = Table.Distinct(#"Renamed Columns3"),
#"Appended Query3" = Table.Combine({#"Removed Duplicates2", #"Removed Duplicates"}),
#"Removed Duplicates3" = Table.Distinct(#"Appended Query3"),
// Merge all tables returning names
#"Merged Portfolio xl" = Table.NestedJoin(#"Removed Duplicates3", {"EntityId"}, #"Project portfolio 20-24 excel", {"All in ID"}, "Project portfolio 20-24 excel", JoinKind.LeftOuter),
#"Expanded Project portfolio 20-24 excel" = Table.ExpandTableColumn(#"Merged Portfolio xl", "Project portfolio 20-24 excel", {"Project name"}, {"PortfolioXL.Name"}),
#"Remove duplicates" = Table.Distinct(#"Expanded Project portfolio 20-24 excel"),
// Merge with Main master sheet
MergedMasterSheet = Table.NestedJoin(#"Remove duplicates", {"EntityId"}, #"Main master sheet", {"All In"}, "MasterdataSheet", JoinKind.LeftOuter),
#"Expanded MasterdataSheet" = Table.ExpandTableColumn(MergedMasterSheet, "MasterdataSheet", {"Name in SAP"}, {"MasterdataSheet.Name"}),
// Merge with All In names
MergedAllin = Table.NestedJoin(#"Expanded MasterdataSheet", {"EntityId"}, #"Key info - Basic info", {"EntityId"}, "Key info - Basic info", JoinKind.LeftOuter),
#"Expanded Key info - Basic info" = Table.ExpandTableColumn(MergedAllin, "Key info - Basic info", {"Project name"}, {"All In.Name"}),
// Uppercase all name columns
#"Uppercased Text" = Table.TransformColumns(#"Expanded Key info - Basic info",{{"PortfolioXL.Name", Text.Upper, type text}, {"MasterdataSheet.Name", Text.Upper, type text}, {"All In.Name", Text.Upper, type text}}),
#"Removed Duplicates4" = Table.Distinct(#"Uppercased Text"),
#"Changed Type4" = Table.TransformColumnTypes(#"Removed Duplicates4",{{"PortfolioXL.Name", type text}, {"MasterdataSheet.Name", type text}, {"All In.Name", type text}}),
// Function to generate bigrams (2-letter sequences)
GenerateBigrams = (text as nullable text) as list =>
let
cleanText = Text.Lower(Text.Trim(if text = null then "" else text)),
chars = Text.ToList(cleanText),
bigrams = List.Transform({0..List.Count(chars)-2}, each Text.Combine(List.FirstN(List.Skip(chars, _), 2)))
in
if List.Count(chars) < 2 then {cleanText} else bigrams,
// Function to compute bigram similarity
BigramSimilarity = (text1 as nullable text, text2 as nullable text) as number =>
let
bigrams1 = GenerateBigrams(text1),
bigrams2 = GenerateBigrams(text2),
intersection = List.Intersect({bigrams1, bigrams2}),
union = List.Distinct(List.Combine({bigrams1, bigrams2})),
similarity = if List.Count(union) = 0 then 0 else Number.Round(List.Count(intersection) / List.Count(union), 2)
in
similarity,
// Combine all three columns (Name1, Name2, Name3) into a single string
CombineNames = Table.AddColumn(#"Changed Type4", "CombinedNames", each Text.Combine({[All In.Name], [MasterdataSheet.Name], [PortfolioXL.Name]}, " "), type text),
// Compute similarity for the combined string (compares the combined names within the same row)
#"Added Combined Similarity" = Table.AddColumn(CombineNames, "CombinedSimilarity", each BigramSimilarity([CombinedNames], [CombinedNames]), type number),
// Optionally, remove the "CombinedNames" column if you don't need it
Result = Table.RemoveColumns(#"Added Combined Similarity", {"CombinedNames"})
in
Result
BBF
Thanks for this too. Can you please tell me how the combined similarity result work? Is the 3 max value?
Also can you please add query with the second option with bigram only for pairs of columns. Not combining them.
@bigk Here the code for similarity in pairs:
let
Source = #"Key info - Basic info",
#"Removed Other Columns" = Table.SelectColumns(Source,{"EntityId"}),
// Append Portfolio Excel
#"Load_Portfolio_Excel has All In ID" = #"Project portfolio 20-24 excel",
SelectedColumns1 = Table.SelectColumns(#"Load_Portfolio_Excel has All In ID", {"All in ID"}),
#"Changed Type1" = Table.TransformColumnTypes(SelectedColumns1,{{"All in ID", type text}}),
#"Renamed Columns1" = Table.RenameColumns(#"Changed Type1",{{"All in ID", "EntityId"}}),
#"Appended Query1" = Table.Combine({#"Removed Other Columns", #"Renamed Columns1"}),
#"Removed Duplicates1" = Table.Distinct(#"Appended Query1"),
// Merged with OneFico costs table
#"Load OneFico Costs" = #"ICT costs OneFico",
SelectedColumns2 = Table.SelectColumns(#"Load OneFico Costs", {"All in ID"}),
#"Changed Type2" = Table.TransformColumnTypes(SelectedColumns2,{{"All in ID", type text}}),
#"Renamed Columns2" = Table.RenameColumns(#"Changed Type2",{{"All in ID", "EntityId"}}),
#"Removed Duplicates5" = Table.Distinct(#"Renamed Columns2"),
#"Appended Query2" = Table.Combine({#"Removed Duplicates1", #"Removed Duplicates5"}),
#"Removed Duplicates2" = Table.Distinct(#"Appended Query2"),
// Merged with Master file Main sheet
#"Load main master sheet" = #"Main master sheet",
SelectedColumns3 = Table.SelectColumns(#"Load main master sheet", {"All In"}),
#"Changed Type3" = Table.TransformColumnTypes(SelectedColumns3,{{"All In", type text}}),
#"Renamed Columns3" = Table.RenameColumns(#"Changed Type3",{{"All In", "EntityId"}}),
#"Removed Duplicates" = Table.Distinct(#"Renamed Columns3"),
#"Appended Query3" = Table.Combine({#"Removed Duplicates2", #"Removed Duplicates"}),
#"Removed Duplicates3" = Table.Distinct(#"Appended Query3"),
// Merge all tables returning names
#"Merged Portfolio xl" = Table.NestedJoin(#"Removed Duplicates3", {"EntityId"}, #"Project portfolio 20-24 excel", {"All in ID"}, "Project portfolio 20-24 excel", JoinKind.LeftOuter),
#"Expanded Project portfolio 20-24 excel" = Table.ExpandTableColumn(#"Merged Portfolio xl", "Project portfolio 20-24 excel", {"Project name"}, {"PortfolioXL.Name"}),
#"Remove duplicates" = Table.Distinct(#"Expanded Project portfolio 20-24 excel"),
// Merge with Main master sheet
MergedMasterSheet = Table.NestedJoin(#"Remove duplicates", {"EntityId"}, #"Main master sheet", {"All In"}, "MasterdataSheet", JoinKind.LeftOuter),
#"Expanded MasterdataSheet" = Table.ExpandTableColumn(MergedMasterSheet, "MasterdataSheet", {"Name in SAP"}, {"MasterdataSheet.Name"}),
// Merge with All In names
MergedAllin = Table.NestedJoin(#"Expanded MasterdataSheet", {"EntityId"}, #"Key info - Basic info", {"EntityId"}, "Key info - Basic info", JoinKind.LeftOuter),
#"Expanded Key info - Basic info" = Table.ExpandTableColumn(MergedAllin, "Key info - Basic info", {"Project name"}, {"All In.Name"}),
// Uppercase all name columns
#"Uppercased Text" = Table.TransformColumns(#"Expanded Key info - Basic info",{{"PortfolioXL.Name", Text.Upper, type text}, {"MasterdataSheet.Name", Text.Upper, type text}, {"All In.Name", Text.Upper, type text}}),
#"Removed Duplicates4" = Table.Distinct(#"Uppercased Text"),
#"Changed Type4" = Table.TransformColumnTypes(#"Removed Duplicates4",{{"PortfolioXL.Name", type text}, {"MasterdataSheet.Name", type text}, {"All In.Name", type text}}),
// Function to generate bigrams (2-letter sequences)
GenerateBigrams = (text as nullable text) as list =>
let
cleanText = Text.Lower(Text.Trim(if text = null then "" else text)),
chars = Text.ToList(cleanText),
bigrams = List.Transform({0..List.Count(chars)-2}, each Text.Combine(List.FirstN(List.Skip(chars, _), 2)))
in
if List.Count(chars) < 2 then {cleanText} else bigrams,
// Function to compute bigram similarity
BigramSimilarity = (text1 as nullable text, text2 as nullable text) as number =>
let
bigrams1 = GenerateBigrams(text1),
bigrams2 = GenerateBigrams(text2),
intersection = List.Intersect({bigrams1, bigrams2}),
union = List.Distinct(List.Combine({bigrams1, bigrams2})),
similarity = if List.Count(union) = 0 then 0 else Number.Round(List.Count(intersection) / List.Count(union), 2)
in
similarity,
// Compute similarity for each pair of columns
#"Added Similarity 1-2" = Table.AddColumn(#"Changed Type4", "Similarity 1-2", each BigramSimilarity([All In.Name], [MasterdataSheet.Name]), type number),
#"Added Similarity 1-3" = Table.AddColumn(#"Added Similarity 1-2", "Similarity 1-3", each BigramSimilarity([All In.Name], [PortfolioXL.Name]), type number),
#"Added Similarity 2-3" = Table.AddColumn(#"Added Similarity 1-3", "Similarity 2-3", each BigramSimilarity([MasterdataSheet.Name], [PortfolioXL.Name]), type number)
in
#"Added Similarity 2-3"
the total similarity was calculated using a union-based Jaccard similarity over all three columns combined.
BBF
The similarity generates number above 1 which should not be the case. for example. I would expect 100% match between 2 and 3 instead of 150%
and for my real data source, where should i place the let _t function? I have many steps after reaching the desired table to check for similarity. The last step is:
#"Reordered Columns" = Table.ReorderColumns(#"Changed Type4", {"EntityId", "Name1", "Name2", "Name3"}),
And what if i have a spaces in column names? Should i add quotes for column names in the let_t row?
@bigk First of all, if the solution provided is correct, accept the prev answer as solution.
Then, if you give me your M code of the advanced editor, i can try to adapt it with the similarity calculation.
BBF
Perfect!! Seems to work. Is there a way to compare 3 columns in one go and not by pairs?
@bigk Here the overall similarity:
let
Source = Table.FromRows(
Json.Document(
Binary.Decompress(
Binary.FromText("i45WMlTSUfLKz8hTcEJmOCmAmEqxOtFKRkCub2JRpYKbHpDlpqcA5GSmQgRBoqV6YGVo5kBpf5AhsQA=", BinaryEncoding.Base64),
Compression.Deflate
)
),
let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Name1 = _t, Name2 = _t, Name3 = _t]
),
#"Changed Type" = Table.TransformColumnTypes(Source, {{"ID", Int64.Type}, {"Name1", type text}, {"Name2", type text}, {"Name3", type text}}),
// Function to generate bigrams (2-letter sequences)
GenerateBigrams = (text as nullable text) as list =>
let
cleanText = Text.Lower(Text.Trim(if text = null then "" else text)),
chars = Text.ToList(cleanText),
bigrams = List.Transform({0..List.Count(chars)-2}, each Text.Combine(List.FirstN(List.Skip(chars, _), 2)))
in
if List.Count(chars) < 2 then {cleanText} else bigrams,
// Function to compute bigram similarity
BigramSimilarity = (text1 as nullable text, text2 as nullable text) as number =>
let
bigrams1 = GenerateBigrams(text1),
bigrams2 = GenerateBigrams(text2),
intersection = List.Intersect({bigrams1, bigrams2}),
union = List.Distinct(List.Combine({bigrams1, bigrams2})),
similarity = if List.Count(union) = 0 then 0 else Number.Round(List.Count(intersection) / List.Count(union), 2)
in
similarity,
// Combine all three columns (Name1, Name2, Name3) into a single string
CombineNames = Table.AddColumn(#"Changed Type", "CombinedNames", each Text.Combine({[Name1], [Name2], [Name3]}, " "), type text),
// Compute similarity for the combined string (compares the combined names within the same row)
#"Added Combined Similarity" = Table.AddColumn(CombineNames, "CombinedSimilarity", each BigramSimilarity([CombinedNames], [CombinedNames]), type number),
// Optionally, remove the "CombinedNames" column if you don't need it
Result = Table.RemoveColumns(#"Added Combined Similarity", {"CombinedNames"})
in
Result
Please accept my answers as solutions!
BBF
Combined similarity gives values above 1. What is the max value then? Is it 3 because of 3 columns?
Wait.. i think i know why. That is because the Jaccard similarity does not consider location of the letter. So John B = B John. Correct?
Jaccard Similarity only considers character sets, so it ignores letter order. That’s why you’re seeing a similarity of 1 when words contain the same letters but in different positions.
Since letter position matters, try to use the bigram similarity:
let
Source = Table.FromRows(
Json.Document(
Binary.Decompress(
Binary.FromText("i45WMlTSUfLKz8hTcEJmOCmAmEqxOtFKRkCub2JRpYKbHpDlpqcA5GSmQgRBoqV6YGVo5kBpf5AhsQA=", BinaryEncoding.Base64),
Compression.Deflate
)
),
let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Name1 = _t, Name2 = _t, Name3 = _t]
),
#"Changed Type" = Table.TransformColumnTypes(Source, {{"ID", Int64.Type}, {"Name1", type text}, {"Name2", type text}, {"Name3", type text}}),
// Function to generate bigrams (2-letter sequences)
GenerateBigrams = (text as nullable text) as list =>
let
cleanText = Text.Lower(Text.Trim(if text = null then "" else text)),
chars = Text.ToList(cleanText),
bigrams = List.Transform({0..List.Count(chars)-2}, each Text.Combine(List.FirstN(List.Skip(chars, _), 2)))
in
if List.Count(chars) < 2 then {cleanText} else bigrams,
// Function to compute bigram similarity
BigramSimilarity = (text1 as nullable text, text2 as nullable text) as number =>
let
bigrams1 = GenerateBigrams(text1),
bigrams2 = GenerateBigrams(text2),
intersection = List.Intersect({bigrams1, bigrams2}),
union = List.Distinct(List.Combine({bigrams1, bigrams2})),
similarity = if List.Count(union) = 0 then 0 else Number.Round(List.Count(intersection) / List.Count(union), 2)
in
similarity,
// Compute similarity for each pair of columns
#"Added Similarity 1-2" = Table.AddColumn(#"Changed Type", "Similarity 1-2", each BigramSimilarity([Name1], [Name2]), type number),
#"Added Similarity 1-3" = Table.AddColumn(#"Added Similarity 1-2", "Similarity 1-3", each BigramSimilarity([Name1], [Name3]), type number),
#"Added Similarity 2-3" = Table.AddColumn(#"Added Similarity 1-3", "Similarity 2-3", each BigramSimilarity([Name2], [Name3]), type number)
in
#"Added Similarity 2-3"
BBF
Thanks @BeaBF
Numbers seems to becoming more realistic. However the first row still have 1 1 1 similarity which is not true. Any clue why this happens?
@bigk Hi! Try with the Jaccard Similarity.
let
Source = Table.FromRows(
Json.Document(
Binary.Decompress(
Binary.FromText("i45WMlTSUfLKz8hTcEJmOCmAmEqxOtFKRkCub2JRpYKbHpDlpqcA5GSmQgRBoqV6YGVo5kBpf5AhsQA=", BinaryEncoding.Base64),
Compression.Deflate
)
),
let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Name1 = _t, Name2 = _t, Name3 = _t]
),
#"Changed Type" = Table.TransformColumnTypes(Source, {{"ID", Int64.Type}, {"Name1", type text}, {"Name2", type text}, {"Name3", type text}}),
// Function to compute Jaccard Similarity
JaccardSimilarity = (text1 as nullable text, text2 as nullable text) as number =>
let
s1 = if text1 = null then "" else Text.Lower(Text.Trim(text1)),
s2 = if text2 = null then "" else Text.Lower(Text.Trim(text2)),
set1 = List.Distinct(Text.ToList(s1)),
set2 = List.Distinct(Text.ToList(s2)),
intersection = List.Intersect({set1, set2}),
union = List.Distinct(List.Combine({set1, set2})),
similarity = if List.Count(union) = 0 then 0 else Number.Round(List.Count(intersection) / List.Count(union), 2)
in
similarity,
// Compute similarity for each pair of name columns
#"Added Similarity 1-2" = Table.AddColumn(#"Changed Type", "Similarity 1-2", each JaccardSimilarity([Name1], [Name2]), type number),
#"Added Similarity 1-3" = Table.AddColumn(#"Added Similarity 1-2", "Similarity 1-3", each JaccardSimilarity([Name1], [Name3]), type number),
#"Added Similarity 2-3" = Table.AddColumn(#"Added Similarity 1-3", "Similarity 2-3", each JaccardSimilarity([Name2], [Name3]), type number)
in
#"Added Similarity 2-3"
BBF
March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!
Check out the January 2025 Power BI update to learn about new features in Reporting, Modeling, and Data Connectivity.