March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early bird discount ends December 31.
Register NowBe one of the first to start using Fabric Databases. View on-demand sessions with database experts and the Microsoft product team to learn just how easy it is to get started. Watch now
Hi,
I have a table with two columns containing a text string like this:
Name 1 | Name 2 |
Savoy Suites | Savoy Suites Hotel Apartments |
NH Prague City | NH Prague City |
Badhotel Scheveningen | Badhotel Scheveningen |
Terme Tuhelj en Camping | Terme Tuhelj |
Les Lauriers Roses | Belambra Club Les Lauriers Roses |
Nuovo Natural Village | Natural Village Resort |
Torfhaus Harzresort | Torfhaus Harzresort |
Muenster Kongresscenter Affiliated by Melia | Hotel Münster Kongresscenter Affiliated by Meliá |
Royal Azur Thalasso Golf | Royal Azur Thalasso Golf |
Eperland | Hotel Eperland |
Is it possible creating a new column (or using a measure) to calculate the Jaccard Index for these two strings?
Thanks in advance!
Best,
Jori
Solved! Go to Solution.
Hi @jppv20
If the pairs of strings are known in advance, I would say this task more suited to Power Query. You could create a custom function for Jaccard Similarity.
However, it is interesting to look at doing this with DAX 🙂
I have attached an example with a Jaccard Similarity calculated column (which could be adapted to a measure).
I assumed that strings would be tokenized as words, delimited by spaces. Does this match your methodolgy? If not, I'm sure it can be modified to suit.
Here is the code for the calculated column:
Jaccard Similarity =
VAR String1 = Data[Name 1]
VAR String2 = Data[Name 2]
-- Tokenize strings using space
VAR Delimiter = " "
-- Convert to vertical bar delimiter to allow path functions
VAR Bar = "|"
VAR String1_BarDelimited =
SUBSTITUTE ( String1, Delimiter, Bar )
VAR String1_TokenCount =
PATHLENGTH ( String1_BarDelimited )
VAR String2_BarDelimited =
SUBSTITUTE ( String2, Delimiter, Bar )
VAR String2_TokenCount =
PATHLENGTH ( String2_BarDelimited )
VAR String1_Tokens =
SELECTCOLUMNS (
GENERATESERIES ( 1, String1_TokenCount ),
"Token", PATHITEM ( String1_BarDelimited, [Value] )
)
VAR String2_Tokens =
SELECTCOLUMNS (
GENERATESERIES ( 1, String2_TokenCount ),
"Token", PATHITEM ( String2_BarDelimited, [Value] )
)
-- DAX INTERSECT function preserves duplicates so is suitable to find intersection
VAR Tokens_Intersection =
INTERSECT ( String1_Tokens, String2_Tokens )
-- DAX UNION function not suitable as it is equivalent to SQL UNION ALL
-- First get distinct union of tokens from both strings
VAR Tokens_Union_Distinct =
DISTINCT ( UNION ( String1_Tokens, String2_Tokens ) )
-- Then repeat each token the max number of times it appears in each string
VAR Tokens_Union =
SELECTCOLUMNS (
GENERATE (
Tokens_Union_Distinct,
VAR CurrentToken = [Token]
VAR String1_Count =
COUNTROWS ( FILTER ( String1_Tokens, [Token] = CurrentToken ) )
VAR String2_Count =
COUNTROWS ( FILTER ( String2_Tokens, [Token] = CurrentToken ) )
VAR Max_Count =
MAX ( String1_Count, String2_Count )
RETURN
-- create required number of rows
GENERATESERIES ( 1, Max_Count )
),
"Token", [Token]
)
VAR Jaccard =
DIVIDE (
COUNTROWS ( Tokens_Intersection ),
COUNTROWS ( Tokens_Union )
)
RETURN
Jaccard
One thing to note is that the DAX UNION function is similar to UNION ALL in SQL, so we need to modify its result to ensure each token is repeated the correct number of times (the max occurrences in any one of the strings).
Regards
Hi @jppv20
If the pairs of strings are known in advance, I would say this task more suited to Power Query. You could create a custom function for Jaccard Similarity.
However, it is interesting to look at doing this with DAX 🙂
I have attached an example with a Jaccard Similarity calculated column (which could be adapted to a measure).
I assumed that strings would be tokenized as words, delimited by spaces. Does this match your methodolgy? If not, I'm sure it can be modified to suit.
Here is the code for the calculated column:
Jaccard Similarity =
VAR String1 = Data[Name 1]
VAR String2 = Data[Name 2]
-- Tokenize strings using space
VAR Delimiter = " "
-- Convert to vertical bar delimiter to allow path functions
VAR Bar = "|"
VAR String1_BarDelimited =
SUBSTITUTE ( String1, Delimiter, Bar )
VAR String1_TokenCount =
PATHLENGTH ( String1_BarDelimited )
VAR String2_BarDelimited =
SUBSTITUTE ( String2, Delimiter, Bar )
VAR String2_TokenCount =
PATHLENGTH ( String2_BarDelimited )
VAR String1_Tokens =
SELECTCOLUMNS (
GENERATESERIES ( 1, String1_TokenCount ),
"Token", PATHITEM ( String1_BarDelimited, [Value] )
)
VAR String2_Tokens =
SELECTCOLUMNS (
GENERATESERIES ( 1, String2_TokenCount ),
"Token", PATHITEM ( String2_BarDelimited, [Value] )
)
-- DAX INTERSECT function preserves duplicates so is suitable to find intersection
VAR Tokens_Intersection =
INTERSECT ( String1_Tokens, String2_Tokens )
-- DAX UNION function not suitable as it is equivalent to SQL UNION ALL
-- First get distinct union of tokens from both strings
VAR Tokens_Union_Distinct =
DISTINCT ( UNION ( String1_Tokens, String2_Tokens ) )
-- Then repeat each token the max number of times it appears in each string
VAR Tokens_Union =
SELECTCOLUMNS (
GENERATE (
Tokens_Union_Distinct,
VAR CurrentToken = [Token]
VAR String1_Count =
COUNTROWS ( FILTER ( String1_Tokens, [Token] = CurrentToken ) )
VAR String2_Count =
COUNTROWS ( FILTER ( String2_Tokens, [Token] = CurrentToken ) )
VAR Max_Count =
MAX ( String1_Count, String2_Count )
RETURN
-- create required number of rows
GENERATESERIES ( 1, Max_Count )
),
"Token", [Token]
)
VAR Jaccard =
DIVIDE (
COUNTROWS ( Tokens_Intersection ),
COUNTROWS ( Tokens_Union )
)
RETURN
Jaccard
One thing to note is that the DAX UNION function is similar to UNION ALL in SQL, so we need to modify its result to ensure each token is repeated the correct number of times (the max occurrences in any one of the strings).
Regards
Hi
Good evening.
Your solution was awesome. Thanks a lot. This formula works like magic. It has saved me lot of time and effort.
Thanks.
Karthik
Hi @eliasayyy
No problem, updated PBIX attached.
There may be some special cases to handle in both methods (empty strings etc) but this is at least a start 🙂
Code below.
Regards,
Owen
Measure: Jaccard Similarity Measure
VAR String1 = SELECTEDVALUE ( 'Name 1'[Name 1] )
VAR String2 = SELECTEDVALUE ( 'Name 2'[Name 2] )
RETURN
IF (
NOT ISBLANK ( String1 ) && NOT ISBLANK ( String2 ),
-- Tokenize strings using space
VAR Delimiter = " "
-- Convert to vertical bar delimiter to allow path functions
VAR Bar = "|"
VAR String1_BarDelimited =
SUBSTITUTE ( String1, Delimiter, Bar )
VAR String1_TokenCount =
PATHLENGTH ( String1_BarDelimited )
VAR String2_BarDelimited =
SUBSTITUTE ( String2, Delimiter, Bar )
VAR String2_TokenCount =
PATHLENGTH ( String2_BarDelimited )
VAR String1_Tokens =
SELECTCOLUMNS (
GENERATESERIES ( 1, String1_TokenCount ),
"Token", PATHITEM ( String1_BarDelimited, [Value] )
)
VAR String2_Tokens =
SELECTCOLUMNS (
GENERATESERIES ( 1, String2_TokenCount ),
"Token", PATHITEM ( String2_BarDelimited, [Value] )
)
-- DAX INTERSECT function preserves duplicates so is suitable to find intersection
VAR Tokens_Intersection =
INTERSECT ( String1_Tokens, String2_Tokens )
-- DAX UNION function not suitable as it is equivalent to SQL UNION ALL
-- First get distinct union of tokens from both strings
VAR Tokens_Union_Distinct =
DISTINCT ( UNION ( String1_Tokens, String2_Tokens ) )
-- Then repeat each token the max number of times it appears in each string
VAR Tokens_Union =
SELECTCOLUMNS (
GENERATE (
Tokens_Union_Distinct,
VAR CurrentToken = [Token]
VAR String1_Count =
COUNTROWS ( FILTER ( String1_Tokens, [Token] = CurrentToken ) )
VAR String2_Count =
COUNTROWS ( FILTER ( String2_Tokens, [Token] = CurrentToken ) )
VAR Max_Count =
MAX ( String1_Count, String2_Count )
RETURN
-- create required number of rows
GENERATESERIES ( 1, Max_Count )
),
"Token", [Token]
)
VAR Jaccard =
DIVIDE (
COALESCE ( COUNTROWS ( Tokens_Intersection ), 0 ),
COUNTROWS ( Tokens_Union )
)
RETURN
Jaccard
)
Power Query function: fn_JaccardSimilarity
( String1 as text, String2 as text ) as number =>
let
String1_Tokens = Text.Split(String1, " "),
String2_Tokens = Text.Split(String2, " "),
Intersection = List.Intersect({String1_Tokens,String2_Tokens}),
Union = List.Union({String1_Tokens,String2_Tokens}),
Intersection_Count = List.Count(Intersection),
Union_Count = List.Count(Union),
Jaccard_Similarity = Intersection_Count / Union_Count
in Jaccard_Similarity
i thin you could refer to this discussion
https://community.fabric.microsoft.com/t5/Desktop/Jaccard-Index-similarity-metric-calculation-in-Pow....
March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!
Your insights matter. That’s why we created a quick survey to learn about your experience finding answers to technical questions.
Arun Ulag shares exciting details about the Microsoft Fabric Conference 2025, which will be held in Las Vegas, NV.
User | Count |
---|---|
23 | |
15 | |
12 | |
9 | |
8 |
User | Count |
---|---|
40 | |
32 | |
29 | |
12 | |
11 |