Solved: Re: Calculate Jaccard Similarity between two text ...

jppv20 · ‎07-28-2023

Hi,

I have a table with two columns containing a text string like this:

Name 1	Name 2
Savoy Suites	Savoy Suites Hotel Apartments
NH Prague City	NH Prague City
Badhotel Scheveningen	Badhotel Scheveningen
Terme Tuhelj en Camping	Terme Tuhelj
Les Lauriers Roses	Belambra Club Les Lauriers Roses
Nuovo Natural Village	Natural Village Resort
Torfhaus Harzresort	Torfhaus Harzresort
Muenster Kongresscenter Affiliated by Melia	Hotel Münster Kongresscenter Affiliated by Meliá
Royal Azur Thalasso Golf	Royal Azur Thalasso Golf
Eperland	Hotel Eperland

Is it possible creating a new column (or using a measure) to calculate the Jaccard Index for these two strings?

Thanks in advance!

Best,

Jori

@Greg_Deckler

OwenAuger · ‎07-28-2023

Hi @jppv20

If the pairs of strings are known in advance, I would say this task more suited to Power Query. You could create a custom function for Jaccard Similarity.

However, it is interesting to look at doing this with DAX 🙂

I have attached an example with a Jaccard Similarity calculated column (which could be adapted to a measure).

I assumed that strings would be tokenized as words, delimited by spaces. Does this match your methodolgy? If not, I'm sure it can be modified to suit.

Here is the code for the calculated column:

Jaccard Similarity = 
VAR String1 = Data[Name 1]
VAR String2 = Data[Name 2]
-- Tokenize strings using space
VAR Delimiter = " " 
-- Convert to vertical bar delimiter to allow path functions
VAR Bar = "|" 
VAR String1_BarDelimited =
    SUBSTITUTE ( String1, Delimiter, Bar )
VAR String1_TokenCount =
     PATHLENGTH ( String1_BarDelimited )
VAR String2_BarDelimited =
    SUBSTITUTE ( String2, Delimiter, Bar )
VAR String2_TokenCount =
     PATHLENGTH ( String2_BarDelimited )
VAR String1_Tokens =
    SELECTCOLUMNS (
        GENERATESERIES ( 1, String1_TokenCount ),
        "Token", PATHITEM ( String1_BarDelimited, [Value] )
    )
VAR String2_Tokens =
    SELECTCOLUMNS (
        GENERATESERIES ( 1, String2_TokenCount ),
        "Token", PATHITEM ( String2_BarDelimited, [Value] )
    )
-- DAX INTERSECT function preserves duplicates so is suitable to find intersection
VAR Tokens_Intersection =
    INTERSECT ( String1_Tokens, String2_Tokens )
-- DAX UNION function not suitable as it is equivalent to SQL UNION ALL
-- First get distinct union of tokens from both strings
VAR Tokens_Union_Distinct =
    DISTINCT ( UNION ( String1_Tokens, String2_Tokens ) )
-- Then repeat each token the max number of times it appears in each string
VAR Tokens_Union =
    SELECTCOLUMNS (
        GENERATE (
            Tokens_Union_Distinct,
            VAR CurrentToken = [Token]
            VAR String1_Count =
                COUNTROWS ( FILTER ( String1_Tokens, [Token] = CurrentToken ) )
            VAR String2_Count =
                COUNTROWS ( FILTER ( String2_Tokens, [Token] = CurrentToken ) )
            VAR Max_Count =
                MAX ( String1_Count, String2_Count )
            RETURN
                -- create required number of rows
                GENERATESERIES ( 1, Max_Count ) 
        ),
        "Token", [Token]
    )     
VAR Jaccard =
    DIVIDE (
        COUNTROWS ( Tokens_Intersection ),
        COUNTROWS ( Tokens_Union )
    )
RETURN
    Jaccard

One thing to note is that the DAX UNION function is similar to UNION ALL in SQL, so we need to modify its result to ensure each token is repeated the correct number of times (the max occurrences in any one of the strings).

Regards

Owen Auger
Did I answer your question? Mark my post as a solution!
Blog
LinkedIn

View solution in original post

OwenAuger · ‎07-28-2023

Hi @jppv20

If the pairs of strings are known in advance, I would say this task more suited to Power Query. You could create a custom function for Jaccard Similarity.

However, it is interesting to look at doing this with DAX 🙂

I have attached an example with a Jaccard Similarity calculated column (which could be adapted to a measure).

I assumed that strings would be tokenized as words, delimited by spaces. Does this match your methodolgy? If not, I'm sure it can be modified to suit.

Here is the code for the calculated column:

Jaccard Similarity = 
VAR String1 = Data[Name 1]
VAR String2 = Data[Name 2]
-- Tokenize strings using space
VAR Delimiter = " " 
-- Convert to vertical bar delimiter to allow path functions
VAR Bar = "|" 
VAR String1_BarDelimited =
    SUBSTITUTE ( String1, Delimiter, Bar )
VAR String1_TokenCount =
     PATHLENGTH ( String1_BarDelimited )
VAR String2_BarDelimited =
    SUBSTITUTE ( String2, Delimiter, Bar )
VAR String2_TokenCount =
     PATHLENGTH ( String2_BarDelimited )
VAR String1_Tokens =
    SELECTCOLUMNS (
        GENERATESERIES ( 1, String1_TokenCount ),
        "Token", PATHITEM ( String1_BarDelimited, [Value] )
    )
VAR String2_Tokens =
    SELECTCOLUMNS (
        GENERATESERIES ( 1, String2_TokenCount ),
        "Token", PATHITEM ( String2_BarDelimited, [Value] )
    )
-- DAX INTERSECT function preserves duplicates so is suitable to find intersection
VAR Tokens_Intersection =
    INTERSECT ( String1_Tokens, String2_Tokens )
-- DAX UNION function not suitable as it is equivalent to SQL UNION ALL
-- First get distinct union of tokens from both strings
VAR Tokens_Union_Distinct =
    DISTINCT ( UNION ( String1_Tokens, String2_Tokens ) )
-- Then repeat each token the max number of times it appears in each string
VAR Tokens_Union =
    SELECTCOLUMNS (
        GENERATE (
            Tokens_Union_Distinct,
            VAR CurrentToken = [Token]
            VAR String1_Count =
                COUNTROWS ( FILTER ( String1_Tokens, [Token] = CurrentToken ) )
            VAR String2_Count =
                COUNTROWS ( FILTER ( String2_Tokens, [Token] = CurrentToken ) )
            VAR Max_Count =
                MAX ( String1_Count, String2_Count )
            RETURN
                -- create required number of rows
                GENERATESERIES ( 1, Max_Count ) 
        ),
        "Token", [Token]
    )     
VAR Jaccard =
    DIVIDE (
        COUNTROWS ( Tokens_Intersection ),
        COUNTROWS ( Tokens_Union )
    )
RETURN
    Jaccard

One thing to note is that the DAX UNION function is similar to UNION ALL in SQL, so we need to modify its result to ensure each token is repeated the correct number of times (the max occurrences in any one of the strings).

Regards

Owen Auger
Did I answer your question? Mark my post as a solution!
Blog
LinkedIn

Karthik_s3 · ‎11-20-2023

Hi

Good evening.

Your solution was awesome. Thanks a lot. This formula works like magic. It has saved me lot of time and effort.

Thanks.

Karthik

eliasayyy · ‎07-28-2023

@OwenAuger can you please sahre how this might be done in power query

OwenAuger · ‎07-29-2023

Hi @eliasayyy

No problem, updated PBIX attached.

The measure is quite similar to the calculated column. However, to provide input to the measure, I created two disconnected tables containing strings ('Name 1' and 'Name 2').
For Power Query I created a function. The code is significantly shorter than the DAX version, since Power Query has more convenient text-splitting functions, and the List.Union function gives us the result we need directly.

There may be some special cases to handle in both methods (empty strings etc) but this is at least a start 🙂

Code below.

Regards,

Owen

Measure: Jaccard Similarity Measure

VAR String1 = SELECTEDVALUE ( 'Name 1'[Name 1] )
VAR String2 = SELECTEDVALUE ( 'Name 2'[Name 2] )
RETURN
IF (
    NOT ISBLANK ( String1 ) && NOT ISBLANK ( String2 ),
    -- Tokenize strings using space
    VAR Delimiter = " " 
    -- Convert to vertical bar delimiter to allow path functions
    VAR Bar = "|" 
    VAR String1_BarDelimited =
        SUBSTITUTE ( String1, Delimiter, Bar )
    VAR String1_TokenCount =
        PATHLENGTH ( String1_BarDelimited )
    VAR String2_BarDelimited =
        SUBSTITUTE ( String2, Delimiter, Bar )
    VAR String2_TokenCount =
        PATHLENGTH ( String2_BarDelimited )
    VAR String1_Tokens =
        SELECTCOLUMNS (
            GENERATESERIES ( 1, String1_TokenCount ),
            "Token", PATHITEM ( String1_BarDelimited, [Value] )
        )
    VAR String2_Tokens =
        SELECTCOLUMNS (
            GENERATESERIES ( 1, String2_TokenCount ),
            "Token", PATHITEM ( String2_BarDelimited, [Value] )
        )
    -- DAX INTERSECT function preserves duplicates so is suitable to find intersection
    VAR Tokens_Intersection =
        INTERSECT ( String1_Tokens, String2_Tokens )
    -- DAX UNION function not suitable as it is equivalent to SQL UNION ALL
    -- First get distinct union of tokens from both strings
    VAR Tokens_Union_Distinct =
        DISTINCT ( UNION ( String1_Tokens, String2_Tokens ) )
    -- Then repeat each token the max number of times it appears in each string
    VAR Tokens_Union =
        SELECTCOLUMNS (
            GENERATE (
                Tokens_Union_Distinct,
                VAR CurrentToken = [Token]
                VAR String1_Count =
                    COUNTROWS ( FILTER ( String1_Tokens, [Token] = CurrentToken ) )
                VAR String2_Count =
                    COUNTROWS ( FILTER ( String2_Tokens, [Token] = CurrentToken ) )
                VAR Max_Count =
                    MAX ( String1_Count, String2_Count )
                RETURN
                    -- create required number of rows
                    GENERATESERIES ( 1, Max_Count ) 
            ),
            "Token", [Token]
        )
                
    VAR Jaccard =
        DIVIDE (
            COALESCE ( COUNTROWS ( Tokens_Intersection ), 0 ),
            COUNTROWS ( Tokens_Union )
        )
    RETURN
        Jaccard
)

Power Query function: fn_JaccardSimilarity

( String1 as text, String2 as text ) as number =>
let
    String1_Tokens = Text.Split(String1, " "),
    String2_Tokens = Text.Split(String2, " "),
    Intersection = List.Intersect({String1_Tokens,String2_Tokens}),
    Union = List.Union({String1_Tokens,String2_Tokens}),
    Intersection_Count = List.Count(Intersection),
    Union_Count = List.Count(Union),
    Jaccard_Similarity = Intersection_Count / Union_Count
in Jaccard_Similarity

Owen Auger
Did I answer your question? Mark my post as a solution!
Blog
LinkedIn

eliasayyy · ‎07-29-2023

@OwenAuger thank you very much

eliasayyy · ‎07-28-2023

i thin you could refer to this discussion

https://community.fabric.microsoft.com/t5/Desktop/Jaccard-Index-similarity-metric-calculation-in-Pow....

Calculate Jaccard Similarity between two text strings

Helpful resources

Fabric Community Update - July 2025

Power BI Monthly Update - July 2025

Join us at FabCon Vienna from September 15-18, 2025

Calculate Jaccard Similarity between two text strings

Helpful resources

Fabric Community Update - July 2025

Power BI Monthly Update - July 2025