Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Be one of the first to start using Fabric Databases. View on-demand sessions with database experts and the Microsoft product team to learn just how easy it is to get started. Watch now

Reply
jppv20
Solution Sage
Solution Sage

Calculate Jaccard Similarity between two text strings

Hi,

 

I have a table with two columns containing a text string like this:

Name 1Name 2
Savoy SuitesSavoy Suites Hotel Apartments
NH Prague CityNH Prague City
Badhotel ScheveningenBadhotel Scheveningen
Terme Tuhelj en CampingTerme Tuhelj
Les Lauriers RosesBelambra Club Les Lauriers Roses
Nuovo Natural VillageNatural Village Resort
Torfhaus HarzresortTorfhaus Harzresort
Muenster Kongresscenter Affiliated by MeliaHotel Münster Kongresscenter Affiliated by Meliá
Royal Azur Thalasso GolfRoyal Azur Thalasso Golf
EperlandHotel Eperland

 

Is it possible creating a new column (or using a measure) to calculate the Jaccard Index for these two strings?

 

Thanks in advance!

 

Best,

Jori


@Greg_Deckler 

1 ACCEPTED SOLUTION
OwenAuger
Super User
Super User

Hi @jppv20 

If the pairs of strings are known in advance, I would say this task more suited to Power Query. You could create a custom function for Jaccard Similarity.

 

However, it is interesting to look at doing this with DAX 🙂

 

I have attached an example with a Jaccard Similarity calculated column (which could be adapted to a measure).

 

I assumed that strings would be tokenized as words, delimited by spaces.  Does this match your methodolgy? If not, I'm sure it can be modified to suit.

 

Here is the code for the calculated column:

Jaccard Similarity = 
VAR String1 = Data[Name 1]
VAR String2 = Data[Name 2]
-- Tokenize strings using space
VAR Delimiter = " " 
-- Convert to vertical bar delimiter to allow path functions
VAR Bar = "|" 
VAR String1_BarDelimited =
    SUBSTITUTE ( String1, Delimiter, Bar )
VAR String1_TokenCount =
     PATHLENGTH ( String1_BarDelimited )
VAR String2_BarDelimited =
    SUBSTITUTE ( String2, Delimiter, Bar )
VAR String2_TokenCount =
     PATHLENGTH ( String2_BarDelimited )
VAR String1_Tokens =
    SELECTCOLUMNS (
        GENERATESERIES ( 1, String1_TokenCount ),
        "Token", PATHITEM ( String1_BarDelimited, [Value] )
    )
VAR String2_Tokens =
    SELECTCOLUMNS (
        GENERATESERIES ( 1, String2_TokenCount ),
        "Token", PATHITEM ( String2_BarDelimited, [Value] )
    )
-- DAX INTERSECT function preserves duplicates so is suitable to find intersection
VAR Tokens_Intersection =
    INTERSECT ( String1_Tokens, String2_Tokens )
-- DAX UNION function not suitable as it is equivalent to SQL UNION ALL
-- First get distinct union of tokens from both strings
VAR Tokens_Union_Distinct =
    DISTINCT ( UNION ( String1_Tokens, String2_Tokens ) )
-- Then repeat each token the max number of times it appears in each string
VAR Tokens_Union =
    SELECTCOLUMNS (
        GENERATE (
            Tokens_Union_Distinct,
            VAR CurrentToken = [Token]
            VAR String1_Count =
                COUNTROWS ( FILTER ( String1_Tokens, [Token] = CurrentToken ) )
            VAR String2_Count =
                COUNTROWS ( FILTER ( String2_Tokens, [Token] = CurrentToken ) )
            VAR Max_Count =
                MAX ( String1_Count, String2_Count )
            RETURN
                -- create required number of rows
                GENERATESERIES ( 1, Max_Count ) 
        ),
        "Token", [Token]
    )     
VAR Jaccard =
    DIVIDE (
        COUNTROWS ( Tokens_Intersection ),
        COUNTROWS ( Tokens_Union )
    )
RETURN
    Jaccard

One thing to note is that the DAX UNION function is similar to UNION ALL in SQL, so we need to modify its result to ensure each token is repeated the correct number of times (the max occurrences in any one of the strings).

OwenAuger_0-1690543905195.png

 

Regards


Owen Auger
Did I answer your question? Mark my post as a solution!
Blog
Twitter
LinkedIn

View solution in original post

6 REPLIES 6
OwenAuger
Super User
Super User

Hi @jppv20 

If the pairs of strings are known in advance, I would say this task more suited to Power Query. You could create a custom function for Jaccard Similarity.

 

However, it is interesting to look at doing this with DAX 🙂

 

I have attached an example with a Jaccard Similarity calculated column (which could be adapted to a measure).

 

I assumed that strings would be tokenized as words, delimited by spaces.  Does this match your methodolgy? If not, I'm sure it can be modified to suit.

 

Here is the code for the calculated column:

Jaccard Similarity = 
VAR String1 = Data[Name 1]
VAR String2 = Data[Name 2]
-- Tokenize strings using space
VAR Delimiter = " " 
-- Convert to vertical bar delimiter to allow path functions
VAR Bar = "|" 
VAR String1_BarDelimited =
    SUBSTITUTE ( String1, Delimiter, Bar )
VAR String1_TokenCount =
     PATHLENGTH ( String1_BarDelimited )
VAR String2_BarDelimited =
    SUBSTITUTE ( String2, Delimiter, Bar )
VAR String2_TokenCount =
     PATHLENGTH ( String2_BarDelimited )
VAR String1_Tokens =
    SELECTCOLUMNS (
        GENERATESERIES ( 1, String1_TokenCount ),
        "Token", PATHITEM ( String1_BarDelimited, [Value] )
    )
VAR String2_Tokens =
    SELECTCOLUMNS (
        GENERATESERIES ( 1, String2_TokenCount ),
        "Token", PATHITEM ( String2_BarDelimited, [Value] )
    )
-- DAX INTERSECT function preserves duplicates so is suitable to find intersection
VAR Tokens_Intersection =
    INTERSECT ( String1_Tokens, String2_Tokens )
-- DAX UNION function not suitable as it is equivalent to SQL UNION ALL
-- First get distinct union of tokens from both strings
VAR Tokens_Union_Distinct =
    DISTINCT ( UNION ( String1_Tokens, String2_Tokens ) )
-- Then repeat each token the max number of times it appears in each string
VAR Tokens_Union =
    SELECTCOLUMNS (
        GENERATE (
            Tokens_Union_Distinct,
            VAR CurrentToken = [Token]
            VAR String1_Count =
                COUNTROWS ( FILTER ( String1_Tokens, [Token] = CurrentToken ) )
            VAR String2_Count =
                COUNTROWS ( FILTER ( String2_Tokens, [Token] = CurrentToken ) )
            VAR Max_Count =
                MAX ( String1_Count, String2_Count )
            RETURN
                -- create required number of rows
                GENERATESERIES ( 1, Max_Count ) 
        ),
        "Token", [Token]
    )     
VAR Jaccard =
    DIVIDE (
        COUNTROWS ( Tokens_Intersection ),
        COUNTROWS ( Tokens_Union )
    )
RETURN
    Jaccard

One thing to note is that the DAX UNION function is similar to UNION ALL in SQL, so we need to modify its result to ensure each token is repeated the correct number of times (the max occurrences in any one of the strings).

OwenAuger_0-1690543905195.png

 

Regards


Owen Auger
Did I answer your question? Mark my post as a solution!
Blog
Twitter
LinkedIn

Hi 

Good evening.  

Your solution was awesome.  Thanks a lot.  This formula works like magic.  It has saved me lot of time and effort. 

Thanks. 

Karthik 

@OwenAuger can you please sahre how this might be done in power query 

Hi @eliasayyy 

No problem, updated PBIX attached.

 

  1. The measure is quite similar to the calculated column. However, to provide input to the measure, I created two disconnected tables containing strings ('Name 1' and 'Name 2').
  2. For Power Query I created a function. The code is significantly shorter than the DAX version, since Power Query has more convenient text-splitting functions, and the List.Union function gives us the result we need directly.

There may be some special cases to handle in both methods (empty strings etc) but this is at least a start 🙂

 

Code below.

Regards,

Owen

 

Measure: Jaccard Similarity Measure

VAR String1 = SELECTEDVALUE ( 'Name 1'[Name 1] )
VAR String2 = SELECTEDVALUE ( 'Name 2'[Name 2] )
RETURN
IF (
    NOT ISBLANK ( String1 ) && NOT ISBLANK ( String2 ),
    -- Tokenize strings using space
    VAR Delimiter = " " 
    -- Convert to vertical bar delimiter to allow path functions
    VAR Bar = "|" 
    VAR String1_BarDelimited =
        SUBSTITUTE ( String1, Delimiter, Bar )
    VAR String1_TokenCount =
        PATHLENGTH ( String1_BarDelimited )
    VAR String2_BarDelimited =
        SUBSTITUTE ( String2, Delimiter, Bar )
    VAR String2_TokenCount =
        PATHLENGTH ( String2_BarDelimited )
    VAR String1_Tokens =
        SELECTCOLUMNS (
            GENERATESERIES ( 1, String1_TokenCount ),
            "Token", PATHITEM ( String1_BarDelimited, [Value] )
        )
    VAR String2_Tokens =
        SELECTCOLUMNS (
            GENERATESERIES ( 1, String2_TokenCount ),
            "Token", PATHITEM ( String2_BarDelimited, [Value] )
        )
    -- DAX INTERSECT function preserves duplicates so is suitable to find intersection
    VAR Tokens_Intersection =
        INTERSECT ( String1_Tokens, String2_Tokens )
    -- DAX UNION function not suitable as it is equivalent to SQL UNION ALL
    -- First get distinct union of tokens from both strings
    VAR Tokens_Union_Distinct =
        DISTINCT ( UNION ( String1_Tokens, String2_Tokens ) )
    -- Then repeat each token the max number of times it appears in each string
    VAR Tokens_Union =
        SELECTCOLUMNS (
            GENERATE (
                Tokens_Union_Distinct,
                VAR CurrentToken = [Token]
                VAR String1_Count =
                    COUNTROWS ( FILTER ( String1_Tokens, [Token] = CurrentToken ) )
                VAR String2_Count =
                    COUNTROWS ( FILTER ( String2_Tokens, [Token] = CurrentToken ) )
                VAR Max_Count =
                    MAX ( String1_Count, String2_Count )
                RETURN
                    -- create required number of rows
                    GENERATESERIES ( 1, Max_Count ) 
            ),
            "Token", [Token]
        )
                
    VAR Jaccard =
        DIVIDE (
            COALESCE ( COUNTROWS ( Tokens_Intersection ), 0 ),
            COUNTROWS ( Tokens_Union )
        )
    RETURN
        Jaccard
)

 

Power Query function: fn_JaccardSimilarity

( String1 as text, String2 as text ) as number =>
let
    String1_Tokens = Text.Split(String1, " "),
    String2_Tokens = Text.Split(String2, " "),
    Intersection = List.Intersect({String1_Tokens,String2_Tokens}),
    Union = List.Union({String1_Tokens,String2_Tokens}),
    Intersection_Count = List.Count(Intersection),
    Union_Count = List.Count(Union),
    Jaccard_Similarity = Intersection_Count / Union_Count
in Jaccard_Similarity

 


Owen Auger
Did I answer your question? Mark my post as a solution!
Blog
Twitter
LinkedIn

@OwenAuger thank you very much

eliasayyy
Super User
Super User

Helpful resources

Announcements
Las Vegas 2025

Join us at the Microsoft Fabric Community Conference

March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!

Dec Fabric Community Survey

We want your feedback!

Your insights matter. That’s why we created a quick survey to learn about your experience finding answers to technical questions.

ArunFabCon

Microsoft Fabric Community Conference 2025

Arun Ulag shares exciting details about the Microsoft Fabric Conference 2025, which will be held in Las Vegas, NV.

December 2024

A Year in Review - December 2024

Find out what content was popular in the Fabric community during 2024.