March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early bird discount ends December 31.
Register NowBe one of the first to start using Fabric Databases. View on-demand sessions with database experts and the Microsoft product team to learn just how easy it is to get started. Watch now
hi all, I have a dataset consisting of a table with 33 columns x 30 rows. The values in each cell are text and I want to calculate the so-called Jaccard Index, a measure of similarity, for each combination of two columns. I can do this manually in Power Query but for a table with 33 columns this results in 528 comparisons so I'm hoping this could be automated somehow.
To clarify with a simplified example (see link with attached .pbix at bottom), suppose I have four fruit smoothie recipes, each with five ingredients:
The Jaccard Index is calculated as the number of values featuring in both sets (intersection), divided by the unique number across both sets (union). With 4 recipes there are 6 comparisons: 1&2, 1&3, 1&4, 2&3, 2&4 and 3&4.
So for Recipe 1&4 the Jaccard Index is 4 (i.e. Pineapple, Strawberry, Banana and Kiwi) divided by 6 = 0.667, whereas for Recipe 1&2 the Jaccard Index is 1 (i.e Strawberry) divided by 9 = 0.11
For this demo it is straightforward to calculate the Jaccard Indices for each of the 6 combinations in Power Query and store in table like this (see link to attached .pbix file at bottom):
but with my actual dataset existing of 33 columns doing this manually is not feasible.
Is there a way to automate this in Power Query or should I do something like this in, say, R / Python?
.pbix file: https://file.io/tqm0sQrFR8jS
Many thanks, Bastiaan
Solved! Go to Solution.
@BastiaanBrak see the attached solution, it is scalable to add many recipes or ingredients you add into the model.
Two parts of the solution:
Power Query created a table called Recipe and Recipe Ingredients, and there is a function that does some transformation called fnRecipe
2nd, there is a DAX measure called Jaccard Index that does the calculation based on the combination, and here is the output.
You can tweak the solution as your fit and optimize the PQ step. If something is not clear, do let me know.
✨ Follow us on LinkedIn
Check my latest blog post The Power of Using Calculation Groups with Inactive Relationships (Part 1) (perytus.com) I would ❤ Kudos if my solution helped. 👉 If you can spend time posting the question, you can also make efforts to give Kudos to whoever helped to solve your problem. It is a token of appreciation!
⚡ Visit us at https://perytus.com, your one-stop-shop for Power BI-related projects/training/consultancy.⚡
Subscribe to the @PowerBIHowTo YT channel for an upcoming video on List and Record functions in Power Query!!
Learn Power BI and Fabric - subscribe to our YT channel - Click here: @PowerBIHowTo
If my solution proved useful, I'd be delighted to receive Kudos. When you put effort into asking a question, it's equally thoughtful to acknowledge and give Kudos to the individual who helped you solve the problem. It's a small gesture that shows appreciation and encouragement! ❤
Did I answer your question? Mark my post as a solution. Proud to be a Super User! Appreciate your Kudos 🙂
Feel free to email me with any of your BI needs.
@BastiaanBrak and @parry2k I couldn't resist doing this in DAX. Assumes an unpivoted table:
Jaccard =
VAR __Table = DISTINCT('Recipe Ingredients'[Recipe])
VAR __Table1 = GENERATE(SELECTCOLUMNS(__Table,"Recipe 1",[Recipe]),SELECTCOLUMNS(__Table,"Recipe 2",[Recipe]))
VAR __Table2 = FILTER(__Table1,[Recipe 1] <> [Recipe 2])
VAR __Table3 = ADDCOLUMNS(__Table2,"1",RIGHT([Recipe 2],1),"2",RIGHT([Recipe 1],1))
VAR __Table4 = FILTER(__Table3,[1]<[2])
VAR __Table5 = ADDCOLUMNS(__Table4,"Recipe","Recipe " & [1] & "&" & [2])
VAR __Table6 = ADDCOLUMNS(__Table5,"Total",
COUNTROWS(
DISTINCT(
UNION(
SELECTCOLUMNS(FILTER('Recipe Ingredients',[Recipe] = [Recipe 1]),"Ingredient",[Ingredient]),
SELECTCOLUMNS(FILTER('Recipe Ingredients',[Recipe] = [Recipe 2]),"Ingredient",[Ingredient])
)
)
)
)
VAR __Table7 = ADDCOLUMNS(__Table6,"Same",
COUNTROWS(
INTERSECT(
SELECTCOLUMNS(FILTER('Recipe Ingredients',[Recipe] = [Recipe 1]),"Ingredient",[Ingredient]),
SELECTCOLUMNS(FILTER('Recipe Ingredients',[Recipe] = [Recipe 2]),"Ingredient",[Ingredient])
)
)
)
VAR __Table8 = ADDCOLUMNS(__Table7,"Jaccard Index",DIVIDE([Same],[Total],0))
RETURN
SELECTCOLUMNS(__Table8,"Recipe",[Recipe],"Jaccard Index",[Jaccard Index])
@parry2k @Greg_Deckler Thanks so much both of you!! I'm not sure if I can accept both replies as solution but will try.
@Greg_Deckler Looks good. You are creating a calculated table, which makes sense. Thanks for sharing.
The only point I want to add, in case users want to extend the functionality to slicer/dice the data, calculate table functionality will not work.
Subscribe to the @PowerBIHowTo YT channel for an upcoming video on List and Record functions in Power Query!!
Learn Power BI and Fabric - subscribe to our YT channel - Click here: @PowerBIHowTo
If my solution proved useful, I'd be delighted to receive Kudos. When you put effort into asking a question, it's equally thoughtful to acknowledge and give Kudos to the individual who helped you solve the problem. It's a small gesture that shows appreciation and encouragement! ❤
Did I answer your question? Mark my post as a solution. Proud to be a Super User! Appreciate your Kudos 🙂
Feel free to email me with any of your BI needs.
@parry2k Yeah, I went with a calculated table but in theory you could do the same thing with a measure and it would be pretty similar code overall. You would have to assume two independent slicers, etc. etc. All you would need then is the COUNTROWS from Table7 and the COUNTROWS from Table8 at that point.
Nice Power Query code BTW!
@BastiaanBrak see the attached solution, it is scalable to add many recipes or ingredients you add into the model.
Two parts of the solution:
Power Query created a table called Recipe and Recipe Ingredients, and there is a function that does some transformation called fnRecipe
2nd, there is a DAX measure called Jaccard Index that does the calculation based on the combination, and here is the output.
You can tweak the solution as your fit and optimize the PQ step. If something is not clear, do let me know.
✨ Follow us on LinkedIn
Check my latest blog post The Power of Using Calculation Groups with Inactive Relationships (Part 1) (perytus.com) I would ❤ Kudos if my solution helped. 👉 If you can spend time posting the question, you can also make efforts to give Kudos to whoever helped to solve your problem. It is a token of appreciation!
⚡ Visit us at https://perytus.com, your one-stop-shop for Power BI-related projects/training/consultancy.⚡
Subscribe to the @PowerBIHowTo YT channel for an upcoming video on List and Record functions in Power Query!!
Learn Power BI and Fabric - subscribe to our YT channel - Click here: @PowerBIHowTo
If my solution proved useful, I'd be delighted to receive Kudos. When you put effort into asking a question, it's equally thoughtful to acknowledge and give Kudos to the individual who helped you solve the problem. It's a small gesture that shows appreciation and encouragement! ❤
Did I answer your question? Mark my post as a solution. Proud to be a Super User! Appreciate your Kudos 🙂
Feel free to email me with any of your BI needs.
@BastiaanBrak and @parry2k I couldn't resist doing this in DAX. Assumes an unpivoted table:
Jaccard =
VAR __Table = DISTINCT('Recipe Ingredients'[Recipe])
VAR __Table1 = GENERATE(SELECTCOLUMNS(__Table,"Recipe 1",[Recipe]),SELECTCOLUMNS(__Table,"Recipe 2",[Recipe]))
VAR __Table2 = FILTER(__Table1,[Recipe 1] <> [Recipe 2])
VAR __Table3 = ADDCOLUMNS(__Table2,"1",RIGHT([Recipe 2],1),"2",RIGHT([Recipe 1],1))
VAR __Table4 = FILTER(__Table3,[1]<[2])
VAR __Table5 = ADDCOLUMNS(__Table4,"Recipe","Recipe " & [1] & "&" & [2])
VAR __Table6 = ADDCOLUMNS(__Table5,"Total",
COUNTROWS(
DISTINCT(
UNION(
SELECTCOLUMNS(FILTER('Recipe Ingredients',[Recipe] = [Recipe 1]),"Ingredient",[Ingredient]),
SELECTCOLUMNS(FILTER('Recipe Ingredients',[Recipe] = [Recipe 2]),"Ingredient",[Ingredient])
)
)
)
)
VAR __Table7 = ADDCOLUMNS(__Table6,"Same",
COUNTROWS(
INTERSECT(
SELECTCOLUMNS(FILTER('Recipe Ingredients',[Recipe] = [Recipe 1]),"Ingredient",[Ingredient]),
SELECTCOLUMNS(FILTER('Recipe Ingredients',[Recipe] = [Recipe 2]),"Ingredient",[Ingredient])
)
)
)
VAR __Table8 = ADDCOLUMNS(__Table7,"Jaccard Index",DIVIDE([Same],[Total],0))
RETURN
SELECTCOLUMNS(__Table8,"Recipe",[Recipe],"Jaccard Index",[Jaccard Index])
March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!
Your insights matter. That’s why we created a quick survey to learn about your experience finding answers to technical questions.
Arun Ulag shares exciting details about the Microsoft Fabric Conference 2025, which will be held in Las Vegas, NV.
User | Count |
---|---|
124 | |
89 | |
84 | |
70 | |
51 |
User | Count |
---|---|
206 | |
143 | |
97 | |
79 | |
68 |