Microsoft Fabric Community Conference 2025, March 31 - April 2, Las Vegas, Nevada. Use code FABINSIDER for a $400 discount.
Register nowThe Power BI DataViz World Championships are on! With four chances to enter, you could win a spot in the LIVE Grand Finale in Las Vegas. Show off your skills.
I am working with a large data model and I am utilizing strings from different dimension tables in the calculation of measures.
Now, for the sake of these calculations,
I need to generate a deterministic (same input always generate same output) unique ID (to be fed later on the measure calculation) on-the-fly based on the string values.
I am currently doing the following.
EVALUATE
var base = DATATABLE (
"Name", STRING ,
{
{ "Lorem" },
{ "Ipsum"},
{ "dolor" },
{ "sit"},
{"amet"}
}
)
var cte = ADDCOLUMNS(base, "unique_hash", HASH([Name]))
RETURN cte
My question is
Also, as far as I know (please feel free to correct me), HASH probably internally uses pigeonhole principle (n item to put on m containers). If n>m then the hash collision would occur. Therefore, if I get to know m (total number of possible HASH values) and if I can ensure the total number of n< total number of m, then the hah collision is probably avoidable. Is there any documentation available on HASH function of SSAS and what is the range of HASH (e,g, 32 bit signed integer; range 2^32 etc)?
Thank you in advance.
Solved! Go to Solution.
@AlexisOlson Thanks for this and apologies for the delayed reponse.
To sumarize,
Assuming DAX HASH is 64 bit , probability is directly proportional to k (lower k lower probability, higher k higher probability)
I probably have ~2500 string to run through HASH, so I should be good. I wonder why MS has no literature around it.
Thanks for enlighting me on pigeonhole principle but it is least of my problem as DAX HASH does not follow it.
In your example the strings are unique. Is this true in your real scenario? - yes, this table is generated internally through SUMMARIZEDCOLUMNS from the filter context
If unique, why do you need a unique ID column? - I need to use it for a tiebreaker in RANKX
Assuming DAX HASH is 64 bit , probability is directly proportional to k (lower k lower probability, higher k higher probability)
Take a look at the Wiki article again. Probability is proportional to k² rather than k.
If all you need is a tiebreaker and values are unique, you can use the string itself rather than a hash.
Assuming your string column is [Name], try
ADDCOLUMNS (
base,
"rank", RANK ( base, ORDERBY ( [Value], ASC, [Name], ASC ) )
)
With RANK, you can easily add multiple ORDERBY conditions and they don't need to be numeric.
That's right. But just in case, I need to utilize any derived on-the-fly values as below, they can't be accommodated in RANK.
RANKX is (probably) all weather from the original/derived value perspective(hence HASH). Else RANK is pretty good
var cte_1 = addcolumns(cte_0, "derived", some_dax_callback_that_generate_string)
I don't follow. You can use derived columns in the ORDERBY subfunction of RANK.
If you want to use RANKX, then I'd recommend uniquely ranking the strings (alphabetically) in your CTE instead of hashing them and then using that rank instead of the hash as your tiebreaker.
Sorry,
I was trying to use RANK on it own (which is why derived fields were not showing)
var cte1 = ADDCOLUMNS(ALLSELECTED('Table'), "string", switch(TRUE(),'Table'[row]=1,"lorem",'Table'[row]=2,"ipsum",'Table'[row]=3,"dolores"))
var cte2 = RANK()
instead of using it with ADDCOLUMNS wrapper.
With the wrapper it works perfectly and does not require HASH.
Thanks
Here's a not entirely unrelated but fun recent article.
Undergraduate Upends a 40-Year-Old Data Science Conjecture | Quanta Magazine
March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!
Check out the February 2025 Power BI update to learn about new features.
User | Count |
---|---|
86 | |
80 | |
53 | |
39 | |
39 |
User | Count |
---|---|
104 | |
85 | |
47 | |
44 | |
43 |