
smpa01
Super User

Deterministic Unique IDs

I am working with a large data model, and I am using strings from different dimension tables in the calculation of measures.

For these calculations, I need to generate a deterministic unique ID (the same input always generates the same output) on the fly from the string values, to be fed into the measure calculation later.

I am currently doing the following.

 

EVALUATE
VAR base =
    DATATABLE (
        "Name", STRING,
        { { "Lorem" }, { "Ipsum" }, { "dolor" }, { "sit" }, { "amet" } }
    )
VAR cte =
    ADDCOLUMNS ( base, "unique_hash", HASH ( [Name] ) )
RETURN
    cte

 

My questions:

  1. Is it at all possible for HASH to generate the same value (unique_hash) for two different strings, i.e. a collision? (I am working with a large number of strings coming from different dim tables.)
  2. If yes, what is the best way to achieve the desired end result using DAX? (DAX is the only option.)
  3. I refrained from using ROWNUMBER because it can't be used on a derived table.

 

Also, as far as I know (please feel free to correct me), HASH is presumably subject to the pigeonhole principle (n items placed into m containers): if n > m, a hash collision must occur. Therefore, if I could learn m (the total number of possible HASH values) and ensure that n < m, the hash collision is probably avoidable. Is there any documentation on the HASH function of SSAS, and what is its output range (e.g., a 32-bit signed integer, range 2^32)?
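In the meantime, collisions in a concrete input set can be detected empirically by comparing the number of distinct strings with the number of distinct hashes; this sketch reuses the query pattern above, and any collision makes the second count smaller than the first:

```dax
EVALUATE
VAR base =
    DATATABLE (
        "Name", STRING,
        { { "Lorem" }, { "Ipsum" }, { "dolor" }, { "sit" }, { "amet" } }
    )
VAR hashed =
    ADDCOLUMNS ( base, "unique_hash", HASH ( [Name] ) )
RETURN
    ROW (
        "distinct_strings",
            COUNTROWS ( DISTINCT ( SELECTCOLUMNS ( hashed, "n", [Name] ) ) ),
        "distinct_hashes",
            COUNTROWS ( DISTINCT ( SELECTCOLUMNS ( hashed, "h", [unique_hash] ) ) )
    )
```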

 

 

Thank you in advance.

@AlexisOlson @jeffrey_wang 

Did I answer your question? Mark my post as a solution!
Proud to be a Super User!
My custom visualization projects
Plotting Live Sound: Viz1
Beautiful News:Viz1, Viz2, Viz3
Visual Capitalist: Working Hrs
1 ACCEPTED SOLUTION
AlexisOlson
Super User

  1. In your example the strings are unique. Is this true in your real scenario?
    • If not unique, please verify whether or not duplicate rows should have matching unique ID values.
    • If unique, why do you need a unique ID column?
  2. The Pigeonhole Principle is not really part of a hash algorithm. It's a mathematical theorem that states that if you have more than N items to put into N containers, at least one container will contain multiple items (i.e. a hash collision). You are likely to get hash collisions long before you approach the number of possible hash outputs. It's just that if you go over, you're mathematically guaranteed to have them.
  3. It looks like the HASH function outputs values in the range of about ±9.2 × 10^18. This is roughly the range of a 64-bit signed integer, assuming it uses that full range.
  4. Check out the probability-of-random-collisions table on Wikipedia to get an idea of how likely a collision is. For example, assuming every hash output is equally probable, and the HASH function has an output on the order of 64 bits, you can have up to around half a billion input values with a <1% probability of a random hash collision. Or use this tool:
     AlexisOlson_0-1738361667900.png
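For reference, the figures in that table come from the standard birthday approximation. For a hash with N equally likely outputs and k inputs:

```latex
p(\text{collision}) \approx 1 - e^{-k^2/(2N)};
\quad N = 2^{64},\ k = 5 \times 10^{8}
\ \Rightarrow\ p \approx 1 - e^{-0.0068} \approx 0.68\% < 1\%.
```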

     


@AlexisOlson Thanks for this, and apologies for the delayed response.

To summarize: assuming DAX HASH is 64-bit, the probability is directly proportional to k (lower k, lower probability; higher k, higher probability).

I probably have only ~2,500 strings to run through HASH, so I should be good. I wonder why MS has no literature around it.
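For scale: with the small-probability form of the birthday approximation, k ≈ 2,500 inputs against a 64-bit output space gives

```latex
p \approx \frac{k^2}{2N} = \frac{2500^2}{2 \cdot 2^{64}} \approx 1.7 \times 10^{-13},
```

which is negligible under the uniform-output assumption.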

 

Thanks for enlightening me on the pigeonhole principle, but it is the least of my problems at this scale.

 

In your example the strings are unique. Is this true in your real scenario? - Yes; this table is generated internally through SUMMARIZECOLUMNS from the filter context.

 

If unique, why do you need a unique ID column? - I need to use it as a tiebreaker in RANKX.

 

smpa01_0-1739464481995.png

 


Assuming DAX HASH is 64 bit , probability is directly proportional to k (lower k lower probability, higher k higher probability)

Take a look at the Wiki article again: the probability is proportional to k², not k.


If all you need is a tiebreaker and values are unique, you can use the string itself rather than a hash.


Assuming your string column is [Name], try

 

ADDCOLUMNS (
    base,
    "rank", RANK ( base, ORDERBY ( [Value], ASC, [Name], ASC ) )
)

 

With RANK, you can easily add multiple ORDERBY conditions and they don't need to be numeric.

That's right. But just in case I need to utilize derived, on-the-fly values like the one below, they can't be accommodated in RANK:

VAR cte_1 = ADDCOLUMNS ( cte_0, "derived", some_dax_callback_that_generate_string )

RANKX is (probably) all-weather from the original/derived value perspective (hence HASH). Otherwise, RANK is pretty good.

 


I don't follow. You can use derived columns in the ORDERBY subfunction of RANK.

If you want to use RANKX, then I'd recommend uniquely ranking the strings (alphabetically) in your CTE instead of hashing them and then using that rank instead of the hash as your tiebreaker.
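A minimal sketch of that idea, assuming the same toy strings as earlier: the alphabetical rank is deterministic and collision-free, so it can stand in for the hash as the RANKX tiebreaker.

```dax
EVALUATE
VAR cte =
    DATATABLE (
        "Name", STRING,
        { { "Lorem" }, { "Ipsum" }, { "dolor" }, { "sit" }, { "amet" } }
    )
RETURN
    // unique alphabetical rank: deterministic and collision-free,
    // usable as the tiebreaker in a RANKX expression
    ADDCOLUMNS ( cte, "alpha_rank", RANK ( cte, ORDERBY ( [Name], ASC ) ) )
```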

Sorry, I was trying to use RANK on its own (which is why derived fields were not showing):

VAR cte1 =
    ADDCOLUMNS (
        ALLSELECTED ( 'Table' ),
        "string",
            SWITCH (
                TRUE (),
                'Table'[row] = 1, "lorem",
                'Table'[row] = 2, "ipsum",
                'Table'[row] = 3, "dolores"
            )
    )
VAR cte2 = RANK ()

instead of using it inside an ADDCOLUMNS wrapper. With the wrapper it works perfectly and does not require HASH.
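For reference, a sketch of the working wrapper pattern (table and column names assumed, continuing the cte1 example): RANK evaluated inside ADDCOLUMNS sees the derived [string] column, so no HASH is needed.

```dax
VAR cte1 =
    ADDCOLUMNS (
        ALLSELECTED ( 'Table' ),
        "string",
            SWITCH (
                TRUE (),
                'Table'[row] = 1, "lorem",
                'Table'[row] = 2, "ipsum",
                'Table'[row] = 3, "dolores"
            )
    )
VAR cte2 =
    // the derived [string] column is available to ORDERBY here
    ADDCOLUMNS ( cte1, "rank", RANK ( cte1, ORDERBY ( [string], ASC ) ) )
```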

 

Thanks

 


Here's a not entirely unrelated but fun recent article.

 

Undergraduate Upends a 40-Year-Old Data Science Conjecture | Quanta Magazine
