
smpa01
Super User

Deterministic Unique IDs

I am working with a large data model, and I am using strings from different dimension tables in the calculation of measures.

For these calculations, I need to generate a deterministic unique ID (the same input always generates the same output) on the fly from the string values, to be fed into the measure calculation later.

I am currently doing the following.

 

EVALUATE
VAR base =
    DATATABLE (
        "Name", STRING,
        { { "Lorem" }, { "Ipsum" }, { "dolor" }, { "sit" }, { "amet" } }
    )
VAR cte =
    ADDCOLUMNS ( base, "unique_hash", HASH ( [Name] ) )
RETURN
    cte

 

My questions:

  1. Is it at all possible for HASH to generate the same value (unique_hash) for two different strings, i.e. a collision? (I am working with a large number of strings coming from different dim tables.)
  2. If yes, what is the best way to achieve the desired end result using DAX? (DAX is the only option.)
  3. I refrained from using ROWNUMBER because it can't be used on a derived table.

 

Also, as far as I know (please feel free to correct me), HASH is presumably subject to the pigeonhole principle (n items placed into m containers): if n > m, a hash collision must occur. Therefore, if I could learn m (the total number of possible HASH values) and ensure that n < m, the hash collision is probably avoidable. Is there any documentation on the HASH function of SSAS, and what is its output range (e.g., a 32-bit signed integer, range 2^32)?
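In the meantime, collisions in a concrete input set can be detected empirically by comparing the number of distinct strings with the number of distinct hashes; this sketch reuses the query pattern above, and any collision makes the second count smaller than the first:

```dax
EVALUATE
VAR base =
    DATATABLE (
        "Name", STRING,
        { { "Lorem" }, { "Ipsum" }, { "dolor" }, { "sit" }, { "amet" } }
    )
VAR hashed =
    ADDCOLUMNS ( base, "unique_hash", HASH ( [Name] ) )
RETURN
    ROW (
        "distinct_strings",
            COUNTROWS ( DISTINCT ( SELECTCOLUMNS ( hashed, "n", [Name] ) ) ),
        "distinct_hashes",
            COUNTROWS ( DISTINCT ( SELECTCOLUMNS ( hashed, "h", [unique_hash] ) ) )
    )
```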

 

 

Thank you in advance.

@AlexisOlson @jeffrey_wang 

Did I answer your question? Mark my post as a solution!
Proud to be a Super User!
My custom visualization projects
Plotting Live Sound: Viz1
Beautiful News:Viz1, Viz2, Viz3
Visual Capitalist: Working Hrs
1 ACCEPTED SOLUTION
AlexisOlson
Super User

  1. In your example the strings are unique. Is this true in your real scenario?
    • If not unique, please verify whether or not duplicate rows should have matching unique ID values.
    • If unique, why do you need a unique ID column?
  2. The Pigeonhole Principle is not really part of a hash algorithm. It's a mathematical theorem that states that if you have more than N items to put into N containers, at least one container will contain multiple items (i.e. a hash collision). You are likely to get hash collisions long before you approach the number of possible hash outputs. It's just that if you go over, you're mathematically guaranteed to have them.
  3. It looks like the HASH function outputs values in the range of about ±9.2 × 10^18. This is roughly the range of a 64-bit signed integer, assuming it uses that full range.
  4. Check out the probability-of-random-collisions table on Wikipedia to get an idea of how likely a collision is. For example, assuming every hash output is equally probable, and the HASH function has an output on the order of 64 bits, you can have up to around half a billion input values with a <1% probability of a random hash collision. Or use this tool:
     AlexisOlson_0-1738361667900.png
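For reference, the figures in that table come from the standard birthday approximation. For a hash with N equally likely outputs and k inputs:

```latex
p(\text{collision}) \approx 1 - e^{-k^2/(2N)};
\quad N = 2^{64},\ k = 5 \times 10^{8}
\ \Rightarrow\ p \approx 1 - e^{-0.0068} \approx 0.68\% < 1\%.
```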

     


@AlexisOlson Thanks for this, and apologies for the delayed response.

To summarize: assuming DAX HASH is 64-bit, the probability is directly proportional to k (lower k, lower probability; higher k, higher probability).

I probably have only ~2,500 strings to run through HASH, so I should be good. I wonder why MS has no literature around it.
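For scale: with the small-probability form of the birthday approximation, k ≈ 2,500 inputs against a 64-bit output space gives

```latex
p \approx \frac{k^2}{2N} = \frac{2500^2}{2 \cdot 2^{64}} \approx 1.7 \times 10^{-13},
```

which is negligible under the uniform-output assumption.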

 

Thanks for enlightening me on the pigeonhole principle, but it is the least of my problems at this scale.

 

In your example the strings are unique. Is this true in your real scenario? - Yes; this table is generated internally through SUMMARIZECOLUMNS from the filter context.

 

If unique, why do you need a unique ID column? - I need to use it as a tiebreaker in RANKX.

 

smpa01_0-1739464481995.png

 


Assuming DAX HASH is 64 bit , probability is directly proportional to k (lower k lower probability, higher k higher probability)

Take a look at the Wiki article again: the probability is proportional to k², not k.


If all you need is a tiebreaker and values are unique, you can use the string itself rather than a hash.


Assuming your string column is [Name], try

 

ADDCOLUMNS (
    base,
    "rank", RANK ( base, ORDERBY ( [Value], ASC, [Name], ASC ) )
)

 

With RANK, you can easily add multiple ORDERBY conditions and they don't need to be numeric.

That's right. But just in case I need to utilize derived, on-the-fly values like the one below, they can't be accommodated in RANK:

VAR cte_1 = ADDCOLUMNS ( cte_0, "derived", some_dax_callback_that_generate_string )

RANKX is (probably) all-weather from the original/derived value perspective (hence HASH). Otherwise, RANK is pretty good.

 


I don't follow. You can use derived columns in the ORDERBY subfunction of RANK.

If you want to use RANKX, then I'd recommend uniquely ranking the strings (alphabetically) in your CTE instead of hashing them and then using that rank instead of the hash as your tiebreaker.
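A minimal sketch of that idea, assuming the same toy strings as earlier: the alphabetical rank is deterministic and collision-free, so it can stand in for the hash as the RANKX tiebreaker.

```dax
EVALUATE
VAR cte =
    DATATABLE (
        "Name", STRING,
        { { "Lorem" }, { "Ipsum" }, { "dolor" }, { "sit" }, { "amet" } }
    )
RETURN
    // unique alphabetical rank: deterministic and collision-free,
    // usable as the tiebreaker in a RANKX expression
    ADDCOLUMNS ( cte, "alpha_rank", RANK ( cte, ORDERBY ( [Name], ASC ) ) )
```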

Sorry, I was trying to use RANK on its own (which is why derived fields were not showing):

VAR cte1 =
    ADDCOLUMNS (
        ALLSELECTED ( 'Table' ),
        "string",
            SWITCH (
                TRUE (),
                'Table'[row] = 1, "lorem",
                'Table'[row] = 2, "ipsum",
                'Table'[row] = 3, "dolores"
            )
    )
VAR cte2 = RANK ()

instead of using it inside an ADDCOLUMNS wrapper. With the wrapper it works perfectly and does not require HASH.
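For reference, a sketch of the working wrapper pattern (table and column names assumed, continuing the cte1 example): RANK evaluated inside ADDCOLUMNS sees the derived [string] column, so no HASH is needed.

```dax
VAR cte1 =
    ADDCOLUMNS (
        ALLSELECTED ( 'Table' ),
        "string",
            SWITCH (
                TRUE (),
                'Table'[row] = 1, "lorem",
                'Table'[row] = 2, "ipsum",
                'Table'[row] = 3, "dolores"
            )
    )
VAR cte2 =
    // the derived [string] column is available to ORDERBY here
    ADDCOLUMNS ( cte1, "rank", RANK ( cte1, ORDERBY ( [string], ASC ) ) )
```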

 

Thanks

 


Here's a not entirely unrelated but fun recent article.

 

Undergraduate Upends a 40-Year-Old Data Science Conjecture | Quanta Magazine
