Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn a 50% discount on the DP-600 certification exam by completing the Fabric 30 Days to Learn It challenge.

Reply
arpost
Advocate V
Advocate V

Are GUIDs a viable substitute for integer Identities/Primary Keys?

Greetings, community. In the world of Fabric data warehouses, there is still no way to truly create int/bigint IDENTITY columns. Some of the proposals I've seen include:

 

  1. Use Row_Number() as your IDENTITY/PRIMARY KEY and sequence your jobs in such a way that conflicts aren't created (proposed in Microsoft's own documentation as a workaround).
  2. Use GUIDs as IDENTITY/PRIMARY KEY.

 

Regarding #1, there are numerous potential issues this approach poses:

  1. Uniqueness NOT guaranteed: There is always a risk that the ROW_NUMBER() approach will produce duplicate keys if jobs happen to run too long/overlap OR if a script returns the wrong start value.
  2. Load Process Severely Impacted: This is incredibly limiting when you have numerous data feeds that should be able to run in parallel. This is complicated even more when data load jobs have dependencies on multiple tables (meaning Table C needs a PK from Table B which needs a PK from Table A) as you either have to run source files through one at a time OR build pipelines that hold up other file loads until all other inserts are complete.

 

That leads me to #2 and using GUIDs to create unique identifiers for use in data ingestion/key assignment jobs. However, I know in the world of DBs that GUIDs have historically been found to be less performant than ints/bigints as IDENTITY/PRIMARY KEY fields and also take up more space. If you have a DB with millions of records, that translates to millions of GUIDs.

 

My question is: are the performance concerns lessened now that we're in the world of Fabric and Parquet-backed DWs? Have others in the community found GUIDs to be a viable route?

6 REPLIES 6
AndyDDC
Solution Sage
Solution Sage

Well one consideration is that direct lake isn't supported when using GUIDs so that may affect your choice.  I haven't had issues with using rownumber, mostly because I don't do any parallel loading when inserting into dimensions. And with fact tables I am doing parallel loading but they are just looking up existing keys in dim tables.

 

 

Thanks for the reply, @AndyDDC. Regarding OneLake not being supported, didn't know that! Can you point me to a doc listing that limitation?

 

Regarding ROW_NUMBER() and parallel load issues, here's a fake example of our parallel loading situation. It's actually even more streamlined than our actual use case will likely end up being.

 

You have 15 clients sending data for multiple file types with different column structures but similar data:

  1. Type A: Needs to insert into three tables (Customer, Account, Order) and there are key dependencies (e.g., Account -> Customer, Order -> Account).
  2. Type B: Needs to insert into three tables (Customer, Account, Payments) and there are key dependencies (e.g., Account -> Customer, Payments-> Account).

For key assignments in Fabric, you'd need to insert into your Customer table, assign a CustomerKey, then insert into your Account table and match to a CustomerKey, assign an AccountKey, and then finish with the insert into Orders or Payments with a match to an AccountKey and assign an OrderKey/PaymentKey.

 

To achieve this in Fabric with ROW_NUMBER(), you're limited to a few options, none of which are that great:

 

OptionSummaryComment
1. Client Sequential

Run each client's files (Types A-B) through before starting next client.

Totally inefficient and makes each client feed AND file type dependent on other clients'.
2. Type SequentialRun each client's Type A files through one-by-one and then start Type B.Similarly inefficient for each client as in #1, but this at least "checks in" Type A data before starting Type B load. 
3. Client Interspersed

Run all clients' Type A files through, pausing until keys can be assigned, before running Type B.

Requires "pausing" clients' jobs at each key asssignment (CustomerKey, AccountKey, OrderKey/PaymentKey) to make sure no concurrent key assignments take place.

 

A process that you'd normally be able to run fully in parallel now has to get stretched out by several minutes if not hours. Of the choices, #3 is the "best", but that requires a hefty amount of interdependent pipeline logic and planning.

Hi @arpost I've been rolling this around my head for a few days, I think with ROW_NUMBER it won't be possible (as you've found) to fully parallelise loading.  I did think about a control table where a load could insert a "start" and "end" value, where start is the last end value inserted + 1.  Then "end" for that partcular batch would be the rowcount that would be inserted into the destination table.  Then ROW_NUMBER could be used with start.  Problem is you'd never know if 2 loads hit the table at the same time and grabbed the same "end" number.

 

I've been testing GUIDs and a NEWID value converted to varchar takes up 36 bytes, rather than int which takes up 4 bytes (2 in my screenshot as i'm using a smallint).  If a SELECT NEWID (newsequentialid not supported) is inserted into a varchar column, then direct lake would work.  In terms of space, this obviously takes up more space but the overall footprint would depend on how unique the dimensions are.  E.G I added a NEWID to a dimension table with 400 rows in and it increated the underlying parquet file size of the dimension from 11KB to 25KB (hardly anything to write home about).  I then tested with a fact table (1.1 billion rows) and added the newid value from the dimension table, it took the overall fact table (which is just a bunch of INTs) from 16MB to 24MB.  Again nothing to massive.

 

AndyDDC_0-1715088377042.png

 

So I guess it's going to be a case of seeing whether using NEWID in a varchar column is going to have much of a size and performance impact with your data.  INTs will always be faster, but perhaps in your case it may be neglible performance.

 

I can't see IDENTITY/SEQUENCE on the roadmap for Fabric Warehouses either, but it's a requested feature so I can see it being available at some point in 2024/25.

 

What's new and planned for Synapse Data Warehouse in Microsoft Fabric - Microsoft Fabric | Microsoft...

Hi @arpost 

 

We haven’t heard from you on the last response and was just checking back to see if you have a resolution yet. In case if you have any resolution please do share that same with the community as it can be helpful to others.
Otherwise, will respond back with the more details and we will try to help.

 

Thanks.

Hi @arpost 

 

We haven’t heard from you on the last response and was just checking back to see if you have a resolution yet. In case if you have any resolution please do share that same with the community as it can be helpful to others .
If you have any question relating to the current thread, please do let us know and we will try out best to help you.
In case if you have any other question on a different issue, we request you to open a new thread.

 

Thanks.

@arpost with regards to GUIDS and Direct Lake: "Binary and Guid semantic types are also unsupported. You must convert these data types into strings or other supported data types"

 

https://learn.microsoft.com/en-us/fabric/get-started/direct-lake-overview#known-issues-and-limitatio...

Helpful resources

Announcements
RTI Forums Carousel3

New forum boards available in Real-Time Intelligence.

Ask questions in Eventhouse and KQL, Eventstream, and Reflex.

Expanding the Synapse Forums

New forum boards available in Synapse

Ask questions in Data Engineering, Data Science, Data Warehouse and General Discussion.

MayFabricCarousel

Fabric Monthly Update - May 2024

Check out the May 2024 Fabric update to learn about new features.

LearnSurvey

Fabric certifications survey

Certification feedback opportunity for the community.