I have learned how to create a surrogate key when importing data into a warehouse using a SQL query, but coming from a Power BI background, I don't see the benefit of creating a surrogate key, since one cannot use it in search or relationship building.
I have read a few articles about surrogate keys, but none of them persuade me that a surrogate key is critical. It seems to me the surrogate key is only meaningful when it comes to timestamp information. Won't it cause problems that the surrogate key changes each time the data gets reloaded to the warehouse? Can someone give me a few examples of when a surrogate key is necessary?
Hi @Jeanxyz,
As I mentioned, these get into some pretty advanced data engineering concepts. If you're able to work without them and you're happy, then continue being happy.
Here's a different example for where I'm using SKs:
I am building a dim_user table for my org, and I'm collecting various things: a user's Azure UUID, UPN, legacy PowerShell PUID, and name.
One of the main use cases for this table is to join it with Fabric audit data from the REST APIs, which identify users by UPN or PUID, not UUID.
To add a layer of complexity, PUID is no longer exposed to Microsoft customers.
My process is to first extract users from Azure using the Graph API, which gives me a list of UUIDs, UPNs, and names.
I then comb through the audit data to find records where the PUID is populated as well as the UPN so I can link in PUIDs - this doesn't yield a complete list of PUIDs, but a workable one for active users in Fabric.
In my org, UPNs can change, which creates a challenge because the Fabric audit data uses the UPN as the key to identify who caused the action.
I also don't want to have to fully reload my fact tables that depend on this dim_user table every night.
To achieve this, when I build my dim_user table, I am creating SKs on the fly and saving them to an SK mapping table that maps Azure UUIDs to the generated SKs.
Every night when I rebuild my dim_user table, I use that mapping table to assign each user the same SK on every run, so as to keep my fact tables valid. This ensures that if a UPN changes, that user's record gets updated with the new UPN, but it maps to the same SK as their old UPN, and the audit data still points to the correct user.
I am then able to incrementally update my fact tables without a worry that a changed UPN will make my audit data link to invalid users.
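To make the mapping-table step concrete, here is a minimal T-SQL sketch of the idea. The object names (dbo.sk_map_user, dbo.stg_user, dbo.dim_user) are made up for illustration - they are not the actual objects from this pipeline - and the sketch assumes the Azure UUID is the durable identifier:

-- Hypothetical mapping table: Azure UUID -> generated surrogate key
CREATE TABLE dbo.sk_map_user
(
    UserSK    INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    AzureUUID UNIQUEIDENTIFIER  NOT NULL UNIQUE
);

-- Nightly load, step 1: any UUID not seen before gets a new SK;
-- UUIDs already in the map keep the SK assigned on an earlier run.
INSERT INTO dbo.sk_map_user (AzureUUID)
SELECT s.AzureUUID
FROM dbo.stg_user AS s
LEFT JOIN dbo.sk_map_user AS m ON m.AzureUUID = s.AzureUUID
WHERE m.AzureUUID IS NULL;

-- Nightly load, step 2: rebuild dim_user from staging, joining the map
-- so a user whose UPN changed still lands on the same SK, and the
-- fact/audit tables keyed on that SK stay valid without a full reload.
TRUNCATE TABLE dbo.dim_user;

INSERT INTO dbo.dim_user (UserSK, AzureUUID, UPN, PUID, DisplayName)
SELECT m.UserSK, s.AzureUUID, s.UPN, s.PUID, s.DisplayName
FROM dbo.stg_user AS s
JOIN dbo.sk_map_user AS m ON m.AzureUUID = s.AzureUUID;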
I hope this helps,
If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution
@tayloramy Thanks for sharing the use case. I'm starting to get the gist. So the SK here makes sure each user is mapped to one SK. That means one can use the SK to build clean logic even when properties like UPN and PUID change over time.
However, I'm still a bit confused. According to MS documentation, each dimension table must have a unique identifier column, but the SK in dim_user is not a unique identifier, as two rows can be mapped to the same SK due to a UPN change. The real dimension table is the UUID-SK mapping, but since the UUID is unique for each user, why not simply use the UUID as the primary key?
One last question: do you use the SK for joining tables or building relationships? It seems to me each SK represents one unique user, hence it can be used in building relationships.
Hi @Jeanxyz,
The SKs are unique. When a UPN changes, the new UPN is mapped to the SK of the old UPN using the user's Azure UUID and the mapping table.
So the SK represents the account as a whole: 1 SK per account, and an account will never have 2 SKs regardless of changes to their UPN.
Because the SKs are primary keys on the tables, I do use them to join the tables - that is their intended purpose. I then usually hide the SKs from users in Power BI reports, as they have no need to see this system-generated number.
Depending on the dataset, I sometimes make a 64-bit hash of the SK to use as a join key for performance reasons, but it's the same concept.
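The post doesn't show how that 64-bit hash is produced; one common T-SQL pattern - an assumption here, not necessarily what was actually done - applies when the dimension key is a wide string (for example a GUID or concatenated natural key) and an integer join column is wanted. dbo.SomeDimension and WideKey are hypothetical names, and WideKey is assumed to be NVARCHAR:

-- Hypothetical: hash a wide string key into a 64-bit integer join key.
-- HASHBYTES returns 32 bytes for SHA2_256; CAST to BIGINT keeps 8 of them.
SELECT d.WideKey,
       CAST(HASHBYTES('SHA2_256', d.WideKey) AS BIGINT) AS JoinKey64
FROM dbo.SomeDimension AS d;

Note that a truncated hash can, in principle, collide, so this is a performance trade-off rather than a guaranteed-unique key.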
If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution.
Hi @Jeanxyz ,
Thanks for reaching out to the Microsoft Fabric Community forum.
Thanks for your prompt response
I wanted to follow up and confirm whether you’ve had the opportunity to review the information provided by @tayloramy . If you have any questions or need further clarification, please don’t hesitate to reach out.
We appreciate your engagement and thank you for being an active part of the community.
Best regards,
Lakshmi.
Hi @Jeanxyz,
Why surrogate keys matter (even if you build in Power BI)
Surrogate keys (SKs) are warehouse-generated IDs that identify each row version of a dimension. They're not for end-user search. They exist to make your star schema stable, performant, and historically correct. Microsoft's guidance and Kimball's dimensional modeling both recommend them for analytic models. Microsoft: Star schema & surrogate keys, Kimball Group: Surrogate keys.
What surrogate keys give you
- Stable keys that don't change when the source is reloaded or a natural key (such as a UPN) changes.
- Support for slowly changing dimensions (SCD Type 2), so each fact can point at the correct historical row.
- One conformed key when the same dimension member arrives from multiple source systems.
- Compact integer join columns, which are cheaper to store in fact tables and to join than wide natural keys.
"But I can't use SKs for search or relationships in Power BI."
End users shouldn't search on SKs; you hide them. But your relationships should still use SKs when you have SCD Type 2 or conformed dimensions. That ensures each fact connects to the correct historical row. Microsoft's star-schema guidance assumes SK-based dimensional modeling even if the report only shows business attributes. Microsoft: Star schema guidance.
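As a small illustration (using the DimCustomer and FactSales tables from the minimal example further down in this post), a report-style query joined on the SK automatically picks up the attribute values that were in effect when each fact occurred:

-- Sales by the customer segment that applied on the sale date.
-- Each FactSales row stores the SK of the correct historical
-- DimCustomer version, so a plain equi-join is all that's needed.
SELECT d.Segment,
       SUM(f.Amount) AS TotalAmount
FROM dbo.FactSales   AS f
JOIN dbo.DimCustomer AS d ON d.CustomerSK = f.CustomerSK
GROUP BY d.Segment;

A join on the business key alone could not guarantee this once SCD2 history exists, because one business key matches several dimension rows.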
"Won't SKs change every reload?"
They must not. If SKs are changing on reload, that's a loading anti-pattern (e.g., truncating and re-inserting the dimension). The fix is to upsert with a business key (BK) lookup so existing members keep their SK, and only SCD-relevant changes create a new SK (and close the old one's date range). Microsoft: Fact processing with SK lookups, Kimball: SCD2 mechanics.
Minimal example
-- Dimension (note the surrogate key and the business key)
CREATE TABLE dbo.DimCustomer
(
    CustomerSK     INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerBK     NVARCHAR(50)      NOT NULL,   -- natural/business key from source
    Name           NVARCHAR(200)     NOT NULL,
    Segment        NVARCHAR(50)      NOT NULL,
    EffectiveStart DATETIME2(3)      NOT NULL,
    EffectiveEnd   DATETIME2(3)      NOT NULL,
    IsCurrent      BIT               NOT NULL
);

CREATE UNIQUE INDEX UX_DimCustomer_Current
    ON dbo.DimCustomer (CustomerBK, IsCurrent)
    WHERE IsCurrent = 1;

-- Stage holds the latest snapshot from the source system

-- 1) Close out rows that changed (SCD2 detect)
UPDATE d
SET EffectiveEnd = SYSUTCDATETIME(),
    IsCurrent    = 0
FROM dbo.DimCustomer d
JOIN dbo.StgCustomer s
    ON s.CustomerBK = d.CustomerBK
WHERE d.IsCurrent = 1
  AND (d.Name <> s.Name OR d.Segment <> s.Segment);

-- 2) Insert new current versions (new SKs) for new or changed BKs
--    (changed BKs no longer have a current row after step 1)
INSERT dbo.DimCustomer (CustomerBK, Name, Segment, EffectiveStart, EffectiveEnd, IsCurrent)
SELECT s.CustomerBK, s.Name, s.Segment, SYSUTCDATETIME(), '9999-12-31', 1
FROM dbo.StgCustomer s
LEFT JOIN dbo.DimCustomer d
    ON d.CustomerBK = s.CustomerBK
   AND d.IsCurrent  = 1
WHERE d.CustomerBK IS NULL;

-- Fact load: look up the SK of the dimension version that was in effect on the fact date
INSERT dbo.FactSales (CustomerSK, SaleDate, Amount)
SELECT d.CustomerSK, f.SaleDate, f.Amount
FROM dbo.StgSales f
JOIN dbo.DimCustomer d
    ON d.CustomerBK = f.CustomerBK
   AND f.SaleDate >= d.EffectiveStart
   AND f.SaleDate <  d.EffectiveEnd;
When you might skip SKs
If your dimension is tiny, static, single-sourced, and you don't need history (no SCD), you can get by with a durable natural key. But once you need history, conformance across systems, or better join characteristics, surrogate keys are the safer default. Microsoft: Star schema guidance, Microsoft: Dimensional modeling overview.
If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution.
Thanks for replying. That's a lot of information to digest. I am listing my thoughts on some of your comments below:
"Won't SKs change every reload?"
They must not.
Jea: what if the data source is not stable? When I reload data from SAP to SQL Server, there is often a small difference in the number of rows imported. The difference is usually very small; e.g., in my first load 3 million records were imported, but when I run the same query again, 3,000,001 rows are imported. In theory this should not happen, because we know SAP doesn't allow users to delete or add journal entries once the fiscal month is closed, but in reality this is what happens. As a result, all the surrogate keys change.
Hi @Jeanxyz,
These get into some pretty advanced concepts with data warehousing and medallion architecture.
Databricks has a wonderful free eBook that explains these concepts in depth: Big Book of Data Warehousing and BI | Databricks
To directly respond to your comments:
1: The purpose of the SK is to uniquely identify records - it may not be the only column to consider when doing full analytics. The way I would model that scenario is with some mapping tables: I would build a table that maps the SK from the employee dimension to the salary or payroll dimension, and then I would be able to get all salary entries for a single employee. I would still avoid using a composite key, as that leads to complicated joins downstream, and if you plan on reporting on the data, Power BI does not support composite-key relationships.
2: This can be environment dependent, but best practice is not to overwrite SKs entirely. The goal is to be able to perform incremental updates downstream instead of full table reloads every time.
The way I handle this is with more mapping tables - I build a map from the business key to the surrogate key and store that map. It is updated every time the table loads, and it assigns the same SK to a record when the record is reloaded. If there is a record whose SK doesn't exist yet, a new SK is generated and the mapping table is updated (see the sketch below).
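This is the same pattern as the dim_user sketch earlier in the thread, written here against a hypothetical SAP-style journal feed. dbo.sk_map_journal and dbo.stg_journal are illustrative names, not objects from the real pipeline:

-- Hypothetical map: source business key (e.g. an SAP document number)
-- -> a surrogate key that never changes across reloads.
CREATE TABLE dbo.sk_map_journal
(
    JournalSK INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    JournalBK NVARCHAR(50)      NOT NULL UNIQUE
);

-- Run on every load: only business keys never seen before get a new SK.
-- Keys that already exist keep the SK from an earlier run, so a small
-- drift in the number of rows coming from the source does not reshuffle
-- the surrogate keys already referenced by downstream tables.
INSERT INTO dbo.sk_map_journal (JournalBK)
SELECT s.JournalBK
FROM dbo.stg_journal AS s
LEFT JOIN dbo.sk_map_journal AS m ON m.JournalBK = s.JournalBK
WHERE m.JournalBK IS NULL;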
If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution.
Hi @tayloramy, thanks for the explanation. However, I have been trying my best and still don't see why it is necessary to create a surrogate key in a dimension table using SCD Type 2. The truth is I have worked as a Power BI reporter and SQL Server admin for four years, having created dozens of reports without using any surrogate key.
I'm going to give you one example as below:
Name | total annual compensation | effective from | effective to |
Alice | 50000 | 2024-01-01 | 2024-12-31 |
Alice | 60000 | 2025-01-01 | 2025-12-31 |
Name |
Alice |
John |
As you can see, I didn't create any surrogate key here, and I don't think a surrogate key would solve any problem here. The only benefit I can think of for creating a surrogate key is that it helps with query planning in Power BI, but even that is doubtful, since I have been able to create large data models without using surrogate keys.
Name | Date Key | Effective annual compensation | Effective daily compensation |
Alice | 2024-01-01 | 50000 | 137 |
Alice | 2024-01-02 | 50000 | 137 |