Jeanxyz
Impactful Individual

Why use a surrogate key?

I have learned how to create a surrogate key when importing data into a warehouse using a SQL query, but coming from a Power BI background, I don't see the benefit of creating a surrogate key, since one cannot use it in search or relationship building.

I have read a few articles about surrogate keys, but none of them persuades me that a surrogate key is critical. It seems to me the surrogate key is only meaningful when it comes to timestamp information. Won't it cause problems that the surrogate key changes each time the data gets reloaded into the warehouse? Can someone give me a few examples of when a surrogate key is necessary?

8 REPLIES
Jeanxyz
Impactful Individual

@tayloramy  Thanks for sharing the use case. I'm starting to get the gist. So the SK here makes sure each user is mapped to one SK. That means one can use the SK to build clean logic when properties like UPN and PUID can change over time.

 

However, I'm still a bit confused. According to MS documentation, each dimension table must have a unique identifier column, but the SK in dim_user is not a unique identifier, as two rows can be mapped to the same SK due to a UPN change. The real dimension table is the UUID-SK mapping, but since the UUID is unique for each user, why not simply use the UUID as the primary key?

 

One last question: do you use SKs when joining tables or building relationships? It seems to me each SK represents one unique user, hence it can be used to build relationships.

 

Hi @Jeanxyz

 

The SKs are unique. When a UPN changes, the new UPN is mapped to the SK of the old UPN using the user's Azure UUID and the mapping table.
So the SK represents the account as a whole: one SK per account, and an account will never have two SKs regardless of UPN changes.

 

Because the SKs are primary keys on the tables, I do use them to join the tables - that is their intended purpose. I then usually hide the SKs from users in Power BI reports as they have no need to see this system-generated number.

 

Depending on the dataset, I sometimes make a 64-bit hash of the SK to use as a join key for performance reasons, but it's the same concept.
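
As a rough illustration of that idea (the table and column names here are made up, not my real model), a 64-bit hash key can be derived in T-SQL like this:

-- Illustrative only: derive a compact 64-bit join key from a wider key value.
-- HASHBYTES with SHA2_256 returns 32 bytes; the first 8 bytes are converted to BIGINT.
SELECT
  u.UserSK,
  CONVERT(BIGINT, SUBSTRING(HASHBYTES('SHA2_256', CONVERT(NVARCHAR(50), u.UserSK)), 1, 8)) AS UserJoinKey
FROM dbo.dim_user AS u;  -- hypothetical table and column names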

 

If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution. 

v-lgarikapat
Community Support

Hi @Jeanxyz ,

Thanks for reaching out to the Microsoft fabric community forum.

@tayloramy 

Thanks for your prompt response

@Jeanxyz 

I wanted to follow up and confirm whether you’ve had the opportunity to review the information  provided by @tayloramy . If you have any questions or need further clarification, please don’t hesitate to reach out.

 

 

We appreciate your engagement and thank you for being an active part of the community.

Best regards,
Lakshmi.

tayloramy
Resident Rockstar

Hi @Jeanxyz

 

Why surrogate keys matter (even if you build in Power BI)

Surrogate keys (SKs) are warehouse-generated IDs that identify each row version of a dimension. They're not for end-user search. They exist to make your star schema stable, performant, and historically correct. Microsoft's guidance and Kimball's dimensional modeling both recommend them for analytic models. Microsoft: Star schema & surrogate keys, Kimball Group: Surrogate keys.

What surrogate keys give you

  1. History (SCD Type 2): When a customer's attributes change (segment, region, name), you insert a new dimension row with a new SK and date ranges, while the old row is retained. Facts for old dates point to the old SK; new facts point to the new SK. This is the canonical pattern and requires SKs. Kimball: SCD Type 2 requires a new surrogate key per change, Microsoft Fabric: SCD Type 2.
  2. Stable joins from facts: Fact tables store foreign keys to dimension SKs, not business keys, so queries always resolve to the correct historical version. Microsoft's warehouse loading guidance explicitly shows looking up the dimension SK when processing facts. Microsoft: Process fact tables with SK lookups.
  3. Insulation from business key changes and multi-source integration: Business keys (emails, product codes, GUIDs) can change, collide across systems, or be composite. SKs decouple the warehouse from those issues and let you conform dimensions across sources. Microsoft: Dimension tables & surrogate keys.
  4. Performance & storage: Narrow integer keys reduce index size and can speed joins compared to long or composite natural keys. Kimball Group: performance & storage rationale.

"But I can't use SKs for search or relationships in Power BI."

End users shouldn't search on SKs; you hide them. But your relationships should still use SKs when you have SCD Type 2 or conformed dimensions. That ensures each fact connects to the correct historical row. Microsoft's star-schema guidance assumes SK-based dimensional modeling even if the report only shows business attributes. Microsoft: Star schema guidance.

"Won't SKs change every reload?"

They must not. If SKs are changing on reload, that's a loading anti-pattern (e.g., truncating and re-inserting the dimension). The fix is to upsert with a business key (BK) lookup so existing members keep their SK, and only SCD-relevant changes create a new SK (and close the old one's date range). Microsoft: Fact processing with SK lookups, Kimball: SCD2 mechanics.

Minimal example 

-- Dimension (note the surrogate key and the business key)
CREATE TABLE dbo.DimCustomer (
  CustomerSK     INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
  CustomerBK     NVARCHAR(50) NOT NULL,     -- natural/business key from source
  Name           NVARCHAR(200) NOT NULL,
  Segment        NVARCHAR(50)  NOT NULL,
  EffectiveStart DATETIME2(3)  NOT NULL,
  EffectiveEnd   DATETIME2(3)  NOT NULL,
  IsCurrent      BIT           NOT NULL
);
CREATE UNIQUE INDEX UX_DimCustomer_Current
  ON dbo.DimCustomer(CustomerBK, IsCurrent)
  WHERE IsCurrent = 1;

-- Stage holds latest snapshot from source system
-- 1) Close out rows that changed (SCD2 detect)
UPDATE d
SET EffectiveEnd = SYSUTCDATETIME(), IsCurrent = 0
FROM dbo.DimCustomer d
JOIN dbo.StgCustomer s
  ON s.CustomerBK = d.CustomerBK
WHERE d.IsCurrent = 1
  AND (d.Name    <> s.Name OR d.Segment <> s.Segment);

-- 2) Insert new current versions (new SKs) for new or changed BKs
INSERT dbo.DimCustomer (CustomerBK, Name, Segment, EffectiveStart, EffectiveEnd, IsCurrent)
SELECT s.CustomerBK, s.Name, s.Segment, SYSUTCDATETIME(), '9999-12-31', 1
FROM dbo.StgCustomer s
LEFT JOIN dbo.DimCustomer d
  ON d.CustomerBK = s.CustomerBK AND d.IsCurrent = 1
WHERE d.CustomerBK IS NULL
   OR d.Name    <> s.Name
   OR d.Segment <> s.Segment;

-- Fact load: look up the SK of the dimension row that was in effect on the fact date
INSERT dbo.FactSales (CustomerSK, SaleDate, Amount)
SELECT d.CustomerSK, f.SaleDate, f.Amount
FROM dbo.StgSales f
JOIN dbo.DimCustomer d
  ON d.CustomerBK = f.CustomerBK
 AND f.SaleDate >= d.EffectiveStart
 AND f.SaleDate <  d.EffectiveEnd;

When you might skip SKs

If your dimension is tiny, static, single-sourced, and you don't need history (no SCD), you can get by with a durable natural key. But once you need history, conformance across systems, or better join characteristics, surrogate keys are the safer default. Microsoft: Star schema guidance, Microsoft: Dimensional modeling overview.

 

If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution.

Jeanxyz
Impactful Individual

Thanks for replying. That's a lot of information to digest. I am listing my thoughts on some of your comments below:

  • History (SCD Type 2): When a customer's attributes change (segment, region, name), you insert a new dimension row with a new SK and date ranges, while the old row is retained. Facts for old dates point to the old SK; new facts point to the new SK. This is the canonical pattern and requires SKs.
  • Jean: from my experience, I would insert start_date and end_date into each record. Creating an SK or IsCurrent field only tells which record is the latest, but that may not be enough. For example, to calculate employee labor cost per day, I need to know the employee's effective salary on different dates (coming from the Employee Salary dimension table). I'm not interested in which record is the latest one in this case.
  • "Won't SKs change every reload?"

    They must not.

    Jean: what if the data source is not stable? When I reload data from SAP to SQL Server, there is often a small difference in the number of rows imported. The difference is usually very small, e.g., my first load imported 3 million records, but when I run the same query again there are 3,000,001 rows. In theory, this should not happen because we know SAP doesn't allow users to delete or add journal entries once the fiscal month is closed, but in reality, this is what happened. As a result, all the surrogate keys change.

Hi @Jeanxyz

 

These get into some pretty advanced concepts with data warehousing and medallion architecture.

Databricks has a wonderful free eBook that explains these concepts in depth: Big Book of Data Warehousing and BI | Databricks

 

To directly respond to your comments: 

1: the purpose of the SK is to uniquely identify records - it may not be the only column to consider when doing full analytics. The way I would model that scenario is with some mapping tables. I would build a table that maps the SK from the employee dimension to the salary or payroll dimension, and then I would be able to get all salary entries for a single employee. I would still avoid using a composite key, as that leads to complicated joins downstream, and if you plan on reporting on the data, Power BI does not support composite-key relationships.
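
As a rough sketch of that idea (all table and column names below are just made up for illustration):

-- Hypothetical bridge table linking employee SKs to salary/payroll SKs
CREATE TABLE dbo.MapEmployeeSalary (
  EmployeeSK INT NOT NULL,  -- SK from the employee dimension
  SalarySK   INT NOT NULL,  -- SK from the salary/payroll dimension
  CONSTRAINT PK_MapEmployeeSalary PRIMARY KEY (EmployeeSK, SalarySK)
);

-- All salary entries for a single employee, with no composite-key joins needed
DECLARE @EmployeeSK INT = 42;
SELECT sal.*
FROM dbo.DimSalary sal
JOIN dbo.MapEmployeeSalary map ON map.SalarySK = sal.SalarySK
WHERE map.EmployeeSK = @EmployeeSK;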

 

2: this can be environment dependent, but best practice is not to overwrite SKs entirely. The goal is to be able to perform incremental updates downstream, instead of full table reloads every time.
The way I handle this is with more mapping tables - I build a map from the business key to the surrogate key, and store that map. It is updated every time the table loads, and it assigns the same SK to a record when the record is reloaded. If a record's business key doesn't have an SK yet, a new SK is generated and the mapping table is updated.
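
A minimal sketch of that pattern, reusing the staging names from the earlier example (the map table itself is the new, hypothetical piece):

-- Persistent business-key to surrogate-key map; the IDENTITY column hands out new SKs
CREATE TABLE dbo.MapCustomerKey (
  CustomerSK INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
  CustomerBK NVARCHAR(50) NOT NULL UNIQUE
);

-- Each load: create map rows only for business keys that don't have an SK yet,
-- so a business key keeps the same SK across reloads
INSERT dbo.MapCustomerKey (CustomerBK)
SELECT DISTINCT s.CustomerBK
FROM dbo.StgCustomer s
WHERE NOT EXISTS (SELECT 1 FROM dbo.MapCustomerKey m WHERE m.CustomerBK = s.CustomerBK);

-- Rebuild or update the dimension using the stable SK from the map
SELECT m.CustomerSK, s.CustomerBK, s.Name, s.Segment
FROM dbo.StgCustomer s
JOIN dbo.MapCustomerKey m ON m.CustomerBK = s.CustomerBK;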

 

If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution.

Jeanxyz
Impactful Individual

Hi @tayloramy , thanks for the explanation. However, I have been trying my best but still don't see why it is necessary to create a surrogate key in a dimension table using SCD Type 2. The truth is I have worked as a Power BI report developer and SQL Server admin for four years and have created dozens of reports without using any surrogate key.

I'm going to give you one example as below:

  • Business need: to find out the total compensation cost per employee per day
  • two tables available: 
    • employee-annual compensation table using SCD Type 2
      Name  | Total annual compensation | Effective from | Effective to
      Alice | 50000                     | 2024-01-01     | 2024-12-31
      Alice | 60000                     | 2025-01-01     | 2025-12-31
    • employee list
      Name
      Alice
      John
  • solution: I created an employee-Date table in Power BI (see below). I then fill in the 'Effective annual compensation' of the employee on that specific day using a measure in Power BI. The measure basically goes through each row in the solution table, records the employee name and Date Key, then goes to the employee-compensation table and filters it by employee name = Alice && 'effective from' <= the Date Key of the current row && 'effective to' >= the Date Key of the current row. Once I have the 'Effective annual compensation' value, I can easily calculate daily compensation (roughly the same lookup is sketched in SQL just below).
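
In SQL terms, the lookup behind that measure is roughly the following (table and column names are just illustrative):

-- Illustrative only: effective compensation per employee per day via a date-range join
SELECT
  ed.Name,
  ed.DateKey,
  c.TotalAnnualCompensation          AS EffectiveAnnualCompensation,
  c.TotalAnnualCompensation / 365.0  AS EffectiveDailyCompensation
FROM dbo.EmployeeDate ed             -- hypothetical employee-Date table
JOIN dbo.EmployeeCompensation c      -- hypothetical SCD Type 2 compensation table
  ON c.Name = ed.Name
 AND ed.DateKey BETWEEN c.EffectiveFrom AND c.EffectiveTo;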

As you can see, I didn't create any surrogate key here, and I don't think a surrogate key can solve the problem here. The only benefit I can think of for creating a surrogate key is that it helps with query planning in Power BI, but even that is questionable, since I have been able to create large data models without using surrogate keys.


  • Name  | Date Key   | Effective annual compensation | Effective daily compensation
    Alice | 2024-01-01 | 50000                         | 137
    Alice | 2024-01-02 | 50000                         | 137

Hi @Jeanxyz

 

As I mentioned, these get into some pretty advanced data engineering concepts. If you're able to work without them and you're happy, then continue being happy.

 

Here's a different example for where I'm using SKs: 

I am building a dim_user table for my org, and I'm collecting various things: a user's Azure UUID, UPN, legacy PowerShell PUID, and name.

 

One of the main use cases for this table is to use it with Fabric audit data from the REST APIs, which identify users by UPN or PUID, not UUID.

 

To add a layer of complexity, PUID is not exposed to Microsoft customers anymore.

 

My process is to first extract users from Azure using the Graph API, which gives me a list of UUIDs, UPNs, and names.

I then comb through the audit data to find records where PUID is populated as well as UPN so I can link in PUIDs - this doesn't give a complete list of PUIDs, but it is a workable list for active users in Fabric.

 

In my org, UPNs can change, which creates a challenge, as the Fabric audit data uses the UPN as the key to identify who caused the action.

 

I also don't want to have to fully reload my fact tables that depend on this dim_user table every night.

 

To achieve this, when I build my dim_user table, I generate SKs on the fly and save them to an SK mapping table that maps Azure UUIDs to the generated SKs.

 

Every night when I rebuild my dim_user table, I use that mapping table to assign users the same SK every run so as to keep my fact table valid. This ensures that if a UPN changes, that user's record gets updated with the new UPN, but it maps to the same SK as their old UPN, and the audit data still points to the correct user.

 

I am then able to incrementally update my fact tables without worrying that a changed UPN will make my audit data link to invalid users.
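
A rough sketch of that nightly process (all table and column names here are simplified stand-ins for my real ones):

-- Hypothetical UUID-to-SK map; once a UUID is assigned an SK it keeps it forever
CREATE TABLE dbo.MapUserKey (
  UserSK    INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
  AzureUUID UNIQUEIDENTIFIER  NOT NULL UNIQUE
);

-- Assign SKs only to UUIDs that don't have one yet
INSERT dbo.MapUserKey (AzureUUID)
SELECT s.AzureUUID
FROM dbo.StgUser s
WHERE NOT EXISTS (SELECT 1 FROM dbo.MapUserKey m WHERE m.AzureUUID = s.AzureUUID);

-- Rebuild dim_user; a changed UPN just updates that user's row, the SK stays the same,
-- so fact tables keyed on UserSK remain valid and can be updated incrementally
TRUNCATE TABLE dbo.dim_user;
INSERT dbo.dim_user (UserSK, AzureUUID, UPN, PUID, Name)
SELECT m.UserSK, s.AzureUUID, s.UPN, s.PUID, s.Name
FROM dbo.StgUser s
JOIN dbo.MapUserKey m ON m.AzureUUID = s.AzureUUID;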

 

I hope this helps, 

 

If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution.

 
