Hey All,
(I am new to this forum, so if I break any etiquette rules please let me know and I will do better in the future. I have separate questions combined in this post, spaced out with asterisks so the distinctions are clearer. Any help with any of these questions would be greatly appreciated.)
In my company we have a "lab space" where we can build our mini data warehouse. It was built before, and independently of, any need for Power BI dashboards. Now that we are building dashboards we can use some existing tables in the lab space, and can build other tables as needed. However, since this is a shared lab space, it won't work to have lots of small dimension tables bloating it. The dashboard will not be a live connection and will be refreshed either daily or weekly. So here's what I am thinking; let me know what you think:
PART I - I can create/store FACT tables in the lab space, but for dimension (DIM) tables I can generate the table through the query when choosing the Source in Power BI, storing the query code for reference/posterity instead of bloating the lab space with smaller tables. Does this make sense, given that these tables are small and/or simple to generate?
PART II - Piggy-backing on the above, could I also do similarly with the FACT table, storing the query instead of saving the table in the lab space? I guess what it boils down to is that I don't know whether there is a genuine performance benefit one way or the other (I am not choosing DirectQuery). My FACT table is 2+ million rows and won't grow substantially over time. The potential elephant in the room here is having too many tables in the shared lab space, so I want to balance when a stored table is beneficial and when I can get away with just storing the query for the specific PBI dashboard.
************
This dashboard will have separate pages for Medicaid and Medicare, since the former has many more detailed fields than the latter, and as such I have separate FACT tables for each. However, there are some lab space tables where Caid/Care are combined, since they share the same fields (e.g. Latest Address). So in the case of addresses I have one table feeding two FACT tables. What's the appropriate way to handle this?
I am thinking I create 2 separate tables in PBI, where the underlying connecting queries filter that one Address lab space table for Caid or Care as applicable. Does that make sense, or should I go about it a different way?
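To make that concrete, here is roughly what I mean by the two connecting queries - the table and column names below are simplified placeholders, not our actual schema:

```sql
-- Medicaid-side address table: filter the shared lab space table
SELECT MemberID, AddressLine1, City, State, Zip
FROM LabSpace.LatestAddress
WHERE LineOfBusiness = 'Medicaid';

-- The Medicare-side query would be identical except for
-- WHERE LineOfBusiness = 'Medicare'
```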
************
About data modeling in general: not all of the foreign keys in my FACT table are numeric, and I understand the principled storage/performance benefit of numeric unique identifiers. What I am not sure of is whether it is necessary to code numeric unique identifiers for the sake of this model. I have 2 separate real-world examples:
Example 1: Generating the Race column requires code that, when evaluated, derives a person's identified race. Instead of the output being the actual race, do I give it a numeric value that I would then use as the foreign key when joining to the DIM table for Race? Or is the performance hit minimal enough to keep the word, and just have a one-column DIM table for Race containing the race names? I don't mind the extra work, but I also don't want to create unnecessary work either.
Example 2: The unique identifier for a person's plan is non-numeric. It doesn't seem intuitive to me to create a replacement numeric unique identifier, but I don't want to assume and so I am asking here for those with more experience in these circumstances.
Again, this is something where I can see a potential bad habit forming and/or I can be overthinking it, and thus could use some guidance on what would be appropriate when.
Apologies for the length, but I thought it would be better to ask all of my current questions in one post than to flood with multiple posts. Thanks in advance!
You don't need to create either the fact tables or dimension tables in the lab space, you can just run the queries in Power Query as part of the import process. Depending on what your lab space is, you might consider creating views of the queries, perhaps in a separate schema if you're trying to avoid clutter. Views don't take up any additional space unlike creating new tables.
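For illustration, a view in a separate schema might look something like this - the schema, table, and column names here are hypothetical, and the exact syntax depends on what your lab space actually runs on:

```sql
-- Keep dashboard objects in their own schema so the shared
-- lab space is not cluttered with extra physical tables.
CREATE SCHEMA pbi;
GO

-- A view stores only its definition, not the data itself.
CREATE VIEW pbi.DimRace AS
SELECT DISTINCT RaceCode, RaceName   -- illustrative source columns
FROM dbo.MemberDemographics;
GO
```

Power BI would then import from pbi.DimRace just like any other table.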
The one thing I would say is to do as much as you can in the lab space rather than doing it in Power Query. For example, don't use Power Query to pull out unique values from the fact table to create a dimension table, as that would put unnecessary load on the import process and you would be wasting compute resource within the Power BI Service. Instead write a query to get the dimension direct from the lab space.
On your foreign keys question it depends on the length of the keys. For the race example I would say that it is fine to use the text as the field for relationship between fact and dimension. From a storage point of view, all the unique values of the text field are only stored once in the dictionary along with an integer identifier ( generated internally by Power BI ) and it is this integer that is stored for every row of the fact table. So the difference between using text or using a code that you generate will be minimal.
For the plan's unique identifier, it depends on a couple of things. How long is the identifier, and what do you need it for? If you only want to be able to mark each row in the fact table with a unique key then I would generate something within the query, perhaps using ROW_NUMBER in SQL. If you actually need to display the unique plan identifier so that users can look it up in another report or system then you have no choice but to include it.
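As a sketch of the ROW_NUMBER approach - assuming a hypothetical source table and columns, not your actual schema:

```sql
-- Generate an integer surrogate key per plan at the source,
-- so the long text identifier need not be stored on every fact row.
SELECT
    ROW_NUMBER() OVER (ORDER BY PlanCode) AS PlanKey,  -- integer key
    PlanCode,   -- original text identifier, kept for display if needed
    PlanName
FROM ( SELECT DISTINCT PlanCode, PlanName FROM dbo.Enrollment ) AS p;
```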
What you could do is import the data with the existing unique identifier and then use DAX Studio to view the metrics and see how large the column for the unique ID is, and how much of the whole model size it represents. If you consider that it's too big, then replace it with an integer identifier and see how much that takes up.
Genuine apologies for the late reply! I got a notification that this was marked as "spam" and so I thought it was permanently removed. Ironically, I thought "Fabric Community Administrator" in my work inbox was spam until this morning!
Anyway, thank you for the in-depth reply. For the most part I have already implemented what you were getting at: most of my Power BI tables are based on a query via the import process, and virtually all of the queries consist of just the Source step, except where I have to adjust the formatting of some columns because of how Power Query interpreted them. I only have one new lab space table, and that's the one where I am creating numeric unique identifiers, as mentioned above.
On your foreign keys question it depends on the length of the keys. For the race example I would say that it is fine to use the text as the field for relationship between fact and dimension. From a storage point of view, all the unique values of the text field are only stored once in the dictionary along with an integer identifier ( generated internally by Power BI ) and it is this integer that is stored for every row of the fact table. So the difference between using text or using a code that you generate will be minimal.
Do you mind elaborating on the second half of this? The length of these keys is usually within 20 characters. So at what point is it worth taking the time to create numeric unique identifiers where text unique identifiers exist (given that the list of unique identifiers is relatively small)?
For the plan's unique identifier, it depends on a couple of things. How long is the identifier, and what do you need it for? If you only want to be able to mark each row in the fact table with a unique key then I would generate something within the query, perhaps using ROW_NUMBER in SQL. If you actually need to display the unique plan identifier so that users can look it up in another report or system then you have no choice but to include it.
The length is up to 30 characters, though most are within 10, and the list comprises 10 - 20 values, depending on the plan type. For Medicaid I do need to display it, as these "codes" are fairly visible to and understood by front-end users. For Medicare not so much, but that could change. So this ties back to the prior question of the true value-add (though I don't mind doing it if it is worth it).
What you could do is import the data with the existing unique identifier and then use DAX Studio to view the metrics and see how large the column for the unique ID is, and how much of the whole model size it represents. If you consider that it's too big, then replace it with an integer identifier and see how much that takes up.
I will try this, but I haven't used DAX Studio before. When you discuss model size representation, how would I interpret what would be considered an "acceptable" outcome vs. one that is not?
To clarify about the data dictionary, when you have a column of text values Power BI automatically creates a table containing the unique values from the text column along with an integer index which identifies that particular value. This is exactly what you would have to do in your database if you wanted to use numeric keys instead of text strings, but Power BI does it automatically for you. The only difference is that if you did the translation manually then the text strings would not take up space in your Power BI model, they would only exist in the database.
For example, if you had 1 million rows and each row had a unique identifier of length 30 characters then if you loaded that into Power BI it would need approx 30Mb just to store the text in the data dictionary, plus however much capacity it needed for the integer indexes. By doing the translation yourself in the database you would not need to store that 30Mb in your model, which would save a little space and a little time during data refresh, but really 30Mb is not that big and is not worth worrying about by itself.
From what you've said, although the identifiers are relatively long there are not too many unique values, so the space taken up will be pretty insignificant - not worth your time to translate them manually into integer representations when Power BI will do that for you automatically. Also, since it could be beneficial to display the IDs, at least for Medicaid, because users can derive information from them, I think it makes sense to include them in the model, given that the extra storage required will be fairly minimal.
As for what would be considered "acceptable", there are unfortunately no hard and fast rules; it's more a general feel for what makes sense. If a large proportion of the space taken up by the model is used by text columns which are only used for establishing relationships, and never displayed to the user, then I would try to replace those with integer keys to reduce the size of the model. But I would balance that against the complexity of maintaining the mapping between text key and integer index, and ask questions like: can I maintain this myself, or would I need to involve another team? How much maintenance would it require - once the process is in place, will it maintain itself or will I need to do manual work every month? At the end of the day it is essentially a cost/benefit calculation. If the model performs adequately in both data refresh and query execution time, then it's probably not worth the time to do the optimisations.
Great feedback and thank you as I now know what more is going on under the hood to help make future decisions. I have a final question (for now) that is slightly different but still within the umbrella of modeling. I'll ask it here though if I should create a new topic for this I will do so.
As in a typical model, my FACT table is on the "many" side and my DIM tables are on the "one" side. But I also have a Conditions table, where the unique members in my FACT table are now on the one side and the Conditions table is on the many side (as a member can have more than one condition). In order for my Conditions selection to also filter the visualizations driven by the DIM tables, I need to change the cross-filter direction to "Both", which I know is a general performance hit (though my visualizations are still performing fine). Is this the nature of the beast, given how the datasets are related, or should I do something else to still get the filtering I want in the visualizations while reducing the performance strain?
I would strongly recommend against using bi-directional filtering. Not only can it degrade performance but it can also lead to unintended and unpredictable results.
I would recommend that you create a Conditions dimension table, containing all the unique conditions you need to handle.
There are several ways of linking that to the model. One possibility would be to denormalise your existing fact and condition tables into 1, so that each member would be replicated for each condition they have. This would obviously increase the amount of data stored, but if the dataset is not too big that might not be an issue. One thing to bear in mind though would be that you could no longer use COUNTROWS to get the number of members, you would have to use a DISTINCTCOUNT on the member ID. That would be much slower.
Another possibility would be to link the Condition dimension to the Condition fact table, and use columns from the Conditions dimension in all your filters and visuals. However for this approach to work, either the measures you use must depend on values in the condition fact table, or you could create a calculation group and apply it to pages or visuals as required, something like
Condition modifier =
IF (
    ISFILTERED ( 'Condition Dimension' ),
    CALCULATE (
        SELECTEDMEASURE (),
        KEEPFILTERS (
            TREATAS ( VALUES ( 'Condition Fact'[Member ID] ), 'Member Fact'[Member ID] )
        )
    ),
    SELECTEDMEASURE ()
)
I think the calculation group option makes the most sense with the data model I have. I will explore that, thanks.
Hi @CenTeam ,
Thank you for reaching out to the Microsoft Community Forum.
Hi @johnt75 , Thank you for your prompt response.
Hi @CenTeam , could you please try the proposed solution shared by @johnt75? I hope this information helps. Please do let us know if you have any further queries.
Regards,
Dinesh
Hi @CenTeam ,
We haven't heard from you since the last response and were just checking back to see if you have a resolution yet. If you have any further queries, do let us know.
Regards,
Dinesh