I have a question regarding the design of a star schema for a school in MS Fabric. Unlike typical business scenarios such as internet and retail sales, the assessment data requires many measurable columns. I've already created the following dimension tables: DimStudent, DimTime, DimSubject, DimClass, DimStaff. However, I am uncertain about the best approach to building fact tables.
Source data set: the grades are mapped as A=5, B=4, C=3, D=2, F=1 for analysis.
Approach 1: Single fact table
DateKey, StudentKey, SubjectKey, ClassKey, CalculatedMark, CalculatedGrade, ModeratedMark, ModeratedGrade, Assessment1 to Assessment15, Profile1 to Profile15
This would result in a fact table with 38 or more columns.
Approach 2: Multiple Fact Tables
FactMark: DateKey, StudentKey, SubjectKey, ClassKey, CalculatedMark, CalculatedGrade, ModeratedMark, ModeratedGrade
FactAssessment: DateKey, StudentKey, SubjectKey, ClassKey, Assessment1 to Assessment15
FactProfile: DateKey, StudentKey, SubjectKey, ClassKey, Profile1 to Profile15
Approach 3: Multiple Fact Tables with Associated Dimension Tables
FactMark: DateKey, StudentKey, SubjectKey, ClassKey, CalculatedMark, CalculatedGrade, ModeratedMark, ModeratedGrade
FactAssessment: DateKey, StudentKey, SubjectKey, ClassKey, AssessmentTypeKey, AssessmentValue
FactProfile: DateKey, StudentKey, SubjectKey, ClassKey, ProfileTypeKey, ProfileValue
DimAssessmentType: TypeKey, TypeDescription
DimProfileType: TypeKey, ProfileDescription
Approach 4: Single Fact Table with Associated Dimension Table
FactSchoolAcademicResult: DateKey, StudentKey, SubjectKey, ClassKey, ResultType, ResultValue
DimAcademicResultType: TypeKey, TypeDescription (This includes all result types such as marks, assessments, and profiles.)
Which approach would be the best among these four, or do you have a better suggestion for my design?
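For concreteness, here is a rough T-SQL sketch of what the Approach 4 tables could look like in a Fabric Warehouse, using the column names listed above; the data types and the key comments are assumptions, not part of the original design.

CREATE TABLE DimAcademicResultType (
    TypeKey         INT          NOT NULL,  -- surrogate key for the result type
    TypeDescription VARCHAR(100) NOT NULL   -- e.g. 'Calculated Mark', 'Assessment 1', 'Profile 1'
);

CREATE TABLE FactSchoolAcademicResult (
    DateKey     INT NOT NULL,  -- references DimTime
    StudentKey  INT NOT NULL,  -- references DimStudent
    SubjectKey  INT NOT NULL,  -- references DimSubject
    ClassKey    INT NOT NULL,  -- references DimClass
    ResultType  INT NOT NULL,  -- references DimAcademicResultType.TypeKey
    ResultValue INT NULL       -- numeric result; see the data-type discussion later in the thread
);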
Hi @os_ca,
Looking at Approaches 1 and 2, it seems to me that the number of assessment and profile columns could change in the future. Whenever an assessment or profile is added or removed, you would need to redesign your fact table, which is not a good design in my opinion. Furthermore, if you want to calculate an average over the profiles or assessments in, for example, Power BI, you would have to reference every column individually in your measure.
Approaches 3 and 4 mitigate these problems well, since new profiles and assessments can simply be added as new rows. When choosing between Approaches 3 and 4, we can look at the granularity of the rows. The granularity of a mark, an assessment and a profile is the same: one academic result per date, per student, per subject, per class. Since the granularity is identical, I would choose the single fact table you have set up in Approach 4.
Kind regards,
Fabian
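To make the averaging point above concrete, here is a minimal T-SQL sketch assuming the tables discussed in this thread; the wide-table name and the 'Assessment%' description pattern are assumptions. With Approach 1 every assessment column must be written out, while Approach 4 only needs one aggregate plus a filter.

-- Approach 1: every column must be listed explicitly in the calculation
SELECT StudentKey,
       (Assessment1 + Assessment2 + Assessment3 /* ... + Assessment15 */) / 15.0 AS AvgAssessment
FROM FactAcademicResultWide;  -- hypothetical name for the single 38-column fact table

-- Approach 4: one aggregate plus a filter; new assessment types need no schema change
SELECT f.StudentKey, AVG(1.0 * f.ResultValue) AS AvgAssessment
FROM FactSchoolAcademicResult AS f
JOIN DimAcademicResultType AS t ON t.TypeKey = f.ResultType
WHERE t.TypeDescription LIKE 'Assessment%'
GROUP BY f.StudentKey;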
I'd be a supporter of the 4th option. You keep it simple with a single fact table, but keep it flexible by not having a bunch of columns whose number could change (as @FabianSchut stated).
If you then need to pivot the fact rows into columns for use in Power BI, that would be very simple to do, e.g. a view loaded into Power BI that pivots each student's row data into column data.
The caveat is that this fact table design may affect the use of a Direct Lake connection if you have to do any row/column pivoting to shape the fact data for the semantic model; e.g. the view I mentioned above for pivoting would break Direct Lake.
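A minimal sketch of such a pivoting view, assuming the Approach 4 tables and placeholder type descriptions; as noted above, a view like this would be served through DirectQuery or Import rather than Direct Lake.

CREATE VIEW vStudentResultsPivoted AS
SELECT
    f.DateKey,
    f.StudentKey,
    f.SubjectKey,
    f.ClassKey,
    -- one column per result type; the descriptions used here are placeholders
    MAX(CASE WHEN t.TypeDescription = 'Calculated Mark' THEN f.ResultValue END) AS CalculatedMark,
    MAX(CASE WHEN t.TypeDescription = 'Moderated Mark'  THEN f.ResultValue END) AS ModeratedMark,
    MAX(CASE WHEN t.TypeDescription = 'Assessment 1'    THEN f.ResultValue END) AS Assessment1
    -- ...repeat for the remaining assessment and profile types
FROM FactSchoolAcademicResult AS f
JOIN DimAcademicResultType AS t ON t.TypeKey = f.ResultType
GROUP BY f.DateKey, f.StudentKey, f.SubjectKey, f.ClassKey;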
I agree with the previous speakers: option 4 seems to be the most scalable, and I think it will also work well for analysis in Power BI.
You can probably use a matrix visual with TypeDescription in the columns field, or put TypeDescription in the legend field of other visual types (e.g. line charts), to mitigate (or potentially avoid) the need for pivoting. Alternatively, you could create separate DAX measures for specific types, using a CALCULATE filter in each measure to select that type. This should also work with Direct Lake if you need it. However, Import mode could prove to be the better option; with this kind of data, you can probably tolerate some latency. Ultimately, it depends on your specific needs. https://www.sqlbi.com/blog/marco/2024/04/06/direct-lake-vs-import-mode-in-power-bi/
Would you need a TypeCategory column in your DimAcademicResultType table? E.g. the following columns: ResultTypeKey, ResultTypeDescription, ResultTypeCategory (where the ResultTypeCategory would contain values like 'Mark', 'Assessment', 'Profile'). This could be useful for filtering or grouping the result types in a hierarchical manner.
However (potential reasons to choose Option 3):
I'm not familiar with the education system in your country.
You could argue for creating three Fact tables: Fact_Marks, Fact_Assessments and Fact_Profiles.
Facts are events.
If these are the three "main event types" you want to analyze, and especially if they have quite different natural characteristics, i.e. different dimensions could be relevant for each category, or you would analyze each event type in its own unique way, then I think you could consider having three fact tables. This would make it easier to logically separate the fact types, and it could also improve performance in some use cases.
I think a key aspect is to consider how different these fact types are from each other (different fact types -> consider separate tables) or how related or similar they are (related or similar facts -> consider having them in the same fact table).
I don't know how related or similar a "mark event", an "assessment event" and a "profile event" are in your school system. If they are separate concepts, and especially if each has some unique dimensions that are not relevant to the other two event types, I would consider separate fact tables with one or more "tailor-made" associated dimension tables per fact table, i.e. option 3. But if your associated dimension tables would only contain a key column and a description column, so that they have identical schemas for every fact type, then I think option 4 makes most sense.
Will all the results be integers, or will some results be text? If the data type of the 'mark' results is integer while the 'profile' or 'assessment' results are strings, that would also be a reason to keep them in separate tables.
All results are stored as strings, but some should only contain numbers. For example, Calculated Mark might be stored as "79" and later converted to the number 79.
Will the result value column in the star schema be of integer type (for all result types)?
What I'm getting at is that a common table works fine, but only if all the facts in it share the same schema and the associated dimension tables share the same schema as well.
Because marks are integers, the result column that stores marks in the star schema should be of integer type; Power BI should not have to do that type conversion at read time, as that would slow it down.
So the assessment and profile result values also need to be stored as integers in the star schema if marks, assessments and profiles all live in the same fact table.
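A minimal sketch of how that conversion could be handled when loading the fact table, assuming a hypothetical staging table stg_AcademicResults that still holds the string values; TRY_CAST returns NULL for non-numeric text, so bad rows can be inspected instead of failing the load.

INSERT INTO FactSchoolAcademicResult (DateKey, StudentKey, SubjectKey, ClassKey, ResultType, ResultValue)
SELECT
    s.DateKey,
    s.StudentKey,
    s.SubjectKey,
    s.ClassKey,
    s.ResultTypeKey,
    TRY_CAST(s.ResultValueText AS INT)  -- '79' becomes 79; non-numeric text becomes NULL
FROM stg_AcademicResults AS s;          -- hypothetical staging table with the raw string results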
Thanks for all your recommendations. The source data is already in a pivoted format, and the number of columns (around 38) will stay the same. The issue is that some columns are used and others are not, depending on the year group, term and subject. Also, the profile and assessment columns have their own labels; for example, Profile 1 could be 'Complete homework on time' for Year 7 English but 'Team work' for Year 12 English. Given this, I have to choose between Approaches 3 and 4. I have also tried to minimize additional work, such as creating a pivot view for reporting. All three academic categories or types seem to be related to each other; for example, if a student finishes homework on time most of the time, they could have higher scores. This means Approach 4 sounds best to me. I appreciate your help and time.
Tbh, when I was thinking about "related" facts, I was mainly thinking about whether the marks, assessments and profiles were different "milestones" (measures) of the same event (i.e. an accumulating snapshot fact table, https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimen...). In that case I would keep them in the same table (an accumulating snapshot table). However, I am guessing that is not the case here. Using an accumulating snapshot fact table would also mean a "hard-coded" table schema, which limits the flexibility (evolution) of the types of measured facts. Instead, I am guessing one (or more) transaction fact tables is a good solution in your case. This aligns with option 4 (or 3).
In your case, I would just consider whether these three main fact types require different dimension columns for analysis, whether additional fact columns would be relevant for only some of the fact types, and whether their result columns contain similar types of values.
E.g. if the results for marks are integer values but results from assessments and/or profiles would contain qualitative results (text), I would probably separate the fact tables.
Another hypothetical example: if assessments were often group work, you might want to include a column with "number of group participants" or "average grade of group participants". In this case, the schema of the assessment fact would be different than the schema of the mark and profile facts. If so, I would probably go with option 3 as it allows different table schemas for each main fact type (marks, assessments, profiles).
With option 3, you get more flexibility to adjust the schema of each individual main fact type later on, if needed. With option 4, any schema change applies to the entire common fact table, which could be inefficient if the adjustment is really only needed for one of the fact types (e.g. assessments).
However if the measures and related dimensions of all three main fact types fit into a uniform schema, e.g. you just need a single numerical measure on all fact events within all fact types, then option 4 is likely the option which will be easiest to implement and maintain.
Option 4 makes it easier to use a single slicer to hand-pick which result types to include in analysis across all fact types. It also makes it more convenient to compare the scores of different result types in a single visual (by using the result type as categories in the legend or columns field e.g. in matrix visual or line charts). Including a ResultTypeCategory attribute would make it even easier for example to make hierarchical slicer selections, etc.
Perhaps report performance will be faster with option 3. If a report page is only concerned with profiles, option 3 lets you query a table that contains only profile facts, which will presumably be faster than querying a common fact table that also includes the (in this case irrelevant) mark and assessment records. The performance impact might not be significant enough to be a decisive factor, though.
As Fabian mentions, the granularity level for a mark, assessment and profile stays the same.
Fact mark:
student ID, subject, class, type, mark
1, English, Eng01, CalculatedMark, 70
Fact assessment:
student ID, subject, class, type, mark
1, English, Eng01, Final Mark, 75
Fact profile:
student ID, subject, class, type, mark
1, English, Eng01, "complete homework", A (most of the time)
I think values like 'A (most of the time)' would need to be converted to an integer.
This is when the common table (option 4) begins to struggle a bit and adaptations or compromises need to be made.
It can be worked around (e.g. by conversion to integer) or the facts can be separated into three main fact tables, for instance.
I think I would convert it to a number anyway (regardless of whether you use a common fact table or separate fact tables), as numerical values are the best option for statistical and trend analysis.
However, if you're just going to display the marks without doing calculations like averages, they could in theory remain strings. But I'm guessing you will want to calculate averages and trends, and then they should be numerical.
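Since the question already defines the mapping A=5, B=4, C=3, D=2, F=1, that conversion could be applied during the load. A rough T-SQL sketch, with hypothetical staging column names:

SELECT
    s.StudentKey,
    s.ResultTypeKey,
    CASE LEFT(s.ResultValueText, 1)              -- 'A (most of the time)' -> 'A'
        WHEN 'A' THEN 5
        WHEN 'B' THEN 4
        WHEN 'C' THEN 3
        WHEN 'D' THEN 2
        WHEN 'F' THEN 1
        ELSE TRY_CAST(s.ResultValueText AS INT)  -- plain numeric strings such as '79'
    END AS ResultValue
FROM stg_AcademicResults AS s;  -- hypothetical staging table with the raw string results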
If the purpose of the star schema is to visualize this data in e.g. Power BI, you could already start building some visualizations and report pages in Power BI to verify how the data model works for the end purpose.
If you want, you could even make some dummy data just for testing the table schemas directly in Power BI, before creating the tables in the Warehouse.
Hi @os_ca ,
Did the above suggestions help with your scenario? If so, please consider giving a Kudo to, or accepting, the helpful suggestions to help others who face similar requirements.
If they don't help, please share more detailed information and a description so we can clarify and test your scenario.
How to Get Your Question Answered Quickly
Regards,
Xiaoxin Sheng