Hi!
I am loading a huge dataset and trying to optimize its in-memory size.
Total row count: 310 million rows, sorted by gtin_prod_key and pbi_site_key; see details below.
I have noticed that my heaviest columns are roughly twice as large when I load the full dataset with the fact columns as when I exclude them; see the screenshots.
All other variables (the data itself, cardinality, total number of rows, encoding) are the same, except that I add a Select Columns step to exclude the fact columns in one of the queries.
Moreover, I thought encoding was the most important factor in column size, but I tried both hash and value encoding, and the size does not seem to change.
So I have 3 questions:
1. How can a column's size be affected by other columns, given that all other variables are the same?
2. Why does my time_period_end_date column take up so much memory despite its low cardinality, even when I convert it to int?
3. Why does encoding not significantly affect column sizes?
[Screenshot: with fact columns, value encoding]
[Screenshot: with fact columns, hash encoding]
[Screenshot: without fact columns, value encoding]
[Screenshot: without fact columns, hash encoding]
[Screenshot: sample data]
Hi @slava_aptown ,
Thank you for reaching out to the Microsoft Fabric Community. When you load only the key columns, VertiPaq can compress them very efficiently. Adding fact columns changes how the engine organizes rows within its internal segments, because it chooses a row order that balances compression across all columns; this weakens the run-length compression of the key columns, so they use more memory even though their values stay the same.
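To make the effect concrete, here is a minimal Python sketch (an illustration of the run-length principle VertiPaq relies on, not the engine itself): the same values produce far fewer runs once they are sorted, which is why adjacent repeats compress so well.

```python
from itertools import groupby
import random

# Illustration only: counts the runs of consecutive equal values,
# which is roughly what run-length encoding has to store.
def run_count(column):
    return sum(1 for _ in groupby(column))

values = [random.choice(["A", "B", "C"]) for _ in range(1_000_000)]

print("unsorted runs:", run_count(values))          # on the order of 600k+
print("sorted runs:  ", run_count(sorted(values)))  # exactly 3
```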
To optimize size, you can:
1. Sort the data so that repeating values are grouped together.
2. Reduce or round numeric fact columns since they affect compression the most.
3. Use a surrogate DateID instead of raw date integers.
4. Remove columns not needed for reporting.
These are common ways to manage model size for large fact tables; a minimal sketch of steps 1, 2, and 4 follows below.
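If you can shape the data before it reaches Power BI, a pandas sketch of those steps could look like the following. The key names gtin_prod_key and pbi_site_key come from this thread; the file names, the fact column sales_amt, and the dropped column unused_audit_col are hypothetical examples.

```python
import pandas as pd

# Hypothetical source file; sales_amt and unused_audit_col are
# made-up example columns, the two keys are from the thread.
df = pd.read_parquet("facts.parquet")

# Step 1: sort so repeating key values sit next to each other (helps RLE).
df = df.sort_values(["gtin_prod_key", "pbi_site_key"])

# Step 2: round numeric fact columns to shrink their dictionaries.
df["sales_amt"] = df["sales_amt"].round(2)

# Step 4: drop columns that are never used in reporting.
df = df.drop(columns=["unused_audit_col"])

df.to_parquet("facts_optimized.parquet")
```

Step 3, the surrogate DateID, is sketched further down in the thread.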
Regards,
Yugandhar.
Thanks for the explanation.
What do you mean by "Use a surrogate DateID instead of raw date integers"?
What should this surrogate DateID look like? Do you mean just an enumeration of all the dates I have, from 1 to some integer?
Hi @slava_aptown ,
To put it simply, when I refer to using a surrogate DateID, I mean assigning a small numeric key to each unique date rather than storing the full date value in your large fact table. In other words, you number your distinct dates from 1 to N and keep this list in a separate Date table. The fact table then only stores the compact DateID.
For example:
| DateID | Actual Date Value (date serial) |
|---|---|
| 1 | 45292 |
| 2 | 45293 |
| 3 | 45294 |
VertiPaq is able to compress small, densely packed numbers much more effectively than a large integer range like 45,000–46,000. With a dataset of 300 million rows, this can significantly impact memory usage.
You’ll still need to join to the Date table to access actual dates or calendar details, so the functionality remains the same; it’s simply a more efficient use of storage.
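As a sketch of how you might build this mapping outside the model (pandas; time_period_end_date comes from the thread, the other names are just examples):

```python
import pandas as pd

# Tiny stand-in for the 310M-row fact table; values are raw date serials.
fact = pd.DataFrame({"time_period_end_date": [45292, 45293, 45292, 45294]})

# factorize() assigns dense codes 0..N-1 to the distinct dates; +1 gives 1..N.
codes, uniques = pd.factorize(fact["time_period_end_date"], sort=True)
fact["date_id"] = codes + 1

# The Date dimension keeps the mapping back to the real date values.
dim_date = pd.DataFrame({
    "date_id": range(1, len(uniques) + 1),
    "time_period_end_date": uniques,
})

# The fact table now stores only the compact key.
fact = fact.drop(columns=["time_period_end_date"])
```

In the model, date_id would then relate the fact table to the Date table.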
Thanks for your response.