Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Next up in the FabCon + SQLCon recap series: The roadmap for Microsoft SQL and Maximizing Developer experiences in Fabric. All sessions are available on-demand after the live show. Register now

Reply
slava_aptown
Frequent Visitor

[VertiPaq] Columns take twice more space when loaded with other columns

Hi!

I am loading huge dataset and trying to optimize in-memory size of it.

Total row count: 310 mln rows, sorted by gtin_prod_key and pbi_site_key, see details below.

 

I have noticed that my heaviest columns are twice heavier when I load full data set with fact columns compared with when I exclude fact columns, see the screenshots.

All other variables: data itself, cardinality, total number of rows, encoding are the same except I add Select Column statement to exclude columns in one of the queries.

 

Moreover, I though encoding was the most important in column size, but I tried both hash and value, and the size does not seem to change.

 

So I have 2 questions:

1. How columns size can be affected by other columns given all other variables are the same?

2. Why my time_period_end_date columns takes so much with such low cardinality even if I convert it to int?

3. Why encoding does not significantly affect column sizes?

 

With fact columns with value encoding

slava_aptown_3-1766056698215.png

With fact columns with hash encoding

slava_aptown_4-1766056755398.png

 

Without fact columns with value encoding

slava_aptown_0-1766053880691.png

Without fact columns with has encoding

slava_aptown_2-1766055320783.png

 

 Sample data

  • time_period_end_date: int with min 45292, max 46006
  • gtin_prod_key: 8-digit int with negative and positive values in a wide range (hash)
  • pbi_site_key: int with min 376922, max 1085673
  • distribution_id: int with min 1, max 8702225

slava_aptown_1-1766054284125.png

 

1 ACCEPTED SOLUTION

Hi @slava_aptown ,

To put it simply, when I refer to using a surrogate DateID, I mean assigning a small numeric key to each unique date rather than storing the full date value in your large fact table. In other words, you number your distinct dates from 1 to N and keep this list in a separate Date table. The fact table then only stores the compact DateID.

 

For example

DateID Actual Date Value
1 45292
2 45293
3 45294

 

VertiPaq is able to compress small, densely packed numbers much more effectively than a large integer range like 45,000–46,000. With a dataset of 300 million rows, this can significantly impact memory usage.

You’ll still need to join to the Date table to access actual dates or calendar details, so the functionality remains the same it’s simply a more efficient use of storage.

 

Thanks for your response.

View solution in original post

5 REPLIES 5
V-yubandi-msft
Community Support
Community Support

Hi @slava_aptown ,

If you haven’t had a chance to review my response yet, please do so and let me know if you need any additional details.

 

Thank you.

V-yubandi-msft
Community Support
Community Support

Hi @slava_aptown ,

Everything should be clear now. If you need any additional information, please let me know.


Thank you.

V-yubandi-msft
Community Support
Community Support

Hi @slava_aptown ,

Thank you for reaching out to the Microsoft Fabric Community. When you load only the key columns, VertiPaq compresses them efficiently. Adding fact columns causes the engine to create new internal segments, which reduces compression and means the key columns use more memory, even if their values stay the same.

To optimize size, you can

1. Sort the data so that repeating values are grouped together.

2. Reduce or round numeric fact columns since they affect compression the most.

3. Use a surrogate DateID instead of raw date integers.

4. Remove columns not needed for reporting.

 

These are common ways to manage model size for large fact tables. 

 

Regards,

Yugandhar.

 

 

 

 

Thanks for explanation.

What do you mean by "Use a surrogate DateID instead of raw date integers."?

How should this surrogate DateID look like? Do you mean just enumeration of all dates I have from 1 to some int?

Hi @slava_aptown ,

To put it simply, when I refer to using a surrogate DateID, I mean assigning a small numeric key to each unique date rather than storing the full date value in your large fact table. In other words, you number your distinct dates from 1 to N and keep this list in a separate Date table. The fact table then only stores the compact DateID.

 

For example

DateID Actual Date Value
1 45292
2 45293
3 45294

 

VertiPaq is able to compress small, densely packed numbers much more effectively than a large integer range like 45,000–46,000. With a dataset of 300 million rows, this can significantly impact memory usage.

You’ll still need to join to the Date table to access actual dates or calendar details, so the functionality remains the same it’s simply a more efficient use of storage.

 

Thanks for your response.

Helpful resources

Announcements
New to Fabric survey Carousel

New to Fabric Survey

If you have recently started exploring Fabric, we'd love to hear how it's going. Your feedback can help with product improvements.

Power BI DataViz World Championships carousel

Power BI DataViz World Championships - June 2026

A new Power BI DataViz World Championship is coming this June! Don't miss out on submitting your entry.

Join our Fabric User Panel

Join our Fabric User Panel

Share feedback directly with Fabric product managers, participate in targeted research studies and influence the Fabric roadmap.

March Power BI Update Carousel

Power BI Community Update - March 2026

Check out the March 2026 Power BI update to learn about new features.