gp10
Advocate III

High Cardinality solution by SQLBI not working

Hi beloved community,

I'm trying to implement @marcorusso's solution on a column with high cardinality, but I'm not getting the expected result.

The solution is described in this great article: https://www.sqlbi.com/articles/optimizing-high-cardinality-columns-in-vertipaq/

 

Context:

  • I'm connecting to an Amazon Redshift db with the Redshift connector of Power BI.
  • The table I want to query is huge, with billions of rows.
  • One particular integer column (not a primary key) has around 100M distinct values, like an index, starting from 1.
  • If I create a dataset with that column in its original format, the dataset gets huge and that column accounts for around 90% of the total size.
  • The resulting table is partitioned for incremental refresh.

So, I have created a view in the db where I split that column using the method described in the article. The result is exactly what it should be: the column is split into two new columns perfectly, and I can now create a dataset without any transformation steps in Power Query.
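
For illustration, a minimal sketch of that kind of view (table and column names are hypothetical; the divisor 10000 keeps each half at roughly 10K distinct values for ~100M ids):

-- Illustration only: fact_events and my_id are hypothetical names.
-- Split one high-cardinality integer into two low-cardinality parts;
-- the original value can be recovered as my_id_high * 10000 + my_id_low.
CREATE OR REPLACE VIEW v_fact_events_split AS
SELECT
    CAST(FLOOR(my_id / 10000.0) AS INT) AS my_id_high,
    MOD(my_id, 10000)                   AS my_id_low,
    event_date,
    amount
FROM fact_events;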

 

But... when I publish the dataset and refresh it, the dataset size remains pretty much the same.


Does anyone have any idea what could be causing this?

Thanks.

1 ACCEPTED SOLUTION
marcorusso
Most Valuable Professional

Unfortunately, there is no such thing as a free lunch.

What we describe in the article works as long as the data is small and the dictionary size is big.

When you reduce the dictionary, you also increase the data size (this depends on the combination of different values in the split columns - it's a problem of statistical distribution, but to keep it simple: the more rows, the larger the number of combinations, and the lower the compression).

Now, if you have a relatively large dictionary with a relatively small compressed data size, the trick works and you save RAM.

The bigger the table, the larger the data size - at a certain point, the saving in the dictionary no longer offsets the increase in data size. There is a point where the optimization actually worsens the situation in data size as well.

It seems you are past that tipping point, meaning you should return to the single column.

The good news is that if you publish the dataset to the Power BI service with the large dataset storage format, a column is loaded into memory only when it is used, so you don't pay the full memory price unless someone actually uses that column (hopefully, nobody does).


3 REPLIES
marcorusso
Most Valuable Professional

You might try DirectQuery over SQL Server, but it's going to be more expensive.
The DISTINCTCOUNT algorithm of SQL Server scales better on multiple cores, so if you have good hardware, that calculation could be faster. However, everything else could be slower. You might consider a composite model, but then we enter an area where it really depends on requirements and tradeoffs.
DirectQuery over SQL Server also enables the DAX APPROXIMATEDISTINCTCOUNT function, which is way faster.
If an approximate calculation could work for you (I doubt it would be unacceptable; at that volume it usually makes sense to accept a small percentage of approximation), you might also consider this technique (which works in VertiPaq) by Phil Seamark: DAX : Approx Distinct Count - Phil Seamark on DAX

Phil also talks about aggregations with DISTINCTCOUNT (it solves performance, not memory): Speed up Distinct Count calculations in Power BI using Aggregations - Phil Seamark on DAX
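
To make the approximate option above concrete, here is a minimal DAX sketch with hypothetical table and column names; APPROXIMATEDISTINCTCOUNT only works in DirectQuery models over sources that support it, where it translates to the source's approximate-distinct-count function:

-- Hypothetical names (FactEvents, my_id); assumes a DirectQuery model
-- over a source that supports APPROXIMATEDISTINCTCOUNT.
Approx Distinct Ids :=
APPROXIMATEDISTINCTCOUNT ( FactEvents[my_id] )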


Hi @marcorusso, thank you for your immediate and detailed response.

 

Unfortunately, in my case the distinct count of that column is my main desired metric, so it needs to be used constantly, and that results in visuals not loading and the capacity usage limits being stretched and breached.

 

Also, the dataset size is hitting the capacity's hard limits.

 

Thanks again for all your work and support.
