gp10
Advocate III

High Cardinality solution by SQLBI not working

Hi beloved community,

I'm trying to implement @marcorusso's solution on a column with high cardinality, but I'm not getting the expected result.

The solution is described in this great article: https://www.sqlbi.com/articles/optimizing-high-cardinality-columns-in-vertipaq/

 

Context:

  • I'm connecting to an Amazon Redshift db with the Redshift connector of Power BI.
  • The table I want to query is a huge one with billions of rows.
  • One particular integer column (not a primary key) has around 100M distinct values, like an index, starting from 1.
  • If I create a dataset with that column in its original format, the dataset gets huge and that column takes around 90% of the total size.
  • The result table is being partitioned for incremental refresh.

So, I have created a view in the db where I split that column with the method described in the article. The result is exactly what it should be, and the column is split into 2 new columns perfectly. I can now create a dataset without any transformation steps in Power Query.
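For reference, the distinct count of the original values can be rebuilt from the two split columns with a measure along these lines (just a sketch - the table and column names below are placeholders, not my actual ones):

Distinct Values :=
// Counts the distinct combinations of the two split columns,
// which equals the distinct count of the original column.
COUNTROWS (
    SUMMARIZE (
        FactTable,              // placeholder table name
        FactTable[ValueHigh],   // high part of the split column
        FactTable[ValueLow]     // low part of the split column
    )
)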

 

But... when I publish the dataset and refresh it, the dataset size remains pretty much the same.


Does anyone have any idea on possible reasons for this?

Thanks.

1 ACCEPTED SOLUTION

Unfortunately, there is no such thing as a free lunch.

What we describe in the article works as long as the data size is small and the dictionary size is big.

When you reduce the dictionary, you also increase the data size (this depends on the combination of different values in the split columns - it's a problem of statistical distribution, but to keep it simple, let's say that the more rows, the larger the number of combinations and the lower the compression).

Now, if you have a relatively large dictionary with a relatively small compressed data size, the trick works and you save RAM.

The bigger the table, the larger the data size - at a certain point, the saving in the dictionary no longer offsets the growth in data size. There is a point where the optimization actually worsens the situation in data size, too.

It seems you have reached that point, meaning you should go back to the single column.

The good news is that if you publish the dataset to the Power BI service with the large dataset storage format, each column is loaded into memory only when it is used, so you don't pay the full memory price unless someone actually queries that column (hopefully, nobody does).

3 REPLIES

You might try DirectQuery over SQL Server, but it's going to be more expensive.
The DISTINCTCOUNT algorithm of SQL Server scales better on multiple cores, so if you have good hardware, that calculation could be faster. However, everything else could be slower. You might consider a composite model, but there we enter an area where it really depends on requirements and tradeoffs.
SQL Server also enables the APPROXIMATEDISTINCTCOUNT function, which is way faster.
If an approximate calculation could work for you (I doubt it would be unacceptable - at that volume, a small percentage of approximation usually makes sense), you might also consider this technique (which works in VertiPaq) by Phil Seamark: DAX : Approx Distinct Count - Phil Seamark on DAX
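As a rough illustration (placeholder table and column names; as far as I know, APPROXIMATEDISTINCTCOUNT only works in DirectQuery mode against a SQL source that supports APPROX_COUNT_DISTINCT, such as SQL Server 2019+ or Azure SQL, to which it is translated), the measure would simply be:

Approx Distinct Values :=
// Approximate distinct count, pushed down to the DirectQuery source
APPROXIMATEDISTINCTCOUNT ( FactTable[HighCardinalityColumn] )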

Phil also talks about aggregations with DISTINCTCOUNT (it solves performance, not memory): Speed up Distinct Count calculations in Power BI using Aggregations - Phil Seamark on DAX

Hi @marcorusso, thank you for your immediate and detailed response.

 

Unfortunately, in my case the distinct count of that column is my main desired metric. It needs to be evaluated constantly, which results in visuals not loading and the capacity usage limits being stretched and breached.

 

Also, the dataset size is hitting the capacity's hard limits.

 

Thanks again for all your work and support.
