gp10
Advocate III

High Cardinality solution by SQLBI not working

Hi beloved community,

I'm trying to implement @marcorusso's solution on a column with high cardinality, but I'm not getting the expected result.

The solution is described in this great article: https://www.sqlbi.com/articles/optimizing-high-cardinality-columns-in-vertipaq/

 

Context:

  • I'm connecting to an Amazon Redshift db with the Redshift connector of Power BI.
  • The table I want to query is a huge one with billions of rows.
  • One particular integer column (not a primary key) has around 100M distinct values, like an index, starting from 1.
  • If I create a dataset with that column in its original format, the dataset gets huge and that column takes around 90% of the total size.
  • The result table is being partitioned for incremental refresh.

So, I have created a view in the db where I split that column with the method described in the article. The result is exactly what it should be, and the column is split into 2 new columns perfectly. I can now create a dataset without any transformation steps in Power Query.
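For reference, the distinct count of the original values can be rebuilt from the two split columns with a measure along these lines (just a sketch - the table and column names below are placeholders, not my actual ones):

Distinct Values :=
// Counts the distinct combinations of the two split columns,
// which equals the distinct count of the original column.
COUNTROWS (
    SUMMARIZE (
        FactTable,              // placeholder table name
        FactTable[ValueHigh],   // high part of the split column
        FactTable[ValueLow]     // low part of the split column
    )
)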

 

But... when I publish the dataset and refresh it, the dataset size remains pretty much the same.


Does anyone have any idea on possible reasons for this?

Thanks.

1 ACCEPTED SOLUTION

Unfortunately, there is no such thing as a free lunch.

What we describe in the article works as long as the data size is small and the dictionary size is big.

When you reduce the dictionary, you also increase the data size (this depends on the combination of different values in the split columns - it's a problem of statistical distribution, but to keep it simple, let's say that the more rows, the larger the number of combinations and the lower the compression).

Now, if you have a relatively large dictionary with a relatively small compressed data size, the trick works and you save RAM.

The bigger the table, the larger the data size - at a certain point, the saving in the dictionary no longer offsets the growth in data size. There is a point where the optimization actually worsens the situation in data size, too.

It seems you have reached that point, meaning you should go back to the single column.

The good news is that if you publish the dataset to the Power BI service with the large dataset storage format, each column is loaded into memory only when it is used, so you don't pay the full memory price unless someone actually queries that column (hopefully, nobody does).

3 REPLIES

You might try DirectQuery over SQL Server, but it's going to be more expensive.
The DISTINCTCOUNT algorithm of SQL Server scales better on multiple cores, so if you have good hardware, that calculation could be faster. However, everything else could be slower. You might consider a composite model, but there we enter an area where it really depends on requirements and tradeoffs.
SQL Server also enables the APPROXIMATEDISTINCTCOUNT function, which is way faster.
If an approximate calculation could work for you (I doubt it would be unacceptable - at that volume, a small percentage of approximation usually makes sense), you might also consider this technique (which works in VertiPaq) by Phil Seamark: DAX : Approx Distinct Count - Phil Seamark on DAX
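As a rough illustration (placeholder table and column names; as far as I know, APPROXIMATEDISTINCTCOUNT only works in DirectQuery mode against a SQL source that supports APPROX_COUNT_DISTINCT, such as SQL Server 2019+ or Azure SQL, to which it is translated), the measure would simply be:

Approx Distinct Values :=
// Approximate distinct count, pushed down to the DirectQuery source
APPROXIMATEDISTINCTCOUNT ( FactTable[HighCardinalityColumn] )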

Phil also talks about aggregations with DISTINCTCOUNT (it solves performance, not memory): Speed up Distinct Count calculations in Power BI using Aggregations - Phil Seamark on DAX

Hi @marcorusso, thank you for your immediate and detailed response.

 

Unfortunately, in my case the distinct count of that column is my main desired metric. It needs to be evaluated constantly, which results in visuals not loading and the capacity usage limits being stretched and breached.

 

Also, the dataset size is hitting the capacity's hard limits.

 

Thanks again for all your work and support.
