Hi beloved community,
I'm trying to implement @marcorusso 's solution on a column with high cardinality but I'm not getting the expected result.
The solution is described in this great article: https://www.sqlbi.com/articles/optimizing-high-cardinality-columns-in-vertipaq/
Context:
So, I have created a view in the database where I split that column using the method described in the article. The result is exactly what it should be: the column is split into 2 new columns perfectly. I can now create a dataset without any transformation steps in Power Query.
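For context, the split in the view is conceptually like this (a Python sketch, assuming an integer key and a divisor of 1000 - the actual view uses SQL as in the article, and the article also covers string columns):

```python
# Hypothetical sketch of the column-split technique: one high-cardinality
# column becomes two lower-cardinality columns that round-trip losslessly.
SPLIT = 1000  # divisor; chosen so the two parts have balanced cardinality

def split_value(v):
    """Split an integer into a 'high' and a 'low' part."""
    return v // SPLIT, v % SPLIT

def recombine(high, low):
    """Rebuild the original value from the two parts."""
    return high * SPLIT + low

values = [123_456_789, 987_654_321, 42]
parts = [split_value(v) for v in values]

# The round trip is lossless, so the original key is still recoverable.
assert [recombine(h, l) for h, l in parts] == values
```

The two part-columns each have far fewer distinct values than the original, which is what shrinks the dictionaries.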
But... when I publish the dataset and refresh it, the dataset size remains pretty much the same.
Does anyone have any idea on possible reasons for this?
Thanks.
Unfortunately, there is no such thing as a free lunch.
What we describe in the article works as long as the data is small and the dictionary size is big.
When you reduce the dictionary, you also increase the data size. This depends on the combinations of distinct values in the split columns - it's a problem of statistical distribution, but to keep it simple: the more rows, the larger the number of combinations, and the lower the compression.
Now, if you have a relatively large dictionary with a relatively small compressed data size, the trick works, and you save RAM.
The bigger the table, the larger the data size - at a certain point, the saving in the dictionary no longer offsets the increase in data size. Past that point, the optimization actually worsens the situation, even in data size.
It seems you are past that tipping point, meaning you should return to the single column.
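To make the tradeoff concrete, here is a simplified Python model (hash encoding only, ignoring RLE and dictionary string sizes, so it understates what VertiPaq does): splitting shrinks the dictionaries, but it can never shrink the per-row bit cost, because the two part-dictionaries must jointly address at least as many combinations as the original column had values.

```python
import math
import random

random.seed(1)
# Simulated high-cardinality column: 100k rows drawn from 1M possible values.
rows = [random.randrange(1_000_000) for _ in range(100_000)]

def bits(n):
    """Bits needed per row to index a dictionary with n entries."""
    return math.ceil(math.log2(n))

single = len(set(rows))                  # dictionary size of the original column
high = len({v // 1000 for v in rows})    # dictionary sizes of the two
low = len({v % 1000 for v in rows})      # split columns

# The split shrinks the combined dictionary dramatically...
assert high + low < single
# ...but the per-row (data) cost does not go down: the parts must jointly
# encode at least `single` distinct combinations.
assert bits(high) + bits(low) >= bits(single)
```

With a small table the dictionary saving dominates; as rows grow, the data side dominates and the trick stops paying off, which matches what Marco describes above.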
The good news is that if you publish the database on the Power BI service with the large dataset storage format, a column is loaded into memory only when it is used, so you don't pay the full memory price unless someone actually queries it (hopefully, nobody does).
You might try DirectQuery over SQL Server, but it's going to be more expensive.
The DISTINCTCOUNT algorithm of SQL Server scales better on multiple cores, so if you have good hardware, that calculation could be faster. However, everything else could be slower. You might consider a composite model, but there we enter an area where it really depends on requirements and tradeoffs.
SQL Server also enables APPROXIMATEDISTINCTCOUNT, which is way faster.
If an approximate calculation could work for you (at that volume, accepting a small percentage of approximation usually makes sense), you might also consider this technique (which works in VertiPaq) by Phil Seamark: DAX : Approx Distinct Count - Phil Seamark on DAX
Phil also talks about aggregations with DISTINCTCOUNT (which solves performance, not memory): Speed up Distinct Count calculations in Power BI using Aggregations - Phil Seamark on DAX
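Purely to illustrate the accuracy/cost tradeoff behind approximate distinct counts (this is a generic linear-counting sketch in Python, not Phil's DAX technique), the idea is to hash every value into a small bitmap and estimate cardinality from how full the bitmap gets:

```python
import hashlib
import math

def approx_distinct(values, m=1 << 16):
    """Linear-counting estimate of the number of distinct values.

    Memory is fixed at m slots regardless of how many rows are scanned.
    """
    bitmap = bytearray(m)                 # m hash slots, all initially empty
    for v in values:
        h = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "big")
        bitmap[h % m] = 1                 # mark the slot this value hashes to
    empty = m - sum(bitmap)               # slots no value landed in
    return round(-m * math.log(empty / m))

# 100k rows containing exactly 5,000 distinct values.
data = [i % 5_000 for i in range(100_000)]
estimate = approx_distinct(data)
```

The estimate is typically within a fraction of a percent of 5,000 here, using a fixed 64 KB of state; the tradeoff is that the result is no longer exact, which is exactly the compromise an approximate DISTINCTCOUNT asks you to accept.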
Hi @marcorusso , thank you for your immediate and detailed response.
Unfortunately, in my case the distinct count of that column is my main desired metric, so it needs to be computed constantly, which results in visuals not loading and the capacity usage limits being stretched and breached.
Also, the dataset size is hitting the capacity's hard limits.
Thanks again for all your work and support.