Starting December 3, join live sessions with database experts and the Microsoft product team to learn just how easy it is to get started
Learn moreGet certified in Microsoft Fabric—for free! For a limited time, get a free DP-600 exam voucher to use by the end of 2024. Register now
I was curious whether it would be quicker for a pc to process the following:
distinct ( union ( distinct ( table 1 ) , distinct ( table 2 ) ) )
than
distinct ( union ( table 1 , table 2 ) )
My thought was that perhaps when you distinct a union, it put the whole of table 1 and table 2 in memory, but when you distinct on a union between already 2 distinct tables, the tables in memory are much smaller. But this depends on that the separate distinct tables are not stored in full in its memory. Anyone knows?
Hi @DouweMeer ,
I don't quite understand the following sentence.
But this depends on that the separate distinct tables are not stored in full in its memory.
My understanding is as follows. Please let me know if there is anything wrong.
DAX query engine has two execution units, higher-level - the formula engine and lower-level - the storage engine (VertiPaq).
The formula engine can handle all the operations requested by DAX functions and solve complex DAX expressions.
The goal of the storage engine is to scan the VertiPaq storage and produce datacaches, which are then read by the formula engine.
Based on my understanding, all model data is compressed and stored in VertiPaq. The storage engine receives requests from the formula engine. Then a cache system stores the results produced by the storage engine, holding a limited number of results. When the storage engine receives an internal query identical to one already in cache, it returns the corresponding datacache without doing any scan of data in memory.
So, the two distinct tables are in cache as well as the "union(table1, table2)".
Best Regards,
Icey
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Hello @Icey
Thank you for your response.
It seems there is a contradiction in your explanation.
DAX query engine has two execution units, higher-level - the formula engine and lower-level - the storage engine (VertiPaq).
And...
The goal of the storage engine is to scan the VertiPaq storage and produce datacaches, which are then read by the formula engine.
What storage engine are you referring to when this storage engine is scanning the VertiPaq storage?
Perhaps an example:
VAR t1 = 'table 1'
/* This table reference contains 1000 records and 5 columns, so 5000 'points' */
VAR t2 = filter ( 'table 1' , 'table 1'[column 1] = 'current table'[column 1] )
/* So table reference of 'table 1' has 1000 records, but the returned one gives only 100 records, so 500 'points' */
VAR t3 = filter ( t1 , [column 1] = 'current table'[column 1] && [column 2] = 'current table'[column 2] )
/* Say the returned table hold 10 records, so 50 'points' */
VAR t4 = filter ( t2 , [column 2] = 'current table'[column 2] )
/* This would return the same amount of records as t3 */
Based on your earlier explanation, I would suggest that t4 is the most efficient way as the t2 table reference has already been cached in VertiPaq and has a much smaller size than the t1 table reference. So when the custom column goes to the next record, the apply of the filter is quicker done on the much smaller table reference in t2 of only 500 'points' than the much larger t1 of 5000 'points'.
So back to my original question, if I understand correctly that when 'table 1' and 'table 2' table references are static during calculations, the cached table reference gained from both the DAX expressions will be identical in size. However when, for example, 'table 1' would be dynamic, it is more efficient to use:
VAR t1 = distinct ( 'table 2' )
VAR t2 = distinct ( 'table 1' , t1 )
Than
VAR t2 = distinct ( 'table 1' , distinct ( 'table 2' )
However... there would be no difference between the one with VAR t1 and:
VAR t1 = distinct ( 'table 2' )
VAR t2 = distinct ( distinct ( 'table 1' ) , t1 )
Am I correct in my understanding of your explanation on how the engine work and how a change in DAX expression for the table references could improve the calculation speed on custom columns?
Starting December 3, join live sessions with database experts and the Fabric product team to learn just how easy it is to get started.
March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early Bird pricing ends December 9th.
User | Count |
---|---|
94 | |
84 | |
84 | |
72 | |
49 |
User | Count |
---|---|
143 | |
131 | |
109 | |
64 | |
55 |