Hello
Power BI seems to work in a way that means it cannot get the best out of Azure SQL Data Warehouse - more specifically Power BI seems likely to generate unnecessary data movement between the compute nodes. I want to outline my use case, check my understanding and see if anyone else has any comments/thoughts…
Example: I have the following (subset of) tables containing web analytics data from an e-Commerce solution:
Session
PageView
PageEvent
(One session has multiple page views, and each page view has multiple events).
Session joins to PageView on SessionID
PageView joings to PageEvent on PageViewID
In Azure SQL DW, all three tables contain the SessionID and all three tables are Hash distributed on this field.
This means all the data for a particular session resides within one compute node in the Azure SQL DWH.
This means hand-written analytical queries typically involve no data movement between compute nodes (other than to return the final results) as they are written roughly as follows:
SELECT ...
FROM Session s
INNER JOIN PageView pv
ON s.SessionID = pv.SessionID
INNER JOIN PageEvent pe
ON pv.PageViewID = pe.PageViewID and
pv.SessionID = pe.SessionID
…
When Power BI is connected to Azure DWH via Direct Query, it presumably won't generate the last line in the query above (the additional join criteria on SessionID), since it only supports one field joins.
This looks like it will lead to additional data movement between the compute nodes of an Azure SQL DWH instance.
I have investigated this by running a query (from Management Studio) similar to the above with and without the last line, and reviewing the steps in the query execution from sys.dm_pdw_request_steps.
With the “pv.SessionID = pe.SessionID” line:
step_index operation_type distribution_type location_type
0 OnOperation Unspecified Control
1 PartitionMoveOperation Unspecified DMS
2 ReturnOperation Unspecified Control
3 OnOperation Unspecified Control
Without the “pv.SessionID = pe.SessionID” line:
step_index operation_type distribution_type location_type
0 RandomIDOperation Unspecified Control
1 OnOperation AllComputeNodes Compute
2 BroadcastMoveOperation Unspecified DMS
3 OnOperation Unspecified Control
4 PartitionMoveOperation Unspecified DMS
5 OnOperation AllComputeNodes Compute
6 ReturnOperation Unspecified Control
7 OnOperation Unspecified Control
This additional join criteria is important since it enforces that the scope of the query is within each control node and so no data movement between nodes is needed until returning results.
This seems to make Power BI less useful as a client to generally browse an Azure SQL Data Warehouse. Of course, it is still possible to write specific queries using the additional join criteria, but then that isn’t using Power BI as a general DWH client.
Just in case anyone else is encoutering the problem above, the recently added COMBINEVALUES() function helps in the above scenario. Details already blogged at:
https://www.sqlbi.com/articles/using-combinevalues-to-optimize-directquery-performance
Hi @cbailiss,
What do you mean about "When Power BI is connected to Azure DWH via Direct Query, it presumably won't generate the last line in the query above (the additional join criteria on SessionID), since it only supports one field joins."? Do you mean you write the specific query when connect to the Azure SQL data warehouse? Or you just connect to those three tables, and merge queries in Query Editor?
When you write specific query in DirectQuery, you can use SQL Profile to trace query execution. Also you can try to connect to Azure SQL data warehouse in Import mode, then write the query to see if the results is the same as in SSMS.
If you just connect to three tables, then you can open Query Editor, use Merge Queries to merge PageView and PageEvent twice. For more information, please refer this article: Shape and combine data in Power BI Desktop.
Best Regards,
Qiuyun Yu
Hello @v-qiuyu-msft
Thank you for the reply.
>> Do you mean you write the specific query when connect to
>> the Azure SQL data warehouse?
No, not writing a specific query.
>> Or you just connect to those three tables, and merge
>> queries in Query Editor?
No, not joining three tables in the query editor.
I connect to the Azure DWH via a direct query connection.
Then I select those three tables.
I don't join the tables in the query editor.
The three tables come through to Power BI as three tables in the Power BI Model.
In the diagram view in Power BI I create the relationships between the tables. (These relationships can only be between one field in each table).
>> If you just connect to three tables, then you can open
>> Query Editor, use Merge Queries to merge PageView
>> and PageEvent twice. For more information, please
>> refer this article: Shape and combine data in Power BI Desktop.
I don't want to merge the tables in the query editor, because this limits the analysis that can be done in Power BI.
This scenario is about creating a general Power BI model that reflects the way the data is structured in the DWH.
I.e. the Power BI model is acting as a general client to the DWH.
But when used in this way, Power BI desktop cannot generate the additional join criteria highlighted in my first post (since it supports only one field joins) - this means unnecessary data movement steps appear in the query plan in Azure DWH.
Merging tables in the Query Editor is OK when performing a very specific analysis (but even then, I think the join is only on one one field if done via the UI, so the same problem happens).
But merging tables in the query stage doesn't work when wanting to use Power BI as a general DWH client.
Also, importing instead of DirectQuery doesn't work at the scale of data in Azure DWH. I.e. it is far bigger than fits in a Power BI Online model.
Thanks
Chris
Join us for a free, hands-on Microsoft workshop led by women trainers for women where you will learn how to build a Dashboard in a Day!