The ultimate Fabric, Power BI, SQL, and AI community-led learning event. Save €200 with code FABCOMM.
Get registeredEnhance your career with this limited time 50% discount on Fabric and Power BI exams. Ends September 15. Request your voucher.
I am building a data model and I see in certain tutorials it is good practice to do certain power query steps before others. My question is in a query where I need to do both merges and remove duplicates which operation should I do first for best practice/performance.
Solved! Go to Solution.
Hi @akhaliq7 ,
There's quite a lot to unpack with a question like this. A lot of it come down to 'it depends'.
First and foremost, if you're working on a foldable data source (SQL Server, for example) then your real focus needs to be on making sure that query folding is maintained, rather than in what order it happens. On a fully folded query, Power BI will optimise the order of operations behind the scenes when it generates the native query to go to the source.
If you're not working on a foldable source, then you would need to evaluate your question based on your specific scenario. For example, if you have thousands of duplicated records in your data then, predictably, you would likely see performance gains in the merge process by reducing these rows first. However, if your merge operation introduces duplicates through crossjoins, then you'd possibly see overall performance gains by only performing the remove duplicates step once, after the merge has completed.
If you really want to improve your merge performance then don't do them at all - load both tables to the data model and relate them. Let VertiPaq/AS Tabular/DAX do the work.
Pete
Proud to be a Datanaut!
Here is that article: https://blog.crossjoin.co.uk/2018/03/16/improving-the-performance-of-aggregation-after-a-merge-in-po...
Removing duplicates reduces his query time from 54 seconds to 12 seconds.
--Nate
Thanks for your help
In my experience it definitely is better to remove the duplicates before the join. Power query is much more efficient when it knows that you have no duplicates in your first table. See @cwebb artices with his "Price Paid" queries. It definitely makes a huge difference in performance, at least is always does for me!
--Nate
Hi @akhaliq7 ,
There's quite a lot to unpack with a question like this. A lot of it come down to 'it depends'.
First and foremost, if you're working on a foldable data source (SQL Server, for example) then your real focus needs to be on making sure that query folding is maintained, rather than in what order it happens. On a fully folded query, Power BI will optimise the order of operations behind the scenes when it generates the native query to go to the source.
If you're not working on a foldable source, then you would need to evaluate your question based on your specific scenario. For example, if you have thousands of duplicated records in your data then, predictably, you would likely see performance gains in the merge process by reducing these rows first. However, if your merge operation introduces duplicates through crossjoins, then you'd possibly see overall performance gains by only performing the remove duplicates step once, after the merge has completed.
If you really want to improve your merge performance then don't do them at all - load both tables to the data model and relate them. Let VertiPaq/AS Tabular/DAX do the work.
Pete
Proud to be a Datanaut!
I'm quite new to power bi so not doing merges and as your saying using the vertipaq/as tabular/dax way will be difficult to achieve. But I will definetly look into it in the future. Thanks
Hi @akhaliq7 ,
Not merging tables, but instead relating them in the data model, is standard practice, not some scary advanced process 🙂
Have a look here around using/creating a STAR SCHEMA model in Power BI. As I said, this is standard best practice so should really be what you are aiming for where possible:
https://docs.microsoft.com/en-us/power-bi/guidance/star-schema
Pete
Proud to be a Datanaut!