Solved: Merge first or remove duplicates (query folding)

akhaliq7 · ‎02-28-2022

I am building a data model and I see in certain tutorials it is good practice to do certain power query steps before others. My question is in a query where I need to do both merges and remove duplicates which operation should I do first for best practice/performance.

BA_Pete · ‎02-28-2022

Hi @akhaliq7 ,

There's quite a lot to unpack with a question like this. A lot of it come down to 'it depends'.

First and foremost, if you're working on a foldable data source (SQL Server, for example) then your real focus needs to be on making sure that query folding is maintained, rather than in what order it happens. On a fully folded query, Power BI will optimise the order of operations behind the scenes when it generates the native query to go to the source.

If you're not working on a foldable source, then you would need to evaluate your question based on your specific scenario. For example, if you have thousands of duplicated records in your data then, predictably, you would likely see performance gains in the merge process by reducing these rows first. However, if your merge operation introduces duplicates through crossjoins, then you'd possibly see overall performance gains by only performing the remove duplicates step once, after the merge has completed.

If you really want to improve your merge performance then don't do them at all - load both tables to the data model and relate them. Let VertiPaq/AS Tabular/DAX do the work.

Pete

Now accepting Kudos! If my post helped you, why not give it a thumbs-up?

Proud to be a Datanaut!

View solution in original post

Anonymous · ‎02-28-2022

Here is that article: https://blog.crossjoin.co.uk/2018/03/16/improving-the-performance-of-aggregation-after-a-merge-in-po...

Removing duplicates reduces his query time from 54 seconds to 12 seconds.

--Nate

akhaliq7 · ‎02-28-2022

Thanks for your help

Anonymous · ‎02-28-2022

In my experience it definitely is better to remove the duplicates before the join. Power query is much more efficient when it knows that you have no duplicates in your first table. See @cwebb artices with his "Price Paid" queries. It definitely makes a huge difference in performance, at least is always does for me!

--Nate

BA_Pete · ‎02-28-2022

Hi @akhaliq7 ,

There's quite a lot to unpack with a question like this. A lot of it come down to 'it depends'.

First and foremost, if you're working on a foldable data source (SQL Server, for example) then your real focus needs to be on making sure that query folding is maintained, rather than in what order it happens. On a fully folded query, Power BI will optimise the order of operations behind the scenes when it generates the native query to go to the source.

If you're not working on a foldable source, then you would need to evaluate your question based on your specific scenario. For example, if you have thousands of duplicated records in your data then, predictably, you would likely see performance gains in the merge process by reducing these rows first. However, if your merge operation introduces duplicates through crossjoins, then you'd possibly see overall performance gains by only performing the remove duplicates step once, after the merge has completed.

If you really want to improve your merge performance then don't do them at all - load both tables to the data model and relate them. Let VertiPaq/AS Tabular/DAX do the work.

Pete

Now accepting Kudos! If my post helped you, why not give it a thumbs-up?

Proud to be a Datanaut!

akhaliq7 · ‎02-28-2022

I'm quite new to power bi so not doing merges and as your saying using the vertipaq/as tabular/dax way will be difficult to achieve. But I will definetly look into it in the future. Thanks

BA_Pete · ‎02-28-2022

Hi @akhaliq7 ,

Not merging tables, but instead relating them in the data model, is standard practice, not some scary advanced process 🙂

Have a look here around using/creating a STAR SCHEMA model in Power BI. As I said, this is standard best practice so should really be what you are aiming for where possible:

https://docs.microsoft.com/en-us/power-bi/guidance/star-schema

Pete

Now accepting Kudos! If my post helped you, why not give it a thumbs-up?

Proud to be a Datanaut!