Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Be one of the first to start using Fabric Databases. View on-demand sessions with database experts and the Microsoft product team to learn just how easy it is to get started. Watch now

Reply
QuangMC
Frequent Visitor

Group By or Remove duplicates

Hello all,

 

So it's not so much a problem, but rather a question regarding 2 different ways to do something.

 

Context :

I have a fairly big table (around 3 million rows), that i'm currently working on. As usual, it's the sort of table where you have EVERYHTING (you guys know what i'm talking about, right ?). I have removed as much unecessary things as possible and segment my table into multiple ones (Facts and Dims). 

 

It works fine, i can clearly see the impact it has on my report.

 

Question :

There are some places where i use the "Group By" option from Power Query to get some aggregations, but in other places I use it simply to remove other columns and remove duplicates.

 

The thing is, I don't know if the optimal solution. I could also just "Remove unnecessary columns" and then use "Remove duplicates" to obtain the same results. 

 

So my question is, which is the best option performance wise ? I tried using the diagnostics tool, but I have a hard time interpreting the results. Both solutions seem fine, but is one actually better than the other ? If you had a huge dataset, which would be more appropriate ?

 

Thank you in advance.

1 ACCEPTED SOLUTION
Vijay_A_Verma
Super User
Super User

In Diagnostics, pick up the report generated which has aggregated word.

Select Exclusive duration column, Transform tab - Duration - Total seconds - Now Statistics - Sum

Note down the timing of both approaches. Whatever is lesser is the better. 

 

Now the first principle in optimizing query is to perform vertical filtering i.e. remove all unnecessary columns. 
Hence, if you perform remove columns followed by Group by, you will have very good amount of performance increase compared to do plain vanilla Group by.

 

Horizontal filtering (i.e. remove duplicates) is also one of the steps which we can resort to after Vertical filtering. But horizontal filtering will consume some finite time which will be added up to total time but same can be compensated by next steps as they will work on smaller dataset.

 

It is always better to work on a small size rectangle.

 

But you have to see how much time is taken in Query diagnostics on your data set for different approaches and choose the best one. 

View solution in original post

4 REPLIES 4
Vijay_A_Verma
Super User
Super User

In Diagnostics, pick up the report generated which has aggregated word.

Select Exclusive duration column, Transform tab - Duration - Total seconds - Now Statistics - Sum

Note down the timing of both approaches. Whatever is lesser is the better. 

 

Now the first principle in optimizing query is to perform vertical filtering i.e. remove all unnecessary columns. 
Hence, if you perform remove columns followed by Group by, you will have very good amount of performance increase compared to do plain vanilla Group by.

 

Horizontal filtering (i.e. remove duplicates) is also one of the steps which we can resort to after Vertical filtering. But horizontal filtering will consume some finite time which will be added up to total time but same can be compensated by next steps as they will work on smaller dataset.

 

It is always better to work on a small size rectangle.

 

But you have to see how much time is taken in Query diagnostics on your data set for different approaches and choose the best one. 

Hello,

 

Thank you for your full answer. In the end, both solutions are almost the same in terms of performance. I did however try to reduce the data even more BEFORE applying my solutions. It is indeed the best way to reduce the impact on performance.

 

Thanks again !

mahoneypat
Microsoft Employee
Microsoft Employee

When I am comparing two approaches, I just duplicate the query, modify the new one to use the other approach, and load them in parallel and see which wins. You can speed up the testing by keeping fewer rows temporarily.

 

Pat





Did I answer your question? Mark my post as a solution! Kudos are also appreciated!

To learn more about Power BI, follow me on Twitter or subscribe on YouTube.


@mahoneypa HoosierBI on YouTube


Thank you for the advice. I also thought about it, but couldn't find the time to do so. Pretty easy way to find the best solution.

Helpful resources

Announcements
Las Vegas 2025

Join us at the Microsoft Fabric Community Conference

March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!

Dec Fabric Community Survey

We want your feedback!

Your insights matter. That’s why we created a quick survey to learn about your experience finding answers to technical questions.

ArunFabCon

Microsoft Fabric Community Conference 2025

Arun Ulag shares exciting details about the Microsoft Fabric Conference 2025, which will be held in Las Vegas, NV.

December 2024

A Year in Review - December 2024

Find out what content was popular in the Fabric community during 2024.