Hi, I have c.20 million rows of data with about 15 columns per row/transaction, but I want to limit the output into the Desktop by using the Group By function in Power Query. This means I am waiting a long time for it to do its calculations.
Is there anything I need to be aware of, i.e. should I not be grouping the data using an already calculated column, etc.?
Do you apply the simple grouping, or the advanced one (using the fourth and fifth arguments)?
Just a simple grouping, with 5 grouping columns:
"Month Cohort", "Week Ending", "Confidence Group", "category", "type"
The first 2 are created using the Date option when adding a new column, based off the existing transaction date column:
= Table.AddColumn(#"Week ending", "Month Name", each Date.MonthName([operation_date]), type text)
"Confidence Group" is simple conditional column, where:
= Table.AddColumn(#"Removed Duplicates", "Confidence Group", each if [confidence] >= 0.5 then "High" else "Low")
The last 2 groupings are based on existing database table columns.
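For reference, a simple grouping over those 5 columns is a three-argument Table.Group call, roughly as sketched below; the preceding step name and the row-count aggregation are assumptions, since the actual aggregations aren't shown in the thread. The advanced form would add a fourth argument (a GroupKind) and a fifth (a comparer), which aren't needed here.

= Table.Group(
    #"Added Confidence Group",   // hypothetical name of the step before the grouping
    {"Month Cohort", "Week Ending", "Confidence Group", "category", "type"},
    {{"Transaction Count", each Table.RowCount(_), Int64.Type}}   // assumed aggregation, for illustration only
)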
To improve the performance of Group By in Power Query with large datasets, filter the data first and avoid grouping on complex calculated columns. Limit the number of grouping columns and use simpler aggregation functions. Ensure data types are correct, and consider loading the data into the data model for better performance. Lastly, test with smaller samples before applying changes to the full dataset.
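As a rough illustration of the filter-first and trim-columns advice, the opening steps of the query could look something like the sketch below. The server, database, table, column list, and date cut-off are all assumptions made for the example; against a SQL source these steps normally fold back into the generated SQL, so fewer rows ever reach the Group By step.

let
    Source = Sql.Database("your-server", "your-database"),   // assumed connection details
    Transactions = Source{[Schema = "dbo", Item = "transactions"]}[Data],   // assumed table name
    // keep only the columns the grouping and aggregations actually need
    KeptColumns = Table.SelectColumns(Transactions, {"operation_date", "confidence", "category", "type"}),
    // filter early so fewer rows reach the grouping step (assumes operation_date is typed as date)
    Filtered = Table.SelectRows(KeptColumns, each [operation_date] >= #date(2024, 1, 1))
in
    Filtered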
Hi,
Thanks for the solutions pbiuseruk and BabyYoda offered; I want to add some more information for users to refer to.
Hello @MattMcMillin_71, based on your description, it's best not to group on calculated columns, as this affects performance. You can refer to the following suggestions:
1. As @pbiuseruk and @BabyYoda mentioned, consider doing the grouping at the data source.
2. Simplify your data types and only load the important columns into Power BI; you can refer to the following link:
Optimization guide for Power BI - Power BI | Microsoft Learn
3. If your data source supports it, consider using incremental refresh so that only new or changed data is processed instead of the whole dataset; this can improve performance. You can refer to the following link about incremental refresh (a sketch of the required RangeStart/RangeEnd filter step follows this list):
Incremental refresh for semantic models in Power BI - Power BI | Microsoft Learn
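For item 3, incremental refresh hinges on filtering the source rows by two datetime parameters named RangeStart and RangeEnd, which Power BI supplies per partition at refresh time. A minimal sketch of that filter step, assuming operation_date is a datetime column and "Transactions" stands in for whatever the preceding step is called:

= Table.SelectRows(Transactions, each [operation_date] >= RangeStart and [operation_date] < RangeEnd)   // Transactions is a placeholder for the preceding step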
Best Regards!
Yolo Zhu
If this post helps, then please consider accepting it as the solution to help the other members find it more quickly.
Agree with @BabyYoda - if you're using a DB, can you have a view created and then query that instead?
It's a SQL database table, which is refreshed daily as part of a pipeline.
You have 2 options then, I reckon. If it's a Fabric pipeline, a Spark-based pipeline, or an Azure Data Factory pipeline, you can add transformation steps to do the aggregations there, and then the Power Query engine won't need to spend time on that work.
Alternatively, and the better option in my opinion, would be to create a view based on that table within the SQL database that you're querying. You can do things like grouping and roll-up within SQL statements. If you're not the best at SQL, then maybe use ChatGPT to try and convert your M code into SQL, which you can use to create the view.
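Once such a view exists, the Power Query side becomes a simple navigation step, roughly as below; the server, database, and view name (dbo.vw_TransactionSummary) are assumptions for the sketch, and the GROUP BY logic would live inside the view itself.

let
    Source = Sql.Database("your-server", "your-database"),   // assumed connection details
    // the view performs the grouping/roll-up in SQL, so only aggregated rows are transferred
    Summary = Source{[Schema = "dbo", Item = "vw_TransactionSummary"]}[Data]   // hypothetical view name
in
    Summary

Because the aggregation happens in the database, only the pre-grouped rows cross the wire rather than the c.20 million source transactions.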
Can you do the grouping at the source so that you aren't bringing in so many rows? What type of data source is it?