Solved: Group By on (very) large table

andershh · ‎08-19-2016

Hi all,

I am rather new to PowerBI but I hope some of you can help me with this question.

I am working with an Oracle database holding data on compettition participation for customers. Each competition takes place over 6 days, and customers can compete numerous times on the same day.

The data I am working with is structured like this:

Column 1: CompetitionNumber

Column 2: CompetitionDay

Column 3: CustomerID

So as an example a couple of row could look like:

CompetitionNumber	CompetitionDay	CustomerID
1	1	778856
1	1	778856
1	2	778856
1	2	808012

What I am trying to do is to get a measure of how many customers participated each day, so what I am doing is a Group BY on CompetitionNumber and CompetitionDay and counting the number of distinct rows.

However I am running into severe PowerBI performance issues - with PowerBI slowly filling up my PC's RAM over severtal hours doing the Gropu By until PowerBI crashes. The table I am working is very large - holding around 100+ million rows. However, when I try to do this via SQLDeveloper on the Oracle database itself it takes around 10 minutes for Oracle to compute the GroupBY.

As I see it there three different ways forward:

Pull the data out manually using SQLDeveloper and Oracle. I have tried this, but that brings the main downside that my online PowerBI report won't be able to auto refresh itself using my On-premises gateway, so that won't be a long-term solution.
Find some way to let PowerBI connect to the Oracle database using SQL, not PowerQuery (i.e. to let the Oracle database do the heavy lifting in terms of calculations). Could using an R query accomplish this (with something like ROracle) while still being able to use the on-premises gateway for auto refresh?
Split up my PowerBI query into several sub-queries using filters on Competition - i.e. make a PowerBI query that extracts for each competition number. Using competition number level filtering gets the data size down to around 6 million, which I think would be more manageable for PowerBI. And then in the end combine each query into a single table.

What do you think - is there something else I could try to overcome this obstacle?

Many thanks beforehand!

ankitpatira · ‎08-19-2016

@andershh When connecting to Oracle database you have an option to write SQL query. So in my opinion it is better option to bring data in the format required through SQL and let Oracle processing happen on the database as you've already mentioned.

Another option is to use R script within query editor. So only dataset is imported in power bi desktop, run R script to perform grouping via Run R Script option which adds R step to the applied steps. I think that would be second best option as R is very powerful in performing this type of operation and you will notice less amount of time being spend on grouping.

View solution in original post

ankitpatira · ‎08-19-2016

@andershh When connecting to Oracle database you have an option to write SQL query. So in my opinion it is better option to bring data in the format required through SQL and let Oracle processing happen on the database as you've already mentioned.

Another option is to use R script within query editor. So only dataset is imported in power bi desktop, run R script to perform grouping via Run R Script option which adds R step to the applied steps. I think that would be second best option as R is very powerful in performing this type of operation and you will notice less amount of time being spend on grouping.

andershh · ‎08-23-2016

@ankitpatira - Thank you for taking the time to try and help me! I think the SQL idea is very good (to be honest I hadn't noticed the option when connecting to a data source). However, since I am using the same Oracle connection for other PowerBI reports and data tasks - I haven't been able to add a duplicate connection to the database with the SQL inserted. If I change the configuration of the Oracle data source, then I will change it for all my Power BI reports that utilizes it - right?

Nevertheless, if this hadn't been the case I am certain your suggestion would have been an excellent solution - I am therefore marking this a solution.

@AlexChen - Also a big thank you to you for taking the time to write a thorough reply. I have moved forward with the third option I specified in the original. I think your suggestion for fixing the table is spot on - however I do not have the admin rights to the database (I can only extract data from it), so I can't utilize your excellent suggestion. But many thanks nevertheless!

AlexChen · ‎08-22-2016

Hi,

I think the 3rd option is the best.

But I found the another issue is that there are too many duplicated records in your table.

Please refer to my steps below:

1. create a table call competition_copy,

create table competition_copy as select * from competition;

2. remove duplicated records in competition_copy,

delete from competition_info_copy a where a.rowid > any (select b.rowid from competition_info_copy b where b.competition_number = a.competition_number
and b.competition_day = a.competition_day and b.customer_id = a.customer_id)

3. split competition_copy into severial smaller table such as :

create table competition_a as select * from competition_copy a where a.competition_number = 1;

create table competition_b as select * from competition_copy a where a.competition_number = 2;

Remember to add index for these columns that will be used for query frequently.

4. create an insert trigger for table competition.

When you insert a record into table competition, for example, the new records’ competition_number =1, competition_day = 3, customer_id = XX, Check if this record existed in competition_a table or not. If not , insert into competition_a table.

5. Now you can create a table to get your group by records by combine several smaller tables’ combined records

Table =
Union (
                Summarize(
                    Competition_a,
                    Competition_a[competition_number],
                    Competition_a[competition_day],
                    Competition_a[customer],
                    “count”, countax(competition_a, customer_id)
             ),

            Summarize(
                    Competition_b,
                    Competition_b[competition_number],
                    Competition_b[competition_day],
                    Competition_b[customer],
                    “count”, countax(competition_b, customer_id)
             ),

….
)

7. Now you can create a measure to show your count of customers quickly. And it will be auto refreshed using your on-premise gateway.

Best Regards
Alex

Group By on (very) large table

Helpful resources

Power BI Dataviz World Championships

Power BI Monthly Update - December 2025

FabCon Atlanta 2026

How to Get Your Question Answered Quickly

FabCon is coming to Atlanta