JohnFoelster
Regular Visitor

Power Query Can't Seem to Deal With Calculating and Comparing Percentages

Hi.

 

So, I'm new to Excel Power Query but have lots of experience with SQL databases.  I'm trying to do something that I really don't think Power Query should be having as much of a problem with as it does... and it is having problems.

 

I have a moderately sized PostgreSQL database of historic election results.  I've been trying to transfer it into a complex Excel spreadsheet so I can hand it over to someone who might be interested in checking my work without their having to install PostgreSQL.  I've also been checking the data against various demographic stats for the relevant jurisdiction and running simple correlations between the two.  Postgres HATED doing this, although that was apparently partly because I built my queries as a series of nested views so that the calculation steps would be easy for a layman to understand.  I ended up rewriting the queries calling Postgres' correlation function as Common Table Expressions to get the damned thing to produce an answer at all.

Up to this point, Excel with Power Query has done a better job, on the whole.  I can't get it to use either the GROUPBY function in Excel or a Power Query group-by to calculate my correlations, but I've been able to work around that by outputting the last step of the data, before the correlations are calculated, into a nice big table on a worksheet and then using Excel VBA to write the formulas with the CORREL function on the final worksheet that references the previous one, rather than going through the drudgery of selecting the relevant ranges by hand.

But what I'm doing now is breaking its fragile little mind.

 

Basically, I've got eight big fat tables on eight worksheets that have election results for different types of ballots.  One table has all the candidate results for ballots cast in precincts over a number of years, another has the total votes cast in the appropriate races for those candidates in those precincts, another has the candidate results for absentee-ballot special precincts, and then another has the total votes cast in the appropriate races for those absentee special precincts... and so on.  The biggest one is 26,329 lines of data, but I've got another database I may want to do the same operation on, with about 115k lines in the chonkiest table.

What I have been doing up to this point is adding up all the votes cast in various subdivisions of this jurisdiction by ballot type, calculating the percentages earned by these candidates in those subdivisions by ballot type, and comparing those percentages to the percentages earned by those candidates in the overall vote in those same subdivisions, outputting the difference in percentage points.  At first this was done with a series of worksheets that used Excel's GROUPBY function on the values in the previous worksheet, just copying and pasting the formula for calculating a percentage where appropriate, but about halfway through the job I realized Power Query offered more flexibility for doing joins, which was not really practical with the above method, and I started running the calculations with it.

So what I'm trying to do now is take the average percentage earned by all candidates of a given party in a given year in each type of ballot, and compare that percentage systematically to the percentages on a set of ballot measures voted on at the same time.  I'd like to make this comparison down to the precinct level if possible, but right now I'm hitting a roadblock just doing the comparisons for the whole jurisdiction.  I'm only looking at three ballot measures from one year right now, but I may try to extend it to ballot measures in all years later.

The ballot measure results are stored in another set of eight tables on separate worksheets.  So basically there's a query that uses a nested join to merge the totals of the candidates on the precinct ballots with the total votes cast on the precinct ballots and calculates the percentages.  Then there's a duplicate of that query (I learned referencing queries is bad for performance) which takes the average of the candidates for each year.  Then there's another pair of those queries for each ballot type, plus a query that calculates the grand total for each candidate and a duplicate of it that calculates the averages.  There's a grand-total query that adds up all candidate totals by merging the totals queries for each ballot type; it then gets merged with a grand-total query for all ballots cast of each ballot type, made the same way, and then that percentage gets calculated and averaged.  Then there are the queries that tally up and calculate the percentages in each category for the ballot measures.  Each of these individual queries runs fairly efficiently.  But when you try to merge the query containing the percentages for each ballot measure in each type with each of the queries for each ballot type's averages per candidate... then things slow way the heck down.

I know the first thing I should try is pre-filtering those queries calculating candidate averages to just the year I'm actually going to use, but I'd like to keep open the option of doing this comparison for ballot measures in other years.

I'm sorry to have gone on about this for so long, but I wanted to state the problem as clearly as possible.

So what's my solution?  Is there a fundamentally more efficient approach than the one I've implemented?  Is the problem the nested queries rather than the complexity of the calculation?  Do I just need to suck it up and redo all my nested queries in the advanced editor so that each one explicitly states all the fundamental tables as its source and redoes all the cumulative logic as though it were a duplicate query?  That'd mimic what I ended up doing in Postgres with the CTEs.

Thanks for reading and for any help that can be provided.

1 ACCEPTED SOLUTION
cengizhanarslan
Solution Sage

Power Query isn’t bad at percentages; it’s getting slow because your design forces it to repeatedly materialize and merge intermediate results, and (most importantly) you’re likely losing query folding back to PostgreSQL once you start doing complex/nested transforms. When folding breaks, Power Query pulls more data locally and does joins/grouping in the mashup engine, which tanks performance.

 

A better pattern (very similar to what you ended up doing with CTEs) is:

 

1) Normalize first, then aggregate once
Right now you have 8 “wide” tables per subject (candidates + totals + measures) and you keep merging them. Instead:

  • Append the 8 ballot-type tables into one fact table for candidates with a column BallotType

  • Append the 8 ballot-type tables into one fact table for measures with a column BallotType

This removes the need for many separate merges.
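
For what it's worth, a minimal M sketch of that append step, assuming the eight tables already live in the workbook as Excel tables and that the names below (PrecinctCandidates, AbsenteeCandidates, EarlyCandidates, ProvisionalCandidates) are placeholders for your real ones:

    // Tag each ballot-type table with a BallotType column, then append them
    // into one long candidate fact table. Table names are placeholders.
    let
        Precinct    = Table.AddColumn(Excel.CurrentWorkbook(){[Name="PrecinctCandidates"]}[Content], "BallotType", each "Precinct", type text),
        Absentee    = Table.AddColumn(Excel.CurrentWorkbook(){[Name="AbsenteeCandidates"]}[Content], "BallotType", each "Absentee", type text),
        Early       = Table.AddColumn(Excel.CurrentWorkbook(){[Name="EarlyCandidates"]}[Content], "BallotType", each "Early", type text),
        Provisional = Table.AddColumn(Excel.CurrentWorkbook(){[Name="ProvisionalCandidates"]}[Content], "BallotType", each "Provisional", type text),
        CandidatesLong = Table.Combine({Precinct, Absentee, Early, Provisional})
    in
        CandidatesLong

Do the same once for the totals and for the measures tables, and every later step works against a single long table instead of eight wide ones.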

 

2) Pre-aggregate before you merge
Do the heavy Group By first (Year, Party, BallotType, Geography), compute totals and percentages, and then merge the small aggregated tables.

Example approach:

  • CandidatesAgg: group by Year, Party, BallotType, Precinct (or Jurisdiction) → sum votes, sum total votes, then %

  • MeasuresAgg: group by Year, Measure, BallotType, Precinct (or Jurisdiction) → sum yes/no, total, then %

Then merge CandidatesAgg ↔ MeasuresAgg on Year + BallotType + Geography. That join is tiny compared to joining raw rows.
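
A rough M sketch of that shape, assuming the appended staging queries from step 1 are called CandidatesLong and MeasuresLong and that the column names (Votes, TotalVotes, Yes, No, ...) stand in for your actual schema:

    // Group each long table down to the comparison grain first, then merge
    // the two small aggregated tables and take the percentage-point difference.
    let
        CandidatesAgg = Table.Group(CandidatesLong, {"Year", "Party", "BallotType", "Precinct"},
            {{"Votes", each List.Sum([Votes]), type number},
             {"TotalVotes", each List.Sum([TotalVotes]), type number}}),
        CandidatesPct = Table.AddColumn(CandidatesAgg, "CandidatePct", each [Votes] / [TotalVotes], type number),

        MeasuresAgg = Table.Group(MeasuresLong, {"Year", "Measure", "BallotType", "Precinct"},
            {{"Yes", each List.Sum([Yes]), type number},
             {"Total", each List.Sum([Yes]) + List.Sum([No]), type number}}),
        MeasuresPct = Table.AddColumn(MeasuresAgg, "MeasurePct", each [Yes] / [Total], type number),

        Merged   = Table.NestedJoin(CandidatesPct, {"Year", "BallotType", "Precinct"},
                                    MeasuresPct,   {"Year", "BallotType", "Precinct"},
                                    "Measures", JoinKind.Inner),
        Expanded = Table.ExpandTableColumn(Merged, "Measures", {"Measure", "MeasurePct"}),
        WithDiff = Table.AddColumn(Expanded, "PctPointDiff", each [CandidatePct] - [MeasurePct], type number)
    in
        WithDiff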

 

3) Keep folding as long as possible
In Power Query:

  • Do filtering, column selection, type changes early

  • Avoid steps that commonly break folding before your final aggregation/merge (custom functions, some “Add Column” with complex logic, merging too early, etc.)

  • Right-click a step → View Native Query. If it disappears, folding broke before that step.

If folding stays, PostgreSQL will do the joins/grouping fast.
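
As a sketch only (the server, database, table and column names are placeholders, and this matters only while PostgreSQL is still the source), a step order like this will normally stay foldable all the way through the Group By:

    // Select columns and filter rows first, then group; all of these steps
    // typically fold, so PostgreSQL does the work and only the small
    // aggregated result comes back to Excel.
    let
        Source   = PostgreSQL.Database("localhost", "elections"),
        Results  = Source{[Schema = "public", Item = "candidate_results"]}[Data],
        Slim     = Table.SelectColumns(Results, {"year", "party", "ballot_type", "precinct", "votes"}),
        Filtered = Table.SelectRows(Slim, each [year] = 2022),
        Grouped  = Table.Group(Filtered, {"party", "ballot_type", "precinct"},
                       {{"votes", each List.Sum([votes]), type number}})
    in
        Grouped

If View Native Query is still available (not greyed out) on the last step, everything above it folded.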

 

4) Stop duplicating queries for performance
Duplicating often makes things worse. Prefer a single query with clear steps (like a CTE) and only reference the final small output if needed. The real win is folding + pre-aggregation.

 

5) If you’re exporting to Excel for a layperson
Consider exporting the final aggregated outputs (CandidatesAgg, MeasuresAgg, and the comparison table), not the entire intermediate chain. That keeps refresh light and the workbook understandable.

5 REPLIES
JohnFoelster
Regular Visitor

Before I forget, whenever this gets through moderation, thank you very much.

JohnFoelster
Regular Visitor

It would appear that my replies are being held in moderation?

That's a good place to start from.  I'll look at implementing these suggestions.

Getting data from Postgres is simply not happening.  They're historical election results, so they don't change.  I was comfortable just slopping them all out onto the worksheet.  There are also only four ballot types, so a total of eight wide tables; I'm sorry if I was unclear on that.  (Then again, this place has a large number of absentee and provisional subtypes that get counted separately but usually get rolled up into one district-wide special precinct in the publicly released counts.  They may ultimately need to be included.)

Part of the problem with both this mess and the original morass of a Postgres system was that I generally did the first thing that worked and gave me an answer, rather than sitting down and thinking out in advance what the optimal solution would be.

In this case, for a previous attempt at doing only what was needed to generate the final visual table I wanted, I used the following process without Power Query: dumping out the data into the big wide worksheet tables for each ballot type, using Excel's GROUPBY to aggregate on subsequent totals worksheets, then calculating the percentages on an additional worksheet referencing each of the previous set, then another sheet calculating the percentage-point differences, and a final one using GROUPBY on that sheet to get the averages by year.  That system basically got copied and pasted into the current, more complex Excel sheet when I started working on it, and I only discovered Power Query midway through the job.

If I'm understanding your point 5) correctly: keep the intermediate queries in Power Query as connection only and do not export them to Excel worksheets as tables.  Have I got that right?

cengizhanarslan
Solution Sage

Yes, that’s exactly right.

 

If the source data is static and already dumped into Excel tables, you’ll get the best performance by:

  • Keeping all intermediate Power Query steps as “Connection only” (don’t load them to worksheets)

  • Loading only the final outputs you actually need (e.g., CandidatesAgg, MeasuresAgg, Final comparison)

The reason is that every “Load to sheet” creates a materialized table that Excel has to store, refresh, and sometimes recompute dependencies for, which slows everything down.

 

A practical setup for your case:

  1. Raw wide tables (loaded once)
    Your 4 ballot types × (candidate + totals) etc. stay as the raw tables.

  2. Power Query staging queries (Connection only)

  • Append ballot types into one long table (add BallotType)

  • Clean / select only needed columns

  • Group to your required grains (year/party/geo)

  3. Final query outputs (Load to sheet)
    Only load:

  • the final comparison table(s) you chart/report from

  • any small “check my work” tables you want the reviewer to see

One more tip since you’re no longer using PostgreSQL: Append first, then Group By (so you group once). That will usually be much faster than doing “group per ballot type then merge”.
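
A minimal sketch of that pattern, assuming four Excel tables named PrecinctCandidates, AbsenteeCandidates, EarlyCandidates and ProvisionalCandidates (placeholders) with matching Year, Party, Precinct, Votes and TotalVotes columns:

    // Append all ballot types first, then do one Group By over the long table,
    // so the grouping and percentage logic lives in a single place.
    let
        BallotTypes = {"Precinct", "Absentee", "Early", "Provisional"},
        Tagged = List.Transform(BallotTypes, (bt) =>
            Table.AddColumn(Excel.CurrentWorkbook(){[Name = bt & "Candidates"]}[Content],
                "BallotType", each bt, type text)),
        Appended = Table.Combine(Tagged),
        Grouped = Table.Group(Appended, {"Year", "Party", "BallotType"},
            {{"Votes", each List.Sum([Votes]), type number},
             {"TotalVotes", each List.Sum([TotalVotes]), type number}}),
        WithPct = Table.AddColumn(Grouped, "Pct", each [Votes] / [TotalVotes], type number)
    in
        WithPct

Keep that query as connection only if it is just a staging step, and load only the final comparison table to a sheet.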
