Optimizing merge operation for large table

Gabry · 09-13-2024

Hello everyone,

I'm currently working in Power Query with a fact table that contains around 200 million rows. My goal is to perform a merge operation to incorporate IDs from a dimension table. Given the size of the data, I'm concerned about performance and efficiency. I tried it, and it ran for many hours even with the enhanced compute engine turned on, using PPU.

I need to perform multiple merges with dimension tables to bring in the required IDs.

I am wondering whether selecting the columns before merging could improve performance. So instead of doing just this:

#"Merged queries 1" = Table.NestedJoin(Source, {"BCode"}, #"dim TB", {"TBCode"}, "TB", JoinKind.LeftOuter)

I would do this:

TB = Table.SelectColumns(#"dim TB", {"TBCode", "TBID"}),
#"Merged queries 1" = Table.NestedJoin(Source, {"BCode"}, TB, {"TBCode"}, "TB", JoinKind.LeftOuter)
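
As a self-contained sketch of this second variant: the inline tables below are made-up stand-ins for Source and #"dim TB", and the final expand step (not shown above) is an assumption about how TBID would then be pulled out of the nested "TB" column.

```powerquery
let
    // Hypothetical sample data standing in for the real fact table (Source)
    Source = Table.FromRecords({
        [BCode = "A", Amount = 10],
        [BCode = "B", Amount = 20],
        [BCode = "C", Amount = 30]
    }),
    // Hypothetical sample data standing in for #"dim TB"; Description is an extra column we do not need
    DimTB = Table.FromRecords({
        [TBCode = "A", TBID = 1, Description = "first"],
        [TBCode = "B", TBID = 2, Description = "second"]
    }),
    // Keep only the join column and the ID before merging
    TB = Table.SelectColumns(DimTB, {"TBCode", "TBID"}),
    // Left outer merge: every fact row is kept, matches land in the nested "TB" column
    Merged = Table.NestedJoin(Source, {"BCode"}, TB, {"TBCode"}, "TB", JoinKind.LeftOuter),
    // Expand only the ID out of the nested table
    Expanded = Table.ExpandTableColumn(Merged, "TB", {"TBID"})
in
    Expanded
```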

Could this improve performance?

Are there any other strategies or best practices that could help optimize the merge operation?

I’d appreciate any insights or suggestions on how to handle large-scale merges more efficiently. Does selecting only necessary columns prior to merging generally provide significant benefits? Are there other optimization techniques you would recommend?

Thanks in advance for your help!

Anonymous · 09-13-2024

Yes, selecting only the columns you need on both queries will usually help. Ensuring that your join columns are both integers will also help. What will help you the most is if both of your join columns are sorted the same way: then, instead of using Table.NestedJoin, you can use Table.Join and add the final parameter, JoinAlgorithm.SortMerge. I assure you that this will speed up your query significantly. Table.Join puts the columns of both tables into one flat table, so if the two join columns share a name you will need to rename one of them to avoid duplicate column names after the join.
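
A minimal sketch of that approach, with small made-up tables in place of the real queries. It assumes both inputs are sorted ascending on the join column. The Table.Sort steps are only there to make the sample self-contained; on 200 million rows the sort would ideally already be done at the source. If the inputs are not actually sorted the same way, SortMerge may return incorrect results.

```powerquery
let
    // Hypothetical sample data standing in for the fact table and #"dim TB"
    Fact = Table.FromRecords({
        [BCode = "B", Amount = 20],
        [BCode = "A", Amount = 10],
        [BCode = "C", Amount = 30]
    }),
    DimTB = Table.FromRecords({
        [TBCode = "B", TBID = 2],
        [TBCode = "A", TBID = 1]
    }),
    // Keep only the join column and the ID from the dimension table
    TB = Table.SelectColumns(DimTB, {"TBCode", "TBID"}),
    // SortMerge relies on both inputs being sorted on the join column in the same order
    FactSorted = Table.Sort(Fact, {{"BCode", Order.Ascending}}),
    TBSorted = Table.Sort(TB, {{"TBCode", Order.Ascending}}),
    // Flat join; the key columns already have different names (BCode vs. TBCode), so no rename is needed here
    Joined = Table.Join(FactSorted, "BCode", TBSorted, "TBCode", JoinKind.LeftOuter, JoinAlgorithm.SortMerge),
    // The dimension key is redundant after the join
    Result = Table.RemoveColumns(Joined, {"TBCode"})
in
    Result
```

Because this is a left outer join, the fact row with BCode "C" has no match in the sample dimension and should come back with a null TBID.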

--Nate

Gabry · 09-13-2024

OK, I will try selecting the columns as I wrote above and check the performance. But from what I read on Chris Webb's blog, it shouldn't change much.

I can't use the sort merge algorithm because I need to join the fact table with many different dim tables. I also can't use integers, because I use text codes to bring in the IDs. It's sad that there aren't any other options to improve performance.

Could marking the code columns as keys make any difference?
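
A sketch of what marking the key could look like in M, assuming Table.AddKey is the intended mechanism. The inline tables are made-up sample data, and whether the engine actually uses the key to speed up the merge is exactly the open question.

```powerquery
let
    // Hypothetical sample data standing in for the fact table and #"dim TB"
    Fact = Table.FromRecords({
        [BCode = "A", Amount = 10],
        [BCode = "B", Amount = 20]
    }),
    DimTB = Table.FromRecords({
        [TBCode = "A", TBID = 1],
        [TBCode = "B", TBID = 2]
    }),
    // Declare TBCode as the primary key of the dimension table (third argument: isPrimary)
    TBKeyed = Table.AddKey(DimTB, {"TBCode"}, true),
    // Same merge as before, now against the keyed dimension table
    Merged = Table.NestedJoin(Fact, {"BCode"}, TBKeyed, {"TBCode"}, "TB", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Merged, "TB", {"TBID"})
in
    Expanded
```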