Hello everyone,
I'm currently working in Power Query with a fact table that contains around 200 million rows. My goal is to merge it with a dimension table to bring in that table's IDs. Given the size of the data, I'm concerned about performance and efficiency. (I tried it, and it ran for many hours, even with the enhanced compute engine turned on and using PPU.)
I need to perform several of these merges, each one against a different dimension table, to bring in the required IDs.
I am wondering whether selecting the columns before merging could improve performance. So instead of doing just
#"Merged queries 1" = Table.NestedJoin(Source, {"BCode"}, #"dim TB", {"TBCode"}, "TB", JoinKind.LeftOuter)
Do this
TB = Table.SelectColumns(#"dim TB", {"TBCode", "TBID"}),
#"Merged queries 1" = Table.NestedJoin(Source, {"BCode"}, TB, {"TBCode"}, "TB", JoinKind.LeftOuter)
Could this improve performance?
Are there any other strategies or best practices that could help optimize the merge operation?
I’d appreciate any insights or suggestions on how to handle large-scale merges more efficiently. Does selecting only necessary columns prior to merging generally provide significant benefits? Are there other optimization techniques you would recommend?
Thanks in advance for your help!
Yes, selecting only the columns you need in both queries will usually help. Making sure that both join columns are integers will also help. What will help you the most is having both join columns sorted the same way: then, instead of Table.NestedJoin, you can use Table.Join and pass JoinAlgorithm.SortMerge as the optional join algorithm argument. I assure you that this will speed up your query significantly. You will need to rename one of the join columns if they share a name, so you don't end up with duplicate column names after the join.
--Nate
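For readers following along, the suggestion above could be sketched roughly as follows. This is a minimal illustration, not a tested solution for 200 million rows: `Fact` and `DimTB` are small hypothetical stand-ins for the real fact table and `#"dim TB"`, and the `Amount` column is invented. JoinAlgorithm.SortMerge assumes both inputs are already sorted on their join keys and can return incorrect results if they are not, so the sketch sorts explicitly. On a large fact table that sort has its own cost, which would ideally be paid at the source.

```powerquery
let
    // Hypothetical sample data standing in for the real fact and dimension tables
    Fact = Table.FromRecords({
        [BCode = "B", Amount = 20],
        [BCode = "A", Amount = 10],
        [BCode = "B", Amount = 30]
    }),
    DimTB = Table.FromRecords({
        [TBCode = "A", TBID = 1, Description = "first"],
        [TBCode = "B", TBID = 2, Description = "second"]
    }),

    // Keep only the join key and the ID from the dimension table
    TB = Table.SelectColumns(DimTB, {"TBCode", "TBID"}),

    // SortMerge relies on both inputs being sorted by the join key in the same order
    FactSorted = Table.Sort(Fact, {{"BCode", Order.Ascending}}),
    TBSorted = Table.Sort(TB, {{"TBCode", Order.Ascending}}),

    // Table.Join returns a flat table, so the two tables must not share column names
    Joined = Table.Join(
        FactSorted, "BCode",
        TBSorted, "TBCode",
        JoinKind.LeftOuter,
        JoinAlgorithm.SortMerge
    ),

    // Drop the dimension's code column once the ID has been brought in
    Result = Table.RemoveColumns(Joined, {"TBCode"})
in
    Result
```

Because Table.Join produces a flat result, no Table.ExpandTableColumn step is needed afterwards, unlike with Table.NestedJoin.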
OK, I will try selecting the columns as I wrote above and check the performance. But from what I read on Chris Webb's blog, it shouldn't change much.
I can't use the sort-merge algorithm because I need to join the fact table with many different dimension tables. I also can't use integers, because I use text codes to bring in the IDs. It's sad that there aren't any other options to improve performance.
Could marking the code columns as keys make any difference?
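For reference, marking a column as a key in M is done with Table.AddKey. The sketch below only shows what that step would look like in front of the merge; whether it makes a measurable difference on a table this size is not confirmed anywhere in this thread and would need to be tested. `Fact` and `DimTB` are hypothetical stand-ins with invented sample rows, and the sketch assumes `TBCode` is unique in the dimension table.

```powerquery
let
    // Hypothetical sample data standing in for the real fact and dimension tables
    Fact = Table.FromRecords({
        [BCode = "A", Amount = 10],
        [BCode = "B", Amount = 20]
    }),
    DimTB = Table.FromRecords({
        [TBCode = "A", TBID = 1],
        [TBCode = "B", TBID = 2]
    }),

    // Keep only the join key and the ID
    TB = Table.SelectColumns(DimTB, {"TBCode", "TBID"}),

    // Declare TBCode as the primary key (assumes the codes are unique)
    TBKeyed = Table.AddKey(TB, {"TBCode"}, true),

    // Same merge as in the original question, now against the keyed table
    Merged = Table.NestedJoin(Fact, {"BCode"}, TBKeyed, {"TBCode"}, "TB", JoinKind.LeftOuter),

    // Bring the ID out of the nested table column
    Expanded = Table.ExpandTableColumn(Merged, "TB", {"TBID"})
in
    Expanded
```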