Hi All,
I am trying to find an efficient design pattern for a merge join in Power Query where the join is not a straightforward equality join.
I have a table containing sales data, and for each sale record I want to merge in the product cost from another file. The sales data comes from a CSV file and the product data comes from an Excel spreadsheet. The product costs can change over time, so the merge needs to match not only on product ID but also compare the sale date against the start and end dates in the product table.
An example of the two tables and the required output are below:
I saw a great video from Pragmatic Works https://www.youtube.com/watch?v=ckB93I3GpcQ&t=1s&ab_channel=PragmaticWorks which shows one solution: join on the product ID alone, then create a custom column on the joined table that checks the date range. This works, but the performance is terrible. These files are not particularly big, yet Power Query is loading gigabytes of data. I tried using Table.Buffer, which decreased the amount of loading, but the processing still takes forever.
There is another good post by RADACAD https://radacad.com/dates-between-merge-join-in-power-query in which you generate a list of dates for each range and then do a normal equality join. This works, but again it is not a solution that will scale to a large number of lookup records or large date ranges.
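For anyone unfamiliar with the RADACAD expand-the-ranges trick, here is a minimal sketch of the idea in Python rather than M, with made-up data (the SKUs, dates, and costs below are purely illustrative): every cost range is expanded into one key per day, so the range lookup becomes a plain equality lookup. The sketch also makes the scaling problem obvious, since memory grows with the total number of days covered.

```python
from datetime import date, timedelta

# Made-up sample data, not the poster's actual files.
product_costs = [
    # (sku, start, end, cost)
    (1, date(2020, 1, 1), date(2020, 1, 31), 10.0),
    (1, date(2020, 2, 1), date(2020, 2, 29), 12.0),
]
sales = [(1, date(2020, 1, 15)), (1, date(2020, 2, 10))]

# Expand every cost range into one key per day, then do a plain
# equality lookup -- the same idea as the dates-between merge.
lookup = {}
for sku, start, end, cost in product_costs:
    d = start
    while d <= end:
        lookup[(sku, d)] = cost
        d += timedelta(days=1)

result = [(sku, d, lookup.get((sku, d))) for sku, d in sales]
print(result)  # costs 10.0 and 12.0
```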
This seems like a pattern that must come up again and again in the wild so am hoping someone can provide some advice on how they implemented a decent solution.
Any help is much appreciated.
Thanks,
Alex
One way to speed up execution is to leverage the fact that the dates are sorted in the lookup list and in the sale date lists of the various products.
In this way, the lists are "traversed" only once.
This in M could be done with the List.Generate function.
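To make the single-pass idea concrete, here is a sketch in Python rather than M (an actual Power Query implementation would use List.Generate, as mentioned above; the data is made up). Because both the cost ranges and the sales are sorted by SKU and date, one pointer walks the cost list exactly once, giving O(n + m) work instead of scanning the whole lookup table for every sale.

```python
from datetime import date

# Cost ranges sorted by (sku, start date); sales sorted by (sku, date).
# Made-up illustrative data.
costs = [
    (1, date(2020, 1, 1), date(2020, 1, 31), 10.0),
    (1, date(2020, 2, 1), date(2020, 2, 29), 12.0),
]
sales = [(1, date(2020, 1, 15)), (1, date(2020, 2, 10))]

out, i = [], 0
for sku, d in sales:
    # Advance the cost pointer past ranges that can no longer match:
    # earlier SKUs, or same SKU with an end date before the sale date.
    while i < len(costs) and (costs[i][0] < sku or
                              (costs[i][0] == sku and costs[i][2] < d)):
        i += 1
    if i < len(costs) and costs[i][0] == sku and costs[i][1] <= d <= costs[i][2]:
        out.append((sku, d, costs[i][3]))
    else:
        out.append((sku, d, None))  # no cost range covers this sale
```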
What size are your tables? How many products, how many sales, and how many changes in cost?
I have made two queries: one based on a list approach and one based on a Table.SelectRows approach.
This is a mini-mockup which you can fit in your code.
These should improve performance. Let me know whether the improvement is acceptable or not.
The Excel mockup is uploaded to https://1drv.ms/x/s!Akd5y6ruJhvhuT0ZjSMjlq22TTQZ?e=AkZ7gr
List approach
let
    Source = Excel.CurrentWorkbook(){[Name="Sales"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Product SKU", Int64.Type}, {"SalesDate", type date}}),
    // Buffer the lookup columns once so they are not re-evaluated per row.
    StartDateList = List.Buffer(ProductCostLookup[Start Date]),
    EndDateList = List.Buffer(ProductCostLookup[End Date]),
    CostList = List.Buffer(ProductCostLookup[Cost]),
    SKUList = List.Buffer(ProductCostLookup[Product SKU]),
    // For each sale, find the position where the SKU matches and the
    // sale date falls within [Start Date, End Date], then return the
    // cost at that position.
    #"Added Custom" = Table.AddColumn(
        #"Changed Type",
        "Cost",
        each CostList{
            List.PositionOf(
                List.Transform(
                    List.Positions(StartDateList),
                    (i) => [SalesDate] >= StartDateList{i}
                        and [SalesDate] <= EndDateList{i}
                        and [Product SKU] = SKUList{i}),
                true)})
in
    #"Added Custom"
Table.SelectRows approach
let
    Source = Excel.CurrentWorkbook(){[Name="Sales"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Product SKU", Int64.Type}, {"SalesDate", type date}}),
    // Buffer the whole lookup table once so it is not re-read per row.
    BufferedTable = Table.Buffer(ProductCostLookup),
    // For each sale, keep the lookup rows whose SKU matches and whose
    // date range contains the sale date, then take the first row's cost.
    #"Added Custom" = Table.AddColumn(
        #"Changed Type",
        "Cost",
        each Table.SelectRows(
            BufferedTable,
            (x) => x[Start Date] <= [SalesDate]
                and x[End Date] >= [SalesDate]
                and x[Product SKU] = [Product SKU]){0}[Cost])
in
    #"Added Custom"
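Both M snippets above still scan the lookup rows linearly for every sale. If the data ever moves outside Power Query, a per-SKU binary search makes each lookup O(log n). Here is a sketch of that idea in Python with made-up data (`product_costs`, `by_sku`, and `lookup_cost` are all hypothetical names for illustration), assuming the cost ranges for a given SKU do not overlap.

```python
import bisect
from collections import defaultdict
from datetime import date

# Made-up sample data: per-SKU cost ranges, assumed non-overlapping.
product_costs = [
    (1, date(2020, 1, 1), date(2020, 1, 31), 10.0),
    (1, date(2020, 2, 1), date(2020, 2, 29), 12.0),
]

# Index: sku -> (sorted start dates, parallel list of (end, cost)).
by_sku = defaultdict(lambda: ([], []))
for sku, start, end, cost in sorted(product_costs):
    by_sku[sku][0].append(start)
    by_sku[sku][1].append((end, cost))

def lookup_cost(sku, sale_date):
    starts, rows = by_sku[sku]
    # Last range whose start date is <= the sale date.
    i = bisect.bisect_right(starts, sale_date) - 1
    if i >= 0 and sale_date <= rows[i][0]:
        return rows[i][1]
    return None  # no cost range covers this sale

print(lookup_cost(1, date(2020, 2, 10)))  # 12.0
```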
Thanks a lot for the code. I have implemented it, but it is still very slow.
Thanks for this idea, but it is still taking too much time.
I am running almost 10 queries, 7 of which are based on the one in which all these functions run.
Would doing the work in DAX where possible increase the speed?