I’m trying to reconcile data from two different entities regarding movement of shipping containers, by doing a bulletproof fuzzy match on dates. One is a shipping port, the other is a railway.
Here's a simplified sample file illustrating what I'm trying to do:
Match to closest day_20190722 buffer Revised.xlsx
I work at the port, and I want to ensure that the railway are charging us for the correct amout of containers they are moving to/from the port. But the dates in the two systems don’t always match for the reasons outlined below:
I need to do a "Fuzzy Date match" in order to reconcile one system to the other, that doesn’t inadvertently double count when containers are moved rapidly back and forth between the port and the customer multiple times. So I need to progressively increase the 'mismatch' tolerence on those dates, and remove matches at each pass leaving just the unmatched rows to do increasingly desperate matches on. (If I don't remove the matches at each subsequent pass, then inevitably some container movements get incorrectly matched to their next similar move).
I have a hard-coded query that does this works fine, but it is cumbersome. I wonder if it is possible to use something like List.Accumulate to make the query dynamic. But I have 2 seed tables, and need three output tables, so it might be a pipe dream.
Here's a screenshot of sample input and output. I've color coded the matching records from each table to the 'Matched Records' table below. (Edit: I've revised this as per reply to @ImkeF below)
My manual query performs the following steps:
Here’s how that looks in the query dependency view:
And here's how I do the 'Fuzzy Date Match": At each 'pass', my Day X Matched step takes the unmatched records from the two tables at the end of the previous Day X Matched step, and creates a 'Date Offset' column using this general approach:
= Table.AddColumn(Source, "Table2.Date", each List.Dates(Date.AddDays([Date],-X),2,#duration(2X,0,0,0)))
That X bit successively pads out the date in one table so that it will match dates 1, 2, ..., X days before or after.
All this requires a lot of different Queries. Here's the amount of queries required just to handle 5 days' worth of date offsets:
What I would like to do is use List.Accumulate or similar to make this dynamic. Here’s how that would look:
Uwe from the data-insights blog has a fantastic 3 part series on recursion in PQ, including great references to the usual suspects (Chris Webb, @ImkeF , and others). This shows some great uses of List.Accumulate and List.Generate. But working out whether these functions are flexible enough to do the above is beyond me.
Anybody feeling brave enough to take a stab, or even give me their thoughts as to whether this is possible?
Match to closest day_20190722 buffer Revised.xlsx
I think a "simpler" group and sort would do the job here as well:
1) Group by "ID" and "Direction"
2) Sort by "Date" and add Index column
3) Merge on ID, Direction and new Index column in FullOuter-mode: Matches and non-matches will be shown
See attached file
Imke Feldmann (The BIccountant)
If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!
How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries
Hi @ImkeF
Unfortunately that approach isn’t robust under all circumstances, as it will always match an entry in one table with a previous entry in the other table even if a closer match is available. And this is a very real possibility in this dataset.
Revised sample file and screenshot: Match to closest day_20190722 buffer Imke.xlsx
Hi @JeffWeir ,
sorry, I lost track of this thread a bit.
Is this still an issue?
Cheers, Imke
Imke Feldmann (The BIccountant)
If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!
How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries
My hard-coded query works fine, but I'm trying to find the time to see if I can learn how to write a dynamic function where I can specify the number of days to check either side. At the moment I progressively match the tables on larger values of a 'tolerance' factor X using this condition:
Table1.Date = Table2.Date +/- X .
And I'm effectiviely running this in a loop for values of X between 1 to 5, and removing successful matches on each pass leaving just unsuccessful ones. This lets me progressively increase the 'mismatch' tolerence on those dates, and remove matches at each pass leaving just the unmatched rows to do increasingly desperate matches on.
My query performs the following steps:
The query dependency tree that results looks like so:
All that takes a lot of setup. I really need to work out if this can be turned into a function that is a) dynamic and b) efficient.
I still mean to take a crack at it, but it might be beyond my beginning/intermediate M.
For what it's worth, here's my current manual approach.
Hi @JeffWeir ,
although a dynamic recursive approach (using List.Accumulate or List.Generate) would work here, that would still be a considerable amount of code (and work for me) and I'm not sure if it would perform faster than the following approach:
let Source = Table1, AllowedRange = Table.AddColumn(Source, "Days", each {-5..5}), ExpandAllowedRange = Table.ExpandListColumn(AllowedRange, "Days"), DaysAbsoluteFigures = Table.AddColumn(ExpandAllowedRange, "AbsDays", each Number.Abs([Days])), AllowedDates = Table.AddColumn(DaysAbsoluteFigures, "Dates", each Date.AddDays([Date], [Days])), #"Changed Type" = Table.TransformColumnTypes(AllowedDates,{{"Dates", type date}}), MergeTable2 = Table.NestedJoin(#"Changed Type", {"Dates", "ID", "Direction"}, Table2, {"Date", "ID", "Direction"}, "Table2", JoinKind.Inner), Cleanup = Table.RemoveColumns(MergeTable2,{"Table2", "Dates"}), #"Sorted Rows" = Table.Buffer(Table.Sort(Cleanup,{{"AbsDays", Order.Ascending}})), #"Removed Duplicates" = Table.Distinct(#"Sorted Rows", {"Date", "ID", "Direction"}) in #"Removed Duplicates"
This produces the "Matched Records"-table. Just do some Anti-Joins to retrieve the unmatched tables from there.
If performance is too bad, please come back.
Imke Feldmann (The BIccountant)
If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!
How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries
@ImkeFThat Sorted Rows > Remove Duplicates step is clever.
I'll tweak my code to use this approach and see if there is a performance improvement. I still need to do lots of different passes, because i want this to be a 'progressive' match, so it may not simplify the query chain overly. But it will be interesting to see if there's any performance difference. Thanks 🙂
Hi @JeffWeir
this apprach was supposed to replace your whole chain. I've tweaked to code a little bit to come up with the correct Table2, but now it delivers all you need. No need at all to go through all these iterations. (results are shown in row 42 -57 of Sheet1)
Please check the results in the attached file.
Imke Feldmann (The BIccountant)
If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!
How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries
Hi @JeffWeir
have you had a chance to test this out on your real data?
I'd be interested to hear how this scales on a large dataset.
Please mark my answer as solution if it solved your case - thanks.
Imke Feldmann (The BIccountant)
If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!
How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries