The ultimate Microsoft Fabric, Power BI, Azure AI, and SQL learning event: Join us in Stockholm, September 24-27, 2024.
Save €200 with code MSCUST on top of early bird pricing!
Find everything you need to get certified on Fabric—skills challenges, live sessions, exam prep, role guidance, and more. Get started
I'm seeing something wrong with the fuzzy match scoring on some of the matches and hoping someone can help me understand why this is occuring. I have two tables. My first table called "AS Employee ID" has around 500 rows with two columns called "Request ID" and "AssetSteward Leader Name". The asset steward name column is text and is a free form field that holds a person's name. Unfortunately, the user can enter the name any number of ways...first name last name or last name comma first name, etc...
My 2nd table named "Employee" has over 35,000 records and each row is an employee record. There is a text column called "EmpName". All names are entered Last Name then a comma with a space followed by the first name. Sometimes there may be another space and a middle initial or middle name on the end. Not always though.
I've merged the tables using a simple outer fuzzy join which works great on most matches. However, there are some that just don't make any sense. An example is the match for an asset steward named "Sayers, John". The merge details and results are shown below. I don't understand why it is scoring what should be the closest match (row 7) with a score of 0.83. Row 7 is almost an identical match minus the space and an "O" MI on the end. What is going on with this? Ultimately, I want to limit the matches to one but that gives me "Waters, John" instead of "Sayers, John O" which obviously isn't correct. The score just doesn't seem to make any sense based on what I've read. It seems to be only scoring the first name instead of the entire string and disregarding the last name completely. Thank you for the help! This is going to drive me to drink.
= Table.FuzzyNestedJoin(#"Expanded Employee", {"AssetSteward Leader Name"}, Employee, {"EmpName"}, "Employee", JoinKind.LeftOuter, [IgnoreCase=false, IgnoreSpace=false, NumberOfMatches=20, Threshold=.65, SimilarityColumnName="Score"])
Solved! Go to Solution.
Actually documentation is clear that "The best scenario for applying the fuzzy match algorithm is when all text strings in a column contain only the strings that need to be compared and no extra components. For example, comparing Apples against 4ppl3s yields higher similarity scores than comparing Apples to My favorite fruit, by far, is Apples. I simply love them!. Because the word Apples in the second string is only a small part of the whole text string, that comparison yields a lower similarity score."
https://docs.microsoft.com/en-us/power-query/fuzzy-matching
Hence, to get the right match, you need to have a single word. Hence, you need to match SayersJohn not Sayers, John.
Just have one intermediate step. In first table i.e. where you want match insert one more column which should have following formula (please replace Name with your column name)
= Text.Replace([Name],", ","")
Now, you should perform fuzzy matching on this column to get right result. (You need not make this change in lookup table).
After fuzzy matching, you can delete this column from your result table.
Problem is that you applied the formula on both the columns. I specifically wrote this that you need to apply this only in source column. You are supposed to apply this formula only in AssetSteward Leader Name column. Don't apply this formula on Emp Name column.
Then you would get right result.
I have demonstrated this in this file where I have shown merge as you were originally doing, then merge where I applied the formula on both source and target columns and merge where I applied the formula on source column only - https://1drv.ms/u/s!Akd5y6ruJhvhuW0f_3h5TxwnKmX7?e=JzLdWS
Actually documentation is clear that "The best scenario for applying the fuzzy match algorithm is when all text strings in a column contain only the strings that need to be compared and no extra components. For example, comparing Apples against 4ppl3s yields higher similarity scores than comparing Apples to My favorite fruit, by far, is Apples. I simply love them!. Because the word Apples in the second string is only a small part of the whole text string, that comparison yields a lower similarity score."
https://docs.microsoft.com/en-us/power-query/fuzzy-matching
Hence, to get the right match, you need to have a single word. Hence, you need to match SayersJohn not Sayers, John.
Just have one intermediate step. In first table i.e. where you want match insert one more column which should have following formula (please replace Name with your column name)
= Text.Replace([Name],", ","")
Now, you should perform fuzzy matching on this column to get right result. (You need not make this change in lookup table).
After fuzzy matching, you can delete this column from your result table.
Thanks for the guidance. I actually added the space in one of my attempts to get more accurate matching. I did as you instructed, removed it and it still gives me the same results. It really looks like it is only matching the string after the comma and giving no consideration to the string before the comma. Forgive me but I don't see ANY scenario where "Hayes, John" should be scored higher than "Sayers, John" when comparing both to "Sayers, John O". With or without a space. "Sayers, John" is an EXACT match minus two characters (a space and an "O") on the end. Why in the world would "Hayes, John" ever be considered a closer match?
Any other ideas to try?
Here is what I get when removing the space after the comma.
Problem is that you applied the formula on both the columns. I specifically wrote this that you need to apply this only in source column. You are supposed to apply this formula only in AssetSteward Leader Name column. Don't apply this formula on Emp Name column.
Then you would get right result.
I have demonstrated this in this file where I have shown merge as you were originally doing, then merge where I applied the formula on both source and target columns and merge where I applied the formula on source column only - https://1drv.ms/u/s!Akd5y6ruJhvhuW0f_3h5TxwnKmX7?e=JzLdWS
OK. I went back and removed the comma from the Asset Steward Name coulmn only and performed the fuzzy merge and it worked! I don't understand why but I am just going to take it and move on. Thanks for the help!
Join the community in Stockholm for expert Microsoft Fabric learning including a very exciting keynote from Arun Ulag, Corporate Vice President, Azure Data.
Check out the August 2024 Power BI update to learn about new features.
Learn from experts, get hands-on experience, and win awesome prizes.
User | Count |
---|---|
114 | |
79 | |
78 | |
47 | |
39 |
User | Count |
---|---|
147 | |
115 | |
65 | |
64 | |
53 |