Join us at FabCon Atlanta from March 16 - 20, 2026, for the ultimate Fabric, Power BI, AI and SQL community-led event. Save $200 with code FABCOMM.
Register now!Calling all Data Engineers! Fabric Data Engineer (Exam DP-700) live sessions are back! Starting October 16th. Sign up.
@colinlewissmith @Amaro @Vijay_A_Verma
Hi All,
I've single table (Employee) to perform Fuzzy Merge/Match on Employee_Name within department_id, so that I can see matching / duplicate employee names side by side within same department.
Here is the table structure.
How to perform exatch match on 'Department_Id' and fuzzy match on Employee_Name?
Best Regards,
Vinod
Solved! Go to Solution.
So I understood correctly that you have 100K employees.
But he size of the departments (companies) is 10 times as large as I assumed.
So I tested with 100K employees in 288 departments with on average around 350 emplyees each. I used a normal ditribution for the number of employees per department so that 90% is between 300 and 400 employees.
And I have some good news for you!
It is even quicker! It is now about 1 minute on my laptop.
So the performace of the FuzzyJoin is not lineair with the number of comparisons it needs to make, but gets relatively more efficient with a bigger dataset.
Here the final query:
let
Source = #"Employee Data",
#"Grouped Rows" = Table.Group(Source, {"Department_Id"}, {{"Employees", each _, type table [Department_Id=nullable number, Employee_Id=nullable number, Employee_Name=nullable text]}}),
#"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each // the following executes for each row.
let // to break up a otherwise complex fromula (does not impact perfomance)
// load this department employees in memory
employee_buffer = [Employees], // Table.Buffer([Employees]), // tried buffering, but did not make any difference
// Create a table with matching employee names
Matched = Table.FuzzyNestedJoin(employee_buffer, {"Employee_Name"}, employee_buffer, {"Employee_Name"}, "Fuzzy Matched Employees", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=false, Threshold=0.9, NumberOfMatches=1]),
// Add the Employee Id's and Employee Names of the matches
expanded = Table.ExpandTableColumn(Matched, "Fuzzy Matched Employees", {"Department_Id", "Employee_Id", "Employee_Name"}, {"Fuzzy Matched Employees.Department_Id", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"})
in
// Select only the employees not matching themselves and return the outcome
Table.SelectRows(expanded, each [Fuzzy Matched Employees.Employee_Id] <> [Employee_Id])
),
// Get the Employee_id and Employee Names into the departments table
#"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom", "Custom", {"Employee_Id", "Employee_Name", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"}, {"Employee_Id", "Employee_Name", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"}),
#"Filtered Rows" = Table.SelectRows(#"Expanded Custom", each [Employee_Id] <> null)
in
#"Filtered Rows"
Please mark this as solution if you can use it (even if it only helps you to decide NOT to use PowerQuery for the fuzzy matching...)
At my laptop now...
Starting with
Here the M-Code (all but the final filter condition completly done with the UI
let
Source = Excel.CurrentWorkbook(){[Name="Employee"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Department_Id", Int64.Type}, {"Employee_Id", Int64.Type}, {"Employee_Name", type text}}),
#"Merged Queries1" = Table.FuzzyNestedJoin(#"Changed Type", {"Employee_Name"}, #"Changed Type", {"Employee_Name"}, "Fuzzy Matched Employees", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=false, Threshold=0.2]),
#"Expanded Fuzzy Matched Employees" = Table.ExpandTableColumn(#"Merged Queries1", "Fuzzy Matched Employees", {"Department_Id", "Employee_Id", "Employee_Name"}, {"Fuzzy Matched Employees.Department_Id", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"}),
// Result includes matches accross depratments and each employee always matches itself.
// Select all rows where De Department is the same as the matched departement and the empoyee id differes from the machted employee
#"Filtered Rows" = Table.SelectRows(#"Expanded Fuzzy Matched Employees", each [Fuzzy Matched Employees.Department_Id] = [Department_Id] and [Fuzzy Matched Employees.Employee_Id] <> [Employee_Id])
in
#"Filtered Rows"
Resulting in only 3 matches found.
play around with the Threshold of the FuzzyNestedJoin to get more or fewer matches....
Hi @vinod_excel ,
Power Query does not support exact and fuzzy matching at the same time. You must either perform exact matching on both columns or perform fuzzy matching on both columns.
Please try the solutions provided by @Omid_Motamedise and @PwerQueryKees , If it works, please accept their response as the solution. If this doesn't solve your problem, please provide some sample data and the corresponding expected results so that we can try to find a workaround for you.
Best Regards,
Dino Tao
To do this, you need to apply the fuzzy merge based on Employee_Name only, then in the resulted rows, filter (by using Table.SelectRows) the resulted rows with the same Department_Id.
If this answer helped resolve your issue, please consider marking it as the accepted answer. And if you found my response helpful, I'd appreciate it if you could give me kudos.
Thank you!
Thanks @Omid_Motamedise
I was thinking about same solution but my data is huge (100K+ rows) and concenred about performance issue with fuzzy merge based on Employee_Name only but not by Department_Id wise.
Step#1: Initial process takes time to merge across all departments
Step#2: Each row comparison to check for department ids are same or not
Step#3: Filterout rows where department ids are not same
Step#4: Final output filter out rows with same employee id within each row
Also I'm expecting results to match with Excel Fuzzy Lookup Add-In output (department_id - exacth match, employee_name - default).
Best Regards,
Vinod
At my laptop again.
I took a different approach now.
I tested it with a 100000 row Employee table.
This loads at about 60 rows per second. My test datset has 25 people in each department. Making departments larger or smaller may give different results.
My original query loaded a 6 rows/sec. So about a 10-fold speed improvement.
Here the M Code.
let
Source = #"Employee Data",
#"Grouped Rows" = Table.Group(Source, {"Department_Id"}, {{"Employees", each _, type table [Department_Id=nullable number, Employee_Id=nullable number, Employee_Name=nullable text]}}),
#"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each // the following executes for each row.
let // to break up a otherwise complex fromula (does not impact perfomance)
// Create a table with matching employee names
Matched = Table.FuzzyNestedJoin([Employees], {"Employee_Name"}, [Employees], {"Employee_Name"}, "Fuzzy Matched Employees", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=false, Threshold=0.2]),
// Add the Employee Id's and Employee Names of the matches
expanded = Table.ExpandTableColumn(Matched, "Fuzzy Matched Employees", {"Department_Id", "Employee_Id", "Employee_Name"}, {"Fuzzy Matched Employees.Department_Id", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"})
in
// Select only the employees not matching themselves and return the outcome
Table.SelectRows(expanded, each [Fuzzy Matched Employees.Employee_Id] <> [Employee_Id])
),
// Get the Employee_id and Employee Names into the departments table
#"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom", "Custom", {"Employee_Id", "Employee_Name", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"}, {"Employee_Id", "Employee_Name", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"}),
#"Filtered Rows" = Table.SelectRows(#"Expanded Custom", each [Employee_Id] <> null)
in
#"Filtered Rows"
Total load time for 8069 matches was 2:35. So on average 52 matches per second.
Thanks @PwerQueryKees
2:35 minutes is huge time for 8K matches. Mine is 80K+ records so it would be 10 times to process and the data keep on growing. So in current volume it takes around 20+ minutes.
EDIT: Halving the number of matches is next to impossible because you can't controll the FuzzyNestedJoin() to consider only a part of the records.
I understood from your ealier message that your total datset was 100K rows. So 8K matches seemed reasonable.
If you have 80K duplicate names out of 100K this means (almost) every employee has a duplicate.
Anyway...
The speed of my solution scale lineair with the number of departments and quadratic with the (average) number of employees per department. The solution @Omid_Motamedise gave you is quadratic with the total number of employees.
So Omid's appraoch requires 100K * 100K = 10 miljon matches,
My approach requires 4000 * 25 * 25 = 4K * 225 = 900K matches
And keep in mind I tested on my laptop. Your hardware maybe be more capable ...
You could possible cut the times in half again if you do every match only once. So if you match E1 with E2 you do not need to match E2 to E1. So a bit over a minute.
I would add an index to each departments employee table and only match each employees name with the employees having a higher index.
If you want the matchin name to appear on both employees, you would need to do that at the end, adding some additional time...
No time at the moment to work that one out for you...
But if
I could take a swing at it. Although I would be really surprised (and proud) to get it below 1 minute 15 secs.
@PwerQueryKees Thanks again for taking time to elaborate the solution.
Your suspect is not correct, I've employee names but in place of department_id I've company_ids.
I need to deal with 300+ company_ids and each company may have multiple employee_ids created for same employee with some text change in their name from time to time...The change may be at positions of first name, last name, middle name, first & last name combined without space or given with comma, earlier entry might be having achronym but latest name of same employee name provided with fullly abrivated, etc...
So I need to check for duplicate employee names within company_id and remap latest employee_id with old employee_id or vice versa.
Hope I made it clear with details that I was looking.
That clarifies your use case somewhat. So 300 companies. How many employees per compny on average? And are they very much of similar size or is there a lot of variation? What is the range of sizes for 90% of you companies? (So, ignoring the extremes?)
Keep in mind that 2.5 minutes to match a total of 100K employees is not bad. Even if you need to do this daily.
Let me know if you are still interested in a solution or maybe the soltion I gave you is satisfactory? Please mark it as solution in that case....
It's hard to say average employees in each company, as it will not be consistent always as data processing/reconsilation happens from multiple data sources. If I've to say for sure then consider the average would be 600 to 700+ employees.. some times it will be 2 digit in few companies.
Still I'm not confident on 2.5 minutes processing time for 100K employees.
So I understood correctly that you have 100K employees.
But he size of the departments (companies) is 10 times as large as I assumed.
So I tested with 100K employees in 288 departments with on average around 350 emplyees each. I used a normal ditribution for the number of employees per department so that 90% is between 300 and 400 employees.
And I have some good news for you!
It is even quicker! It is now about 1 minute on my laptop.
So the performace of the FuzzyJoin is not lineair with the number of comparisons it needs to make, but gets relatively more efficient with a bigger dataset.
Here the final query:
let
Source = #"Employee Data",
#"Grouped Rows" = Table.Group(Source, {"Department_Id"}, {{"Employees", each _, type table [Department_Id=nullable number, Employee_Id=nullable number, Employee_Name=nullable text]}}),
#"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each // the following executes for each row.
let // to break up a otherwise complex fromula (does not impact perfomance)
// load this department employees in memory
employee_buffer = [Employees], // Table.Buffer([Employees]), // tried buffering, but did not make any difference
// Create a table with matching employee names
Matched = Table.FuzzyNestedJoin(employee_buffer, {"Employee_Name"}, employee_buffer, {"Employee_Name"}, "Fuzzy Matched Employees", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=false, Threshold=0.9, NumberOfMatches=1]),
// Add the Employee Id's and Employee Names of the matches
expanded = Table.ExpandTableColumn(Matched, "Fuzzy Matched Employees", {"Department_Id", "Employee_Id", "Employee_Name"}, {"Fuzzy Matched Employees.Department_Id", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"})
in
// Select only the employees not matching themselves and return the outcome
Table.SelectRows(expanded, each [Fuzzy Matched Employees.Employee_Id] <> [Employee_Id])
),
// Get the Employee_id and Employee Names into the departments table
#"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom", "Custom", {"Employee_Id", "Employee_Name", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"}, {"Employee_Id", "Employee_Name", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"}),
#"Filtered Rows" = Table.SelectRows(#"Expanded Custom", each [Employee_Id] <> null)
in
#"Filtered Rows"
Please mark this as solution if you can use it (even if it only helps you to decide NOT to use PowerQuery for the fuzzy matching...)
Thanks @PwerQueryKees
This solution could be close match but not exactly matching results to Excel Fuzzy Lookup Add-In Results (as I was looking for addin solution to be replaced).
Sorry for the late response, I was away for couple of days from my work.
I tried to use the above code copied into 'Advanced Editor' but its been giving multiple errors i.e., starting from 'Added Custom' onwards.
The code itself is solid: I tested it, but it may not match the context you are running it in.
The code assumes a query named #"Employee Data" with the structure as follows below.
Adapt the code to match your own table name and field names as neccesary.
If that is not the problem, then share the error messages.
Thanks @PwerQueryKees
I'm not sure how to attach a file. Hence attaching a screenshot with sample data.
Input sheet:
Excel Fuzzy Lookup Add-in configuration (Looking for simlar configurations in Power query i.e., Exact Match = Department ID, Fuzzy Match = Employee Name)
Ouput columns expected:
And I forgot to mention: My (artificial) test data produced 3063 matches, with Threshold=0.9 and NumberOfMatches=1 (you always get only 1 match per employee).
When dealing with large datasets, using fuzzy merge can become time-consuming. Before opting for fuzzy merge, consider extracting the similar parts directly, as this may simplify the process and solve the problem in one step. If possible, please provide a sample of the data so I can offer more tailored assistance.
Additionally, you can use Table.Buffer to improve filtering performance, though its effectiveness also depends on your data structure.
Could you please be more detailed what exactly you mean "consider extracting the similar parts directly, as this may simplify the process"?
Are you saying run the process in batch wise for group of dempartment_ids?
Best Regards,
Vinod
Copy the following code and past it in advance editor.
let
Table1 = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WilDSUTI01DUEUUqxOnABIyBlBBaIRKgwBgtEIVSYKMXGAgA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Department_Id = _t, Employee_Name = _t, Column1 = _t]),
Table2 = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WilDSUTI0NFSK1YGxjcDsSCTxKJh4LAA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Department_Id = _t, Employee_Name = _t]),
#"Merged Queries" = Table.FuzzyNestedJoin(Table2, {"Employee_Name"}, Table1, {"Employee_Name"}, "Table2", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true]),
#"Added Custom" = Table.AddColumn(#"Merged Queries", "Custom", each Table.SelectRows([Table2], (x)=> x[Department_Id]=[Department_Id]))
in
#"Added Custom"
This is the first solution I mentino.
If this answer helped resolve your issue, please consider marking it as the accepted answer. And if you found my response helpful, I'd appreciate it if you could give me kudos. Thank you!