Solved: Fuzzy Merge - With Two Columns (one with exacth ma...

vinod_excel · ‎10-29-2024

Hi All,

I've single table (Employee) to perform Fuzzy Merge/Match on Employee_Name within department_id, so that I can see matching / duplicate employee names side by side within same department.

Here is the table structure.

Department_Id
Employee_Id
Employee_Name

How to perform exatch match on 'Department_Id' and fuzzy match on Employee_Name?

Best Regards,

Vinod

PwerQueryKees · ‎11-09-2024

So I understood correctly that you have 100K employees.
But he size of the departments (companies) is 10 times as large as I assumed.

So I tested with 100K employees in 288 departments with on average around 350 emplyees each. I used a normal ditribution for the number of employees per department so that 90% is between 300 and 400 employees.

And I have some good news for you!

It is even quicker! It is now about 1 minute on my laptop.

So the performace of the FuzzyJoin is not lineair with the number of comparisons it needs to make, but gets relatively more efficient with a bigger dataset.

Here the final query:

let
    Source = #"Employee Data",
    #"Grouped Rows" = Table.Group(Source, {"Department_Id"}, {{"Employees", each _, type table [Department_Id=nullable number, Employee_Id=nullable number, Employee_Name=nullable text]}}),
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each // the following executes for each row.
            let // to break up a otherwise complex fromula (does not impact perfomance) 
                // load this department employees in memory
                employee_buffer = [Employees], // Table.Buffer([Employees]),  // tried buffering, but did not make any difference
                // Create a table with matching employee names
                Matched = Table.FuzzyNestedJoin(employee_buffer, {"Employee_Name"}, employee_buffer, {"Employee_Name"}, "Fuzzy Matched Employees", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=false, Threshold=0.9, NumberOfMatches=1]),
                // Add the Employee Id's and Employee Names of the matches
                expanded = Table.ExpandTableColumn(Matched, "Fuzzy Matched Employees", {"Department_Id", "Employee_Id", "Employee_Name"}, {"Fuzzy Matched Employees.Department_Id", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"})
            in
                // Select only the employees not matching themselves and return the outcome
                Table.SelectRows(expanded, each [Fuzzy Matched Employees.Employee_Id] <> [Employee_Id])
        ),
    // Get the Employee_id and Employee Names into the departments table
    #"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom", "Custom", {"Employee_Id", "Employee_Name", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"}, {"Employee_Id", "Employee_Name", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Custom", each [Employee_Id] <> null)
in
    #"Filtered Rows"

Please mark this as solution if you can use it (even if it only helps you to decide NOT to use PowerQuery for the fuzzy matching...)

View solution in original post

PwerQueryKees · ‎10-30-2024

At my laptop now...

Starting with

Here the M-Code (all but the final filter condition completly done with the UI

let
    Source = Excel.CurrentWorkbook(){[Name="Employee"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Department_Id", Int64.Type}, {"Employee_Id", Int64.Type}, {"Employee_Name", type text}}),
    #"Merged Queries1" = Table.FuzzyNestedJoin(#"Changed Type", {"Employee_Name"}, #"Changed Type", {"Employee_Name"}, "Fuzzy Matched Employees", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=false, Threshold=0.2]),
    #"Expanded Fuzzy Matched Employees" = Table.ExpandTableColumn(#"Merged Queries1", "Fuzzy Matched Employees", {"Department_Id", "Employee_Id", "Employee_Name"}, {"Fuzzy Matched Employees.Department_Id", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"}),

    // Result includes matches accross depratments and each employee always matches itself. 
    // Select all rows where De Department is the same as the matched departement and the empoyee id differes from the machted employee
    #"Filtered Rows" = Table.SelectRows(#"Expanded Fuzzy Matched Employees", each [Fuzzy Matched Employees.Department_Id] = [Department_Id] and [Fuzzy Matched Employees.Employee_Id] <> [Employee_Id])
in
    #"Filtered Rows"

Resulting in only 3 matches found.

play around with the Threshold of the FuzzyNestedJoin to get more or fewer matches....

Anonymous · ‎10-29-2024

Hi @vinod_excel ,

Power Query does not support exact and fuzzy matching at the same time. You must either perform exact matching on both columns or perform fuzzy matching on both columns.
Please try the solutions provided by @Omid_Motamedise and @PwerQueryKees , If it works, please accept their response as the solution. If this doesn't solve your problem, please provide some sample data and the corresponding expected results so that we can try to find a workaround for you.

Best Regards,
Dino Tao

Omid_Motamedise · ‎10-29-2024

To do this, you need to apply the fuzzy merge based on Employee_Name only, then in the resulted rows, filter (by using Table.SelectRows) the resulted rows with the same Department_Id.

If this answer helped resolve your issue, please consider marking it as the accepted answer. And if you found my response helpful, I'd appreciate it if you could give me kudos.
Thank you!

If my answer helped solve your issue, please consider marking it as the accepted solution.

vinod_excel · ‎10-30-2024

Thanks @Omid_Motamedise

I was thinking about same solution but my data is huge (100K+ rows) and concenred about performance issue with fuzzy merge based on Employee_Name only but not by Department_Id wise.

Step#1: Initial process takes time to merge across all departments

Step#2: Each row comparison to check for department ids are same or not

Step#3: Filterout rows where department ids are not same

Step#4: Final output filter out rows with same employee id within each row

Also I'm expecting results to match with Excel Fuzzy Lookup Add-In output (department_id - exacth match, employee_name - default).

Best Regards,

Vinod

PwerQueryKees · ‎11-01-2024

At my laptop again.

I took a different approach now.

I first group the data in Employees by Department_ID with "keep all rows"
This produces a Column Employees
Then I create a new colum containing the fuzzy match of all employees of that department only
I then in each fuzzy matched table i remove all matches of employees to themselves
Finaly I remove all EDepartments with any matches

I tested it with a 100000 row Employee table.
This loads at about 60 rows per second. My test datset has 25 people in each department. Making departments larger or smaller may give different results.

My original query loaded a 6 rows/sec. So about a 10-fold speed improvement.

Here the M Code.

let
    Source = #"Employee Data",
    #"Grouped Rows" = Table.Group(Source, {"Department_Id"}, {{"Employees", each _, type table [Department_Id=nullable number, Employee_Id=nullable number, Employee_Name=nullable text]}}),
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each // the following executes for each row.
            let // to break up a otherwise complex fromula (does not impact perfomance) 
                // Create a table with matching employee names
                Matched = Table.FuzzyNestedJoin([Employees], {"Employee_Name"}, [Employees], {"Employee_Name"}, "Fuzzy Matched Employees", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=false, Threshold=0.2]),
                // Add the Employee Id's and Employee Names of the matches
                expanded = Table.ExpandTableColumn(Matched, "Fuzzy Matched Employees", {"Department_Id", "Employee_Id", "Employee_Name"}, {"Fuzzy Matched Employees.Department_Id", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"})
            in
                // Select only the employees not matching themselves and return the outcome
                Table.SelectRows(expanded, each [Fuzzy Matched Employees.Employee_Id] <> [Employee_Id])
        ),
    // Get the Employee_id and Employee Names into the departments table
    #"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom", "Custom", {"Employee_Id", "Employee_Name", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"}, {"Employee_Id", "Employee_Name", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Custom", each [Employee_Id] <> null)
in
    #"Filtered Rows"

PwerQueryKees · ‎11-01-2024

Total load time for 8069 matches was 2:35. So on average 52 matches per second.

vinod_excel · ‎11-05-2024

Thanks @PwerQueryKees

2:35 minutes is huge time for 8K matches. Mine is 80K+ records so it would be 10 times to process and the data keep on growing. So in current volume it takes around 20+ minutes.

PwerQueryKees · ‎11-06-2024

EDIT: Halving the number of matches is next to impossible because you can't controll the FuzzyNestedJoin() to consider only a part of the records.

I understood from your ealier message that your total datset was 100K rows. So 8K matches seemed reasonable.

If you have 80K duplicate names out of 100K this means (almost) every employee has a duplicate.

Anyway...

The speed of my solution scale lineair with the number of departments and quadratic with the (average) number of employees per department. The solution @Omid_Motamedise gave you is quadratic with the total number of employees.

So Omid's appraoch requires 100K * 100K = 10 miljon matches,

My approach requires 4000 * 25 * 25 = 4K * 225 = 900K matches

And keep in mind I tested on my laptop. Your hardware maybe be more capable ...

You could possible cut the times in half again if you do every match only once. So if you match E1 with E2 you do not need to match E2 to E1. So a bit over a minute.

I would add an index to each departments employee table and only match each employees name with the employees having a higher index.
If you want the matchin name to appear on both employees, you would need to do that at the end, adding some additional time...

No time at the moment to work that one out for you...

But if

you specify the exact number of departments and
the average and stddeviation of the number of employees in each department (although I suspect your actual table does not contains employees 😁) and
you have a target elapse time for me,

I could take a swing at it. Although I would be really surprised (and proud) to get it below 1 minute 15 secs.

vinod_excel · ‎11-07-2024

@PwerQueryKees Thanks again for taking time to elaborate the solution.

Your suspect is not correct, I've employee names but in place of department_id I've company_ids.

I need to deal with 300+ company_ids and each company may have multiple employee_ids created for same employee with some text change in their name from time to time...The change may be at positions of first name, last name, middle name, first & last name combined without space or given with comma, earlier entry might be having achronym but latest name of same employee name provided with fullly abrivated, etc...

So I need to check for duplicate employee names within company_id and remap latest employee_id with old employee_id or vice versa.

Hope I made it clear with details that I was looking.

PwerQueryKees · ‎11-07-2024

That clarifies your use case somewhat. So 300 companies. How many employees per compny on average? And are they very much of similar size or is there a lot of variation? What is the range of sizes for 90% of you companies? (So, ignoring the extremes?)

Keep in mind that 2.5 minutes to match a total of 100K employees is not bad. Even if you need to do this daily.

Let me know if you are still interested in a solution or maybe the soltion I gave you is satisfactory? Please mark it as solution in that case....

vinod_excel · ‎11-08-2024

Hi @PwerQueryKees

It's hard to say average employees in each company, as it will not be consistent always as data processing/reconsilation happens from multiple data sources. If I've to say for sure then consider the average would be 600 to 700+ employees.. some times it will be 2 digit in few companies.

Still I'm not confident on 2.5 minutes processing time for 100K employees.

PwerQueryKees · ‎11-09-2024

So I understood correctly that you have 100K employees.
But he size of the departments (companies) is 10 times as large as I assumed.

So I tested with 100K employees in 288 departments with on average around 350 emplyees each. I used a normal ditribution for the number of employees per department so that 90% is between 300 and 400 employees.

And I have some good news for you!

It is even quicker! It is now about 1 minute on my laptop.

So the performace of the FuzzyJoin is not lineair with the number of comparisons it needs to make, but gets relatively more efficient with a bigger dataset.

Here the final query:

let
    Source = #"Employee Data",
    #"Grouped Rows" = Table.Group(Source, {"Department_Id"}, {{"Employees", each _, type table [Department_Id=nullable number, Employee_Id=nullable number, Employee_Name=nullable text]}}),
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "Custom", each // the following executes for each row.
            let // to break up a otherwise complex fromula (does not impact perfomance) 
                // load this department employees in memory
                employee_buffer = [Employees], // Table.Buffer([Employees]),  // tried buffering, but did not make any difference
                // Create a table with matching employee names
                Matched = Table.FuzzyNestedJoin(employee_buffer, {"Employee_Name"}, employee_buffer, {"Employee_Name"}, "Fuzzy Matched Employees", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=false, Threshold=0.9, NumberOfMatches=1]),
                // Add the Employee Id's and Employee Names of the matches
                expanded = Table.ExpandTableColumn(Matched, "Fuzzy Matched Employees", {"Department_Id", "Employee_Id", "Employee_Name"}, {"Fuzzy Matched Employees.Department_Id", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"})
            in
                // Select only the employees not matching themselves and return the outcome
                Table.SelectRows(expanded, each [Fuzzy Matched Employees.Employee_Id] <> [Employee_Id])
        ),
    // Get the Employee_id and Employee Names into the departments table
    #"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom", "Custom", {"Employee_Id", "Employee_Name", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"}, {"Employee_Id", "Employee_Name", "Fuzzy Matched Employees.Employee_Id", "Fuzzy Matched Employees.Employee_Name"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Custom", each [Employee_Id] <> null)
in
    #"Filtered Rows"

Please mark this as solution if you can use it (even if it only helps you to decide NOT to use PowerQuery for the fuzzy matching...)

vinod_excel · ‎11-29-2024

Thanks @PwerQueryKees

This solution could be close match but not exactly matching results to Excel Fuzzy Lookup Add-In Results (as I was looking for addin solution to be replaced).

vinod_excel · ‎11-19-2024

@PwerQueryKees

Sorry for the late response, I was away for couple of days from my work.

I tried to use the above code copied into 'Advanced Editor' but its been giving multiple errors i.e., starting from 'Added Custom' onwards.

PwerQueryKees · ‎11-19-2024

The code itself is solid: I tested it, but it may not match the context you are running it in.

The code assumes a query named #"Employee Data" with the structure as follows below.

Adapt the code to match your own table name and field names as neccesary.

If that is not the problem, then share the error messages.

vinod_excel · ‎11-27-2024

Thanks @PwerQueryKees

I'm not sure how to attach a file. Hence attaching a screenshot with sample data.

Input sheet:

Excel Fuzzy Lookup Add-in configuration (Looking for simlar configurations in Power query i.e., Exact Match = Department ID, Fuzzy Match = Employee Name)

Ouput columns expected:

PwerQueryKees · ‎11-09-2024

And I forgot to mention: My (artificial) test data produced 3063 matches, with Threshold=0.9 and NumberOfMatches=1 (you always get only 1 match per employee).

Omid_Motamedise · ‎10-30-2024

When dealing with large datasets, using fuzzy merge can become time-consuming. Before opting for fuzzy merge, consider extracting the similar parts directly, as this may simplify the process and solve the problem in one step. If possible, please provide a sample of the data so I can offer more tailored assistance.

Additionally, you can use Table.Buffer to improve filtering performance, though its effectiveness also depends on your data structure.

If my answer helped solve your issue, please consider marking it as the accepted solution.

vinod_excel · ‎10-31-2024

Hi @Omid_Motamedise

Could you please be more detailed what exactly you mean "consider extracting the similar parts directly, as this may simplify the process"?

Are you saying run the process in batch wise for group of dempartment_ids?

Best Regards,

Vinod

Omid_Motamedise · ‎11-05-2024

Copy the following code and past it in advance editor.

let
    Table1 = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WilDSUTI01DUEUUqxOnABIyBlBBaIRKgwBgtEIVSYKMXGAgA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Department_Id = _t, Employee_Name = _t, Column1 = _t]),
    Table2 = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WilDSUTI0NFSK1YGxjcDsSCTxKJh4LAA=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Department_Id = _t, Employee_Name = _t]),
    #"Merged Queries" = Table.FuzzyNestedJoin(Table2, {"Employee_Name"}, Table1, {"Employee_Name"}, "Table2", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true]),
    #"Added Custom" = Table.AddColumn(#"Merged Queries", "Custom", each Table.SelectRows([Table2], (x)=> x[Department_Id]=[Department_Id]))
    
in
    #"Added Custom"

This is the first solution I mentino.
If this answer helped resolve your issue, please consider marking it as the accepted answer. And if you found my response helpful, I'd appreciate it if you could give me kudos. Thank you!

If my answer helped solve your issue, please consider marking it as the accepted solution.