The ultimate Fabric, Power BI, SQL, and AI community-led learning event. Save €200 with code FABCOMM.
Get registeredCompete to become Power BI Data Viz World Champion! First round ends August 18th. Get started.
Hi I have a column called names with fuzzy duplicates and I am hoping to be able to group them together. I am aware of fuzzy matching in power bi but i do not have a list of correct names to merge the tables together.
For example:
name | group number |
Company Waterfall | 1 |
Company Waterfll | 1 |
Company Waterfalls | 1 |
Hello Centre | 0 |
A. R. Company | 2 |
AR Company | 2 |
Ideally those similar companies can be allocated an arbitrary group number or just any form of indication that separates each group.
I saw this potential solution from: https://www.mrexcel.com/board/threads/identify-fuzzy-matches-in-one-column.1211554/
but im not sure how i can do so in power bi! Appreciate any help possible 🙂 Thank you
Solved! Go to Solution.
Hi @kellyylx ,
Is the question solved? Try the following steps:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45Wcs7PLUjMq1QITyxJLUpLzMlRUIrVQRfOycEiCFRbDBb2SM3JyVdwTs0rKUoFCzjqKQTpKUBVQ0SCYFyg8bEA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [name = _t]),
#"Added Custom1" = Table.AddColumn(Source, "name2", each Text.Replace(Text.Replace([name],".","")," ",""),type text),
#"Merged Queries" = Table.FuzzyNestedJoin(#"Added Custom1", {"name2"}, #"Added Custom1", {"name2"}, "Source", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, Threshold=0.5]),
#"Removed Other Columns" = Table.Distinct(Table.ReplaceValue(Table.SelectColumns(#"Merged Queries",{"Source"}),each [Source], each Table.Sort([Source],{{"name", Order.Ascending}}),Replacer.ReplaceValue,{"Source"})),
#"Added Custom" = Table.AddColumn(#"Removed Other Columns", "count", each Table.RowCount([Source])),
#"Sorted Rows" = Table.Sort(#"Added Custom",{{"count", Order.Ascending}}),
#"Added Custom2" = Table.AddColumn(#"Sorted Rows", "number", each if [count] = 1 then 0 else 1),
#"Added Index" = Table.AddIndexColumn(#"Added Custom2", "Index", 1, 1, Int64.Type),
#"Added Custom3" = Table.AddColumn(#"Added Index", "group number", each List.Sum(List.FirstN(#"Added Custom2"[number],[Index]))),
#"Expanded Source" = Table.ExpandTableColumn(#"Added Custom3", "Source", {"name"}, {"name"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded Source",{"number", "Index"})
in
#"Removed Columns"
Best Regards,
Gao
Community Support Team
If there is any post helps, then please consider Accept it as the solution to help the other members find it more quickly.
If I misunderstand your needs or you still have problems on it, please feel free to let us know. Thanks a lot!
How to get your questions answered quickly -- How to provide sample data in the Power BI Forum
hi thanks so much for your help but is it possible to do something like if the row has no other duplicates, the row is in group 0. so group 0 are the unique rows and group 1 onwards are rows with duplicates?
You can check the count of items in each group and then bunch all the single items in a "Others" group but that would somehow defeat the purpose of the fuzzy match, no?
Hi, how can i go about counting the items in each group? I used group by index to count the number of rows for each index in another table but is unable to merge the 2 tables together.
this would be my ideal output:
name | group number | count |
Company Waterfall | 1 | 3 |
Company Waterfll | 1 | 3 |
Company Waterfalls | 1 | 3 |
Hello Centre | 0 | 1 |
A. R. Company | 2 | 2 |
AR Company | 2 | 2 |
Hi @kellyylx ,
Is the question solved? Try the following steps:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45Wcs7PLUjMq1QITyxJLUpLzMlRUIrVQRfOycEiCFRbDBb2SM3JyVdwTs0rKUoFCzjqKQTpKUBVQ0SCYFyg8bEA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [name = _t]),
#"Added Custom1" = Table.AddColumn(Source, "name2", each Text.Replace(Text.Replace([name],".","")," ",""),type text),
#"Merged Queries" = Table.FuzzyNestedJoin(#"Added Custom1", {"name2"}, #"Added Custom1", {"name2"}, "Source", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, Threshold=0.5]),
#"Removed Other Columns" = Table.Distinct(Table.ReplaceValue(Table.SelectColumns(#"Merged Queries",{"Source"}),each [Source], each Table.Sort([Source],{{"name", Order.Ascending}}),Replacer.ReplaceValue,{"Source"})),
#"Added Custom" = Table.AddColumn(#"Removed Other Columns", "count", each Table.RowCount([Source])),
#"Sorted Rows" = Table.Sort(#"Added Custom",{{"count", Order.Ascending}}),
#"Added Custom2" = Table.AddColumn(#"Sorted Rows", "number", each if [count] = 1 then 0 else 1),
#"Added Index" = Table.AddIndexColumn(#"Added Custom2", "Index", 1, 1, Int64.Type),
#"Added Custom3" = Table.AddColumn(#"Added Index", "group number", each List.Sum(List.FirstN(#"Added Custom2"[number],[Index]))),
#"Expanded Source" = Table.ExpandTableColumn(#"Added Custom3", "Source", {"name"}, {"name"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded Source",{"number", "Index"})
in
#"Removed Columns"
Best Regards,
Gao
Community Support Team
If there is any post helps, then please consider Accept it as the solution to help the other members find it more quickly.
If I misunderstand your needs or you still have problems on it, please feel free to let us know. Thanks a lot!
How to get your questions answered quickly -- How to provide sample data in the Power BI Forum
i have to filter out the potential duplicates thats why i wanted to have all the unique values in the same group.
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45Wcs7PLUjMq1QITyxJLUpLzMlRUIrVQRfOycEiCFRbDBb2SM3JyVdwTs0rKUoFCzjqKQTpKUBVQ0SCYFyg8bEA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [name = _t]),
#"Merged Queries" = Table.FuzzyNestedJoin(Source, {"name"}, Source, {"name"}, "Source", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, Threshold=0.5]),
#"Removed Other Columns" = Table.Distinct(Table.ReplaceValue(Table.SelectColumns(#"Merged Queries",{"Source"}),each [Source], each Table.Sort([Source],{{"name", Order.Ascending}}),Replacer.ReplaceValue,{"Source"})),
#"Added Index" = Table.AddIndexColumn(#"Removed Other Columns", "Index", 0, 1, Int64.Type),
#"Expanded Source" = Table.ExpandTableColumn(#"Added Index", "Source", {"name"}, {"name"})
in
#"Expanded Source"
How to use this code: Create a new Blank Query. Click on "Advanced Editor". Replace the code in the window with the code provided here. Click "Done". Once you examined the code, replace the Source step with your own source.
You will need to play with the threshold to get a better match.
#"Removed Other Columns" = Table.Distinct(Table.ReplaceValue(Table.SelectColumns(#"Merged Queries",{"Source"}),each [Source], each Table.Sort([Source],{{"name", Order.Ascending}}),Replacer.ReplaceValue,{"Source"})),
Hi can you explain to me what this line does and why we need to have it?
User | Count |
---|---|
83 | |
83 | |
37 | |
34 | |
32 |
User | Count |
---|---|
92 | |
79 | |
62 | |
53 | |
51 |