Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
DMZ7AAC
New Member

How to Easily Locate Duplicates in a Row After Merging Two Datasets ?

Good afternoon,

 

As the title suggests, I have combined two datasets based on a shared commonality (in this case account id #).

Now that the datasets are merged, I have columns from the first dataset and columns from the second dataset.

 

I am wanting to see whether I have duplicates in the rows after combining the datasets.

 

For example,

say dataset one has the following columns - Account_Id | $amount | state | city 

and dataset two has the following columns - Id | AMOUNT | state | county

then after merging the two based on Account_Id, I'm supposed to have the following table in query:

 

Account_Id | $amount| state | city | AMOUNT | state | county

 

Obviously, it's easy to see the duplicates in such a small array. But my dataset is very large.

 

Any suggestions?

 

Thank you!

5 REPLIES 5
edhans
Super User
Super User

One way is this:

  1. Add an Index column to the first table before you merge
  2. Do the merge and expansion
  3. Group by on the Index column with two aggregations:
    1. A Count Rows (the default on a group)
    2. add an All Rows aggregation.
  4. Filter that 3 column table to only show where the Count > 1.

Those are your dupes. You can then either expand the "All Rows" column to see the dupe, or just click on the field and it will show the duplications in the bottom of the Power Query window.

 

Of course delete all of the grouping and stuff after done - this is just for analysis.

 

You can quickly see if there are even any dupes by using the Transform Tab, Count Rows feature before and after the merge. If the record count is the same, there are no dupes. If it is different though you will know you have dupes but no clue where, so you can then do the above steps.

 

You could do a Distinct on the Index after the merge, but you cannot guarantee which records will be kept/removed. It will keep one and discard the others.



Did I answer your question? Mark my post as a solution!
Did my answers help arrive at a solution? Give it a kudos by clicking the Thumbs Up!

DAX is for Analysis. Power Query is for Data Modeling


Proud to be a Super User!

MCSA: BI Reporting

Thank you for your quick reply and suggestion!

I think I understand what you suggested, but I am not sure if it allows me to achieve my desired outcome; I think I might have not explained it properly (my bad).

 

When I merged the two data sets, I am expected to see duplicate columns. Like in the example in my first message, I will see something like Account_Id | $amount| state | city | AMOUNT | state | county . In this case the $amonut and AMOUNT columns are the duplicates, where I am expecting to see the same values in each row.

 

I want to find these duplicates ^.  Doing what you suggested, I got a count of '1' across the board, but I'm not sure how to intrepret it. I know that one dataset has 206 columns while the other 198. I'm trying to find the distinct ones and I expected to see 198 duplicates and 8 distinct. 

 

Hopefully I'm explaning this clearly (?)

 

Thanks again

I'm not sure how you could easily program that kind of duplicate detection. If you had Amounts and Amounts.1, which is what happens when you merge two tables with the same column names then it wouldn't be too hard. You could use the logic above once you convert the column names to a field. Table.ColumnNames() can do this.

 

For example, I expanded Date and DateKey from another table in to my Date table and it created duplicates with the .1 after them. Using Group By with the All Rows logic, I can see I have Date and Date.1

 

edhans_0-1707844980310.png

 

But to do that you need a common name - I used the Text.BeforeDelimiter() on the column to get everything before the period. But you cannot do that if your columns are $amount and AMOUNT unless it is consistently like that, and you can get the text after the $ sign, and then upper case all of them. Power Query is case sensitive, so Amount, AMOUNT, and amount are 3 different values. (Those are just one in DAX).

 



Did I answer your question? Mark my post as a solution!
Did my answers help arrive at a solution? Give it a kudos by clicking the Thumbs Up!

DAX is for Analysis. Power Query is for Data Modeling


Proud to be a Super User!

MCSA: BI Reporting

Thanks for clarifying that query is case sensitive. I am pretty new to it, and I have figured out it'd be a little complicated to locate/identify all the duplicates - given that they don't have the same column names. 

 

I guess a follow up question would be what does someone in this situation do in order to identify a large number of duplicates? Going over each and every column manually is not feasible. 

The key to a well performing model is few columns. The Vertipaq engine in Power BI loves very skinny, very tall tables. If you have too many columns to evaluate the issues with duplicates, then your main problem right now isn't duplicates. It is too many columns period.

I will spend hours (days!) ripping out columns out of a model to get more performance. The more columns you have the worse compression is, the larger your data dictionaries are, etc. I'm working with a model right now I dislike intensely because of how many columns the main fact table has, and it is only 24 columns. I really want to refactor it by doing some pivot operations and make the table taller and narrower, which I can do because of how the measures need to work, but I just need to find the time to do it.



Did I answer your question? Mark my post as a solution!
Did my answers help arrive at a solution? Give it a kudos by clicking the Thumbs Up!

DAX is for Analysis. Power Query is for Data Modeling


Proud to be a Super User!

MCSA: BI Reporting

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors
Top Kudoed Authors