Get certified in Microsoft Fabric—for free! For a limited time, the Microsoft Fabric Community team will be offering free DP-600 exam vouchers. Prepare now
Good afternoon,
As the title suggests, I have combined two datasets based on a shared commonality (in this case account id #).
Now that the datasets are merged, I have columns from the first dataset and columns from the second dataset.
I am wanting to see whether I have duplicates in the rows after combining the datasets.
For example,
say dataset one has the following columns - Account_Id | $amount | state | city
and dataset two has the following columns - Id | AMOUNT | state | county
then after merging the two based on Account_Id, I'm supposed to have the following table in query:
Account_Id | $amount| state | city | AMOUNT | state | county
Obviously, it's easy to see the duplicates in such a small array. But my dataset is very large.
Any suggestions?
Thank you!
One way is this:
Those are your dupes. You can then either expand the "All Rows" column to see the dupe, or just click on the field and it will show the duplications in the bottom of the Power Query window.
Of course delete all of the grouping and stuff after done - this is just for analysis.
You can quickly see if there are even any dupes by using the Transform Tab, Count Rows feature before and after the merge. If the record count is the same, there are no dupes. If it is different though you will know you have dupes but no clue where, so you can then do the above steps.
You could do a Distinct on the Index after the merge, but you cannot guarantee which records will be kept/removed. It will keep one and discard the others.
DAX is for Analysis. Power Query is for Data Modeling
Proud to be a Super User!
MCSA: BI ReportingThank you for your quick reply and suggestion!
I think I understand what you suggested, but I am not sure if it allows me to achieve my desired outcome; I think I might have not explained it properly (my bad).
When I merged the two data sets, I am expected to see duplicate columns. Like in the example in my first message, I will see something like Account_Id | $amount| state | city | AMOUNT | state | county . In this case the $amonut and AMOUNT columns are the duplicates, where I am expecting to see the same values in each row.
I want to find these duplicates ^. Doing what you suggested, I got a count of '1' across the board, but I'm not sure how to intrepret it. I know that one dataset has 206 columns while the other 198. I'm trying to find the distinct ones and I expected to see 198 duplicates and 8 distinct.
Hopefully I'm explaning this clearly (?)
Thanks again
I'm not sure how you could easily program that kind of duplicate detection. If you had Amounts and Amounts.1, which is what happens when you merge two tables with the same column names then it wouldn't be too hard. You could use the logic above once you convert the column names to a field. Table.ColumnNames() can do this.
For example, I expanded Date and DateKey from another table in to my Date table and it created duplicates with the .1 after them. Using Group By with the All Rows logic, I can see I have Date and Date.1
But to do that you need a common name - I used the Text.BeforeDelimiter() on the column to get everything before the period. But you cannot do that if your columns are $amount and AMOUNT unless it is consistently like that, and you can get the text after the $ sign, and then upper case all of them. Power Query is case sensitive, so Amount, AMOUNT, and amount are 3 different values. (Those are just one in DAX).
DAX is for Analysis. Power Query is for Data Modeling
Proud to be a Super User!
MCSA: BI ReportingThanks for clarifying that query is case sensitive. I am pretty new to it, and I have figured out it'd be a little complicated to locate/identify all the duplicates - given that they don't have the same column names.
I guess a follow up question would be what does someone in this situation do in order to identify a large number of duplicates? Going over each and every column manually is not feasible.
The key to a well performing model is few columns. The Vertipaq engine in Power BI loves very skinny, very tall tables. If you have too many columns to evaluate the issues with duplicates, then your main problem right now isn't duplicates. It is too many columns period.
I will spend hours (days!) ripping out columns out of a model to get more performance. The more columns you have the worse compression is, the larger your data dictionaries are, etc. I'm working with a model right now I dislike intensely because of how many columns the main fact table has, and it is only 24 columns. I really want to refactor it by doing some pivot operations and make the table taller and narrower, which I can do because of how the measures need to work, but I just need to find the time to do it.
DAX is for Analysis. Power Query is for Data Modeling
Proud to be a Super User!
MCSA: BI ReportingCheck out the October 2024 Power BI update to learn about new features.
Learn from experts, get hands-on experience, and win awesome prizes.
User | Count |
---|---|
67 | |
61 | |
23 | |
17 | |
12 |