March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early bird discount ends December 31.
Register NowBe one of the first to start using Fabric Databases. View on-demand sessions with database experts and the Microsoft product team to learn just how easy it is to get started. Watch now
Hi everyone,
I need to write a very specific deduplication query and I am wondering if you can help me out. We have an ecommerce site and I am getting a daily CSV file that has transactional information for each day, and I have built a few Power BI reports based on that dataset. Over time, in a certain situation, parts of a transaction can get double counted. So here's what I am trying to do.
Within a Master Order Number, I want to check for duplicate Item IDs. If an Item ID is duplicated within a Master Order Number, its sales should NOT be counted IF the Invoice Date is blank. So in the data below, the rows I highlighted should not be counted.
I am scratching my head on figuring out how to do this, so any insight or suggestions are appreciated! I have created a RANKX function that looks within a Master Order # and will rank the ITEM IDs. But, I am not sure what to do with that info because I'd still have to write a query that checks if the "Check for dupes" column is equal, and then find the Item ID that has a blank Invoice Number. So I don't think that has done me any good.
Thanks in advance for your wisdom!
Solved! Go to Solution.
Hi @Anonymous
Try this code.
Remove Duplicate =
SUMMARIZE (
FILTER (
'Table',
IF (
VAR _COUNTROW =
CALCULATE (
COUNTROWS ( 'Table' ),
FILTER (
ALL ( 'Table' ),
'Table'[Master Order Number] = EARLIER ( 'Table'[Master Order Number] )
&& 'Table'[Item ID] = EARLIER ( 'Table'[Item ID] )
)
)
RETURN
_COUNTROW > 1,
'Table'[RANK] <> 1
|| 'Table'[Invoice Date] <> BLANK (),
'Table'[RANK] = 1
)
),
'Table'[Order Number],
'Table'[Master Order Number],
'Table'[Order Date],
'Table'[Invoice Date],
'Table'[Item ID]
)
Result is as below.
Best Regards,
Rico Zhou
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Deduplication is much easier to do in Power Query (unless it depends on user filters).
Please provide sanitized sample data that fully covers your issue. Paste the data into a table in your post or use one of the file services. Please show the expected outcome.
@lbendlin thanks for the reply.
Table 1: Sample of data before filter showing all the order IDs within Master Order # 10034858.
Order Number | Master Order Number | Order Date | Invoice Date | Item ID |
10034858 | 10034858 | 7/12/2021 | CA-ST-EA | |
10034858 | 10034858 | 7/12/2021 | CA-BB-CS | |
10034858 | 10034858 | 7/12/2021 | CC-USR-04 | |
10034858 | 10034858 | 7/12/2021 | CC-UPTP-04 | |
20405998 | 10034858 | 7/12/2021 | FP-20EHH-CS | |
20405999 | 10034858 | 7/12/2021 | 3M-SGCUP-01 | |
10034858 | 10034858 | 7/12/2021 | 7/13/2021 | CC-USR-04 |
10034858 | 10034858 | 7/12/2021 | 7/13/2021 | CC-UPTP-04 |
10034858 | 10034858 | 7/12/2021 | 7/13/2021 | CA-ST-EA |
10034858 | 10034858 | 7/12/2021 | 7/13/2021 | CA-BB-CS |
20405998 | 10034858 | 7/12/2021 | 7/13/2021 | FP-20EHH-CS |
20405999 | 10034858 | 7/12/2021 | 7/13/2021 | 3M-SGCUP-01 |
Table 2: Data after filter applied; duplicate order IDs within Master Order # 10034858 WHERE Invoice Date = NULL have been removed.
(Remember I want this logic to look within all Master Order #s. I have just filtered down to one Master Order # to simplify the example output.)
Order Number | Master Order Number | Order Date | Invoice Date | Item ID |
10034858 | 10034858 | 7/12/2021 | 7/13/2021 | CC-USR-04 |
10034858 | 10034858 | 7/12/2021 | 7/13/2021 | CC-UPTP-04 |
10034858 | 10034858 | 7/12/2021 | 7/13/2021 | CA-ST-EA |
10034858 | 10034858 | 7/12/2021 | 7/13/2021 | CA-BB-CS |
20405998 | 10034858 | 7/12/2021 | 7/13/2021 | FP-20EHH-CS |
20405999 | 10034858 | 7/12/2021 | 7/13/2021 | 3M-SGCUP-01 |
Thanks again.
Hi @Anonymous
Try this rank code.
RANK =
RANKX (
FILTER (
'Table',
'Table'[Order Number] = EARLIER ( 'Table'[Order Number] )
&& 'Table'[Master Order Number] = EARLIER ( 'Table'[Master Order Number] )
&& 'Table'[Item ID] = EARLIER ( 'Table'[Item ID] )
&& 'Table'[Order Date] = EARLIER ( 'Table'[Order Date] )
),
'Table'[Invoice Date],
,
ASC
)
This code will create a rank based on each [Order Number],[Master Order Number],[Order Date] and [Item ID] group.
Then create a new dax table to remove rank =1.
Remove Duplicate =
SUMMARIZE(FILTER('Table','Table'[RANK]<>1),'Table'[Order Number],'Table'[Master Order Number],'Table'[Order Date],'Table'[Invoice Date],'Table'[Item ID])
Result is as below.
Or you can calcualte sum directly by add rank<>1 in your table filter. You don't need to create a new table.
Such as:
Measure = CALCULATE(SUM([Sales],Filter('Table',[Rank]<>1))
Best Regards,
Rico Zhou
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Hi @v-rzhou-msft, thank you so much for the reply and for taking the time to help me out. I think this gets me closer but it's not quite what I need. I may not have explained it very clearly, but I only want to suppress rows when the following conditions are met:
So consider the screenshot below. The ones highlighted in green should be removed, and the white ones left alone, which your script did perfectly.
But the yellow ones I highlighted, which are part of a different Master Order #, should not be removed because they do not have duplicate Item IDs nor do they have a blank Invoice Date. So I don't want to remove the yellow rows. Does this help clarify what I am asking for?
Hi @Anonymous
Add a filter in calculated table code.
Remove Duplicate =
SUMMARIZE (
FILTER ( 'Table', OR ( 'Table'[RANK] <> 1, 'Table'[Invoice Date] <> BLANK () ) ),
'Table'[Order Number],
'Table'[Master Order Number],
'Table'[Order Date],
'Table'[Invoice Date],
'Table'[Item ID]
)
Result is as below.
Best Regards,
Rico Zhou
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
@v-rzhou-msft -- even closer but one thing is still missing. I still want to count orders that were placed but not yet invoiced. Only when there were duplicate Item IDs within a Master Order # that had blank Invoice Dates did I want to discard them from the total.
Consider the Master Order # below, #10007106. It does not have an invoice date (yet) but does not have any duplicate Item IDs. In other words, it has not yet been invoiced but has not been double counted. It should still be counted in total sales (I created a separate metric for invoiced sales).
So how could I modify either the deduplication table or the rank column to account for this? Again, thanks for all your help and for sticking with me on this one.
Hi @Anonymous
Try this code.
Remove Duplicate =
SUMMARIZE (
FILTER (
'Table',
IF (
VAR _COUNTROW =
CALCULATE (
COUNTROWS ( 'Table' ),
FILTER (
ALL ( 'Table' ),
'Table'[Master Order Number] = EARLIER ( 'Table'[Master Order Number] )
&& 'Table'[Item ID] = EARLIER ( 'Table'[Item ID] )
)
)
RETURN
_COUNTROW > 1,
'Table'[RANK] <> 1
|| 'Table'[Invoice Date] <> BLANK (),
'Table'[RANK] = 1
)
),
'Table'[Order Number],
'Table'[Master Order Number],
'Table'[Order Date],
'Table'[Invoice Date],
'Table'[Item ID]
)
Result is as below.
Best Regards,
Rico Zhou
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Thank you very much @v-rzhou-msft this has done the trick. Appreciate your expertise and assistance.
March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!
User | Count |
---|---|
90 | |
89 | |
85 | |
73 | |
49 |
User | Count |
---|---|
167 | |
148 | |
92 | |
72 | |
58 |