Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Get certified in Microsoft Fabric—for free! For a limited time, the Microsoft Fabric Community team will be offering free DP-600 exam vouchers. Prepare now

Reply
Anonymous
Not applicable

Conditional deduplication based on multiple columns

Hi everyone,

 

I need to write a very specific deduplication query and I am wondering if you can help me out. We have an ecommerce site and I am getting a daily CSV file that has transactional information for each day, and I have built a few Power BI reports based on that dataset. Over time, in a certain situation, parts of a transaction can get double counted. So here's what I am trying to do.

 

Within a Master Order Number, I want to check for duplicate Item IDs. If an Item ID is duplicated within a Master Order Number, its sales should NOT be counted IF the Invoice Date is blank. So in the data below, the rows I highlighted should not be counted.

 

2021-11-04_10-51-34.png

 

I am scratching my head on figuring out how to do this, so any insight or suggestions are appreciated! I have created a RANKX function that looks within a Master Order # and will rank the ITEM IDs. But, I am not sure what to do with that info because I'd still have to write a query that checks if the "Check for dupes" column is equal, and then find the Item ID that has a blank Invoice Number. So I don't think that has done me any good.

 

2021-11-04_11-04-26.png

 

Thanks in advance for your wisdom!

1 ACCEPTED SOLUTION

Hi @Anonymous 

Try this code.

Remove Duplicate =
SUMMARIZE (
    FILTER (
        'Table',
        IF (
            VAR _COUNTROW =
                CALCULATE (
                    COUNTROWS ( 'Table' ),
                    FILTER (
                        ALL ( 'Table' ),
                        'Table'[Master Order Number] = EARLIER ( 'Table'[Master Order Number] )
                            && 'Table'[Item ID] = EARLIER ( 'Table'[Item ID] )
                    )
                )
            RETURN
                _COUNTROW > 1,
            'Table'[RANK] <> 1
                || 'Table'[Invoice Date] <> BLANK (),
            'Table'[RANK] = 1
        )
    ),
    'Table'[Order Number],
    'Table'[Master Order Number],
    'Table'[Order Date],
    'Table'[Invoice Date],
    'Table'[Item ID]
)

Result is as below.

1.png

Best Regards,
Rico Zhou

 

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

View solution in original post

8 REPLIES 8
lbendlin
Super User
Super User

Deduplication is much easier to do in Power Query (unless it depends on user filters).

 

Please provide sanitized sample data that fully covers your issue. Paste the data into a table in your post or use one of the file services. Please show the expected outcome.

Anonymous
Not applicable

@lbendlin thanks for the reply.

 

Table 1: Sample of data before filter showing all the order IDs within Master Order # 10034858.

 

Order NumberMaster Order NumberOrder DateInvoice DateItem ID
10034858100348587/12/2021 CA-ST-EA
10034858100348587/12/2021 CA-BB-CS
10034858100348587/12/2021 CC-USR-04
10034858100348587/12/2021 CC-UPTP-04
20405998100348587/12/2021 FP-20EHH-CS
20405999100348587/12/2021 3M-SGCUP-01
10034858100348587/12/20217/13/2021CC-USR-04
10034858100348587/12/20217/13/2021CC-UPTP-04
10034858100348587/12/20217/13/2021CA-ST-EA
10034858100348587/12/20217/13/2021CA-BB-CS
20405998100348587/12/20217/13/2021FP-20EHH-CS
20405999100348587/12/20217/13/20213M-SGCUP-01

 

Table 2: Data after filter applied; duplicate order IDs within Master Order # 10034858 WHERE Invoice Date = NULL have been removed.

(Remember I want this logic to look within all Master Order #s. I have just filtered down to one Master Order # to simplify the example output.)

 

Order NumberMaster Order NumberOrder DateInvoice DateItem ID
10034858100348587/12/20217/13/2021CC-USR-04
10034858100348587/12/20217/13/2021CC-UPTP-04
10034858100348587/12/20217/13/2021CA-ST-EA
10034858100348587/12/20217/13/2021CA-BB-CS
20405998100348587/12/20217/13/2021FP-20EHH-CS
20405999100348587/12/20217/13/20213M-SGCUP-01

 

Thanks again.

Hi @Anonymous 

Try this rank code.

 

RANK = 
RANKX (
    FILTER (
        'Table',
        'Table'[Order Number] = EARLIER ( 'Table'[Order Number] )
            && 'Table'[Master Order Number] = EARLIER ( 'Table'[Master Order Number] )
            && 'Table'[Item ID] = EARLIER ( 'Table'[Item ID] )
            && 'Table'[Order Date] = EARLIER ( 'Table'[Order Date] )
    ),
    'Table'[Invoice Date],
    ,
    ASC
)

 

This code will create a rank based on each [Order Number],[Master Order Number],[Order Date] and [Item ID] group. 

1.png

Then create a new dax table to remove rank =1.

 

Remove Duplicate = 
SUMMARIZE(FILTER('Table','Table'[RANK]<>1),'Table'[Order Number],'Table'[Master Order Number],'Table'[Order Date],'Table'[Invoice Date],'Table'[Item ID])

 

Result is as below.

1.png

Or you can calcualte sum directly by add rank<>1 in your table filter. You don't need to create a new table.

Such as:

Measure = CALCULATE(SUM([Sales],Filter('Table',[Rank]<>1))

 

Best Regards,
Rico Zhou

 

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Anonymous
Not applicable

Hi @v-rzhou-msft, thank you so much for the reply and for taking the time to help me out. I think this gets me closer but it's not quite what I need. I may not have explained it very clearly, but I only want to suppress rows when the following conditions are met:

 

  • There are duplicate Item IDs within a Master Order # AND those duplicate Item IDs contain a blank/NULL invoice date.

So consider the screenshot below. The ones highlighted in green should be removed, and the white ones left alone, which your script did perfectly.

 

But the yellow ones I highlighted, which are part of a different Master Order #, should not be removed because they do not have duplicate Item IDs nor do they have a blank Invoice Date. So I don't want to remove the yellow rows. Does this help clarify what I am asking for?

 

2021-11-09_9-19-05.png

Hi @Anonymous 

Add a filter in calculated table code.

Remove Duplicate =
SUMMARIZE (
    FILTER ( 'Table', OR ( 'Table'[RANK] <> 1, 'Table'[Invoice Date] <> BLANK () ) ),
    'Table'[Order Number],
    'Table'[Master Order Number],
    'Table'[Order Date],
    'Table'[Invoice Date],
    'Table'[Item ID]
)

Result is as below.

1.png

Best Regards,
Rico Zhou

 

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Anonymous
Not applicable

@v-rzhou-msft -- even closer but one thing is still missing. I still want to count orders that were placed but not yet invoiced. Only when there were duplicate Item IDs within a Master Order # that had blank Invoice Dates did I want to discard them from the total.

 

Consider the Master Order # below, #10007106. It does not have an invoice date (yet) but does not have any duplicate Item IDs. In other words, it has not yet been invoiced but has not been double counted. It should still be counted in total sales (I created a separate metric for invoiced sales).

 

So how could I modify either the deduplication table or the rank column to account for this? Again, thanks for all your help and for sticking with me on this one.

2021-11-10_11-35-21.png

 

Hi @Anonymous 

Try this code.

Remove Duplicate =
SUMMARIZE (
    FILTER (
        'Table',
        IF (
            VAR _COUNTROW =
                CALCULATE (
                    COUNTROWS ( 'Table' ),
                    FILTER (
                        ALL ( 'Table' ),
                        'Table'[Master Order Number] = EARLIER ( 'Table'[Master Order Number] )
                            && 'Table'[Item ID] = EARLIER ( 'Table'[Item ID] )
                    )
                )
            RETURN
                _COUNTROW > 1,
            'Table'[RANK] <> 1
                || 'Table'[Invoice Date] <> BLANK (),
            'Table'[RANK] = 1
        )
    ),
    'Table'[Order Number],
    'Table'[Master Order Number],
    'Table'[Order Date],
    'Table'[Invoice Date],
    'Table'[Item ID]
)

Result is as below.

1.png

Best Regards,
Rico Zhou

 

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Anonymous
Not applicable

Thank you very much @v-rzhou-msft this has done the trick. Appreciate your expertise and assistance.

Helpful resources

Announcements
OCT PBI Update Carousel

Power BI Monthly Update - October 2024

Check out the October 2024 Power BI update to learn about new features.

September Hackathon Carousel

Microsoft Fabric & AI Learning Hackathon

Learn from experts, get hands-on experience, and win awesome prizes.

October NL Carousel

Fabric Community Update - October 2024

Find out what's new and trending in the Fabric Community.