Hi all,
I have a problem with one of my datasets that I am trying to solve.
My dataset contains the customer IDs of all the customers I have reached out to. Unfortunately, some customer IDs appear 3 times, but I only want to keep the first two occurrences of each customer ID (based on date). How do I achieve this?
For example, my dataset looks like this:
ID | Date of mail |
A | 10-03-2020 |
B | 10-03-2020 |
C | 10-03-2020 |
A | 11-03-2020 |
B | 11-03-2020 |
C | 11-03-2020 |
A | 12-03-2020 |
B | 12-03-2020 |
In this table, I would like to delete the last A and B, since those are the third occurrences of these customer IDs. Does anyone know how to make this happen?
Best,
Jeroen
Hi @Anonymous ,
We can use the Table.SelectRows function in Power Query Editor to meet your requirement:
Table.SelectRows(
    #"Changed Type",
    each
        let
            i = [ID],
            // buffer all rows for the current ID so they are only scanned once
            t = Table.Buffer(Table.SelectRows(#"Changed Type", each [ID] = i)),
            // the two rows with the earliest "Date of mail" for this ID
            minDateList = Table.MinN(t, "Date of mail", 2)
        in
            // keep the row only if the ID has a single distinct Condition
            // and the current row is among its first two dates
            Table.RowCount(Table.Distinct(Table.SelectColumns(t, {"Condition"}))) = 1
            and Table.Contains(minDateList, _)
)
Please change #"Changed Type" to the name of the last step in your own query.
The full query is here:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WclTSUTI00DUw1jUyMDIAcZRidaKVnLALO2MXBhtiiN0QTGFndGEjhCFG2A1BE44FAA==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type text) meta [Serialized.Text = true]) in type table [ID = _t, #"Date of mail" = _t, Condition = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", type text}, {"Date of mail", type date}, {"Condition", Int64.Type}}),
FilterTable = Table.SelectRows(
    #"Changed Type",
    each
        let
            i = [ID],
            t = Table.Buffer(Table.SelectRows(#"Changed Type", each [ID] = i)),
            minDateList = Table.MinN(t, "Date of mail", 2)
        in
            Table.RowCount(Table.Distinct(Table.SelectColumns(t, {"Condition"}))) = 1
            and Table.Contains(minDateList, _)
)
in
FilterTable
Best regards,
@v-lid-msft thanks for the solution. Due to the size of the different tables, I would rather fix this in DAX than in Power Query. Do you know how to fix this with DAX as well?
Best,
Jeroen
Hi @Anonymous ,
We can also create a calculated table using the following DAX to meet your requirement. The filter expression (the second parameter of FILTER) can also be used to create a calculated column in the original table.
Filtered Table =
FILTER (
'Table',
CALCULATE (
DISTINCTCOUNT ( 'Table'[Condition] ),
FILTER ( 'Table', 'Table'[ID] = EARLIER ( 'Table'[ID] ) )
) = 1
&& CALCULATE (
COUNTROWS ( 'Table' ),
FILTER (
'Table',
'Table'[ID] = EARLIER ( 'Table'[ID] )
&& 'Table'[Date of mail] <= EARLIER ( 'Table'[Date of mail] )
)
) <= 2
)
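For reference, the calculated-column form of that filter expression might look like the sketch below; the column name "Keep Row" is only an illustration, and it returns TRUE for the rows to keep:
Keep Row =
-- sketch: calculated-column form of the filter above; "Keep Row" is an illustrative name
CALCULATE (
    DISTINCTCOUNT ( 'Table'[Condition] ),
    FILTER ( 'Table', 'Table'[ID] = EARLIER ( 'Table'[ID] ) )
) = 1
    && CALCULATE (
        COUNTROWS ( 'Table' ),
        FILTER (
            'Table',
            'Table'[ID] = EARLIER ( 'Table'[ID] )
                && 'Table'[Date of mail] <= EARLIER ( 'Table'[Date of mail] )
        )
    ) <= 2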
Best regards,
Hi @v-lid-msft ,
Thanks for your help. I'm getting close. However, I do not want to filter on "condition = 1"; I want to filter out customers whose conditions are not equal across the two occurrences.
Do you know how I can change this?
Best,
Jeroen
Hi @Anonymous ,
The "=1" in the formula does not mean condition equal to 1, it means there are only one unique value in the condition for each id. Does the following example meet your requirement?
Best regards,
Hi @v-lid-msft ,
I see. My bad. Thanks for your help once again.
I tried to use the DAX formula, but after multiple hours I still saw the "Working on it" screen. Therefore, I'm now trying to do this in Power Query, but my query change is already up to 6 GB.
Any tips on this part?
Best,
Jeroen
Hi @Anonymous ,
For the DAX solution, we have simplified the formula; could you please try the following one?
Filtered Table =
FILTER (
'Table',
VAR i = [ID]
VAR d = [Date of mail]
RETURN
CALCULATE ( DISTINCTCOUNT ( 'Table'[Condition] ), 'Table', 'Table'[ID] = i ) = 1
&& CALCULATE ( COUNTROWS ( 'Table' ), 'Table', 'Table'[ID] = i, 'Table'[Date of mail] <= d ) <= 2
)
If it still does not work, could you please verify how long it takes to create a calculated column in the original table instead?
Calculated Column =
VAR i = [ID]
VAR d = [Date of mail]
RETURN
CALCULATE ( DISTINCTCOUNT ( 'Table'[Condition] ), 'Table', 'Table'[ID] = i ) = 1
&& CALCULATE ( COUNTROWS ( 'Table' ), 'Table', 'Table'[ID] = i, 'Table'[Date of mail] <= d ) <= 2
For the Power Query solution, what does "query change is already up to 6 GB" mean? If you mean the size of the data is 6 GB: applying the filter in Power Query reduces the size of the original table. With a DAX solution, however, the original table stays at 6 GB and a new calculated table is generated on top of it. For example, if DAX filters out 4 GB and keeps 2 GB, the final dataset size will be 6 + 2 = 8 GB.
Best regards,
Hi @v-lid-msft ,
I managed to make it work. Thanks for your help.
Regarding the "query change is already up to 6 GB" comment: after editing the query with the Power Query statements, applying the change loaded an enormous amount of data. Normally, when I change this query, it loads up to about 1 GB and then the changes are done. But with the statements you provided, it had loaded at least 40 GB before I cancelled the changes.
But anyway, in the end I managed to make it work thanks to your help. Thanks once again.
Best,
Jeroen
Hi @Anonymous ,
Glad to hear that you have resolved your problem. If it does not contain any confidential information, could you kindly share the improved code you used to filter the table, so that it can benefit more users?
If you have any other questions about this scenario, please kindly ask here and we will try to resolve it.
Best regards,
Well, you could create a column like this:
Count = COUNTROWS(FILTER('Table', [ID] = EARLIER([ID]) && [Date of mail] <= EARLIER([Date of mail])))
Then you could create a new table and FILTER out anything with a Count over 2, as sketched below.
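A minimal sketch of that second step, assuming the column above is named Count:
Filtered Table =
-- sketch: assumes the Count column from the previous step
FILTER ( 'Table', 'Table'[Count] <= 2 )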
@Greg_Deckler It worked! Thank you very much!
I have a follow-up question I hope you can help me with as well. My dataset has another error. Every customer now occurs twice, but each should have the same condition for both occurrences (either 1 or 2). However, this is not the case. My data looks like this now:
ID | Date of mail | Condition |
A | 10-03-2020 | 1 |
B | 10-03-2020 | 1 |
C | 10-03-2020 | 1 |
A | 11-03-2020 | 1 |
B | 11-03-2020 | 1 |
C | 11-03-2020 | 2 |
In this case, I would like to delete C because the condition is not the same for both occurrences. I want to keep A and B.
Do you know how to make this work in the same table as the one I have created using your previous solution?
I hope you are able to help.
Best,
Jeroen
@Anonymous ,
Hmm... Let's back up, because I am not getting the same results as you. Have a look at the attached PBIX.
@Greg_Deckler
Thanks for the file. The error with the different conditions was already in the dataset, so it is not something new that occurred after your solution. I only found out about it after applying your solution, but it has been there since the beginning; it's simply a human error in the data.
So your solution worked, and I used it correctly, but I am just wondering how I can also remove the other error in the dataset (with the conditions), with an end result of one correct table.
Do you understand what I mean?
Best,
Jeroen