Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
dinsyp
Frequent Visitor

How to remove duplicates between initial date and 21 days

Hi People,

 

As a newbie to Power Query language, I looking for guidance.

 

I have a large table (more than 60000 entries) of training records within which some people have repeated the learning activity more than once within a period of 21 days, even on the same day. Removing duplicate entries on the same day I am OK with including using the table.buffer to keep the sorted data in memory. Figuring out the formula for the other removal is not as easy, made probably more complicated by the desire to also run the check within the 21 day period e.g. someone may have initial completed the learning activity on 1 May, then repeats it on 15 May (delete this entry), repeats again on 25 May (delete this entry because its within 21 days of the previous). Only want to retain the 1 May entry.

 

The data is sorted by UserID, CourseID and Completion Date and the duplication check is then looking for duplicate entries within that sort. Hope that makes sense.

 

Have I provided enough information?

 

TIA. Andrew

1 ACCEPTED SOLUTION

Wait what, are you running this in Excel?

 

let
    Source = Excel.CurrentWorkbook(),
    Table61617_2 = Source{[Name="Table61617_2"]}[Content],
    #"Removed Other Columns" = Table.SelectColumns(Table61617_2,{"Participant ID", "Course ID", "Course Completion Date"}),
    #"Changed Type" = Table.Buffer(Table.TransformColumnTypes(#"Removed Other Columns",{{"Course Completion Date", type date}})),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"Participant ID", "Course ID"}, {{"Rows", each _, type table [Course Completion Date=nullable date]}}),
    Check = (tbl)=> Table.AddColumn(tbl,"Flag",(k)=> Table.RowCount(Table.SelectRows(tbl,each [Course Completion Date]<k[Course Completion Date] and [Course Completion Date]>k[Course Completion Date]-#duration(21,0,0,0)))),
    #"Added Custom1" = Table.AddColumn(#"Grouped Rows", "FlagNew", each if Table.RowCount([Rows])=1 then Table.AddColumn([Rows],"Flag",each 0) else Check([Rows])),
    #"Removed Other Columns1" = Table.SelectColumns(#"Added Custom1",{"Participant ID", "Course ID", "FlagNew"}),
    #"Expanded FlagNew" = Table.ExpandTableColumn(#"Removed Other Columns1", "FlagNew", {"Course Completion Date", "Flag"}, {"Course Completion Date", "Flag"})
in
    #"Expanded FlagNew"

View solution in original post

19 REPLIES 19
dinsyp
Frequent Visitor

Looks like I ran into the no headers in table problem. Going to start a new thread as recommended.

if you delete the 15 May entry then the 25 May entry has nothing to compare to.  Instead of deleting rows you will want to mark them.

Thanks. That is what the original VBA was doing - it created the Keep/Delete>21 Days entries. Was hoping to avoid that:-). I tried/am trying to send the data as a proper table but keep getting an error message of too many characters. Is it still useful to try to send?

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("nZa9rt4gDIZvJfrmI8U2f2E/Wy/h9AwdsnWo1Hbo3ddgSIAASbtEivT4xRjz4o+PFwCCd8q83l6kPH8VrGBWAnT882Xff7w+3zoY4gpbxr7+BlD7Qri8f/vz8xKgA4NQBCxXZuMvA6AC40sGwVFU8HFht7JUJ78aI1rBMUY4zC8HsFb4+pX3NFqbog6axLguE3cCVupHMMhPMOT8aJ4fOHQU1w5f8CvEtbf2XCoM1YqUt9HVBQdI/tg317Or22B8NFL2B7paT/JtMGTMZmyuqyD10ZlITzdhWpaPx9DX9cqiORqmOrelZaQICtPdaNeuMWZOqfnapFX4GmnWtqdbzEnPTGp1BoTikr3V1aIrJR3q8j3VVp8bpOIu1x5RYny20oszj0gByqWeGXhPiZEuruET3X9IRAEkIzLdy1ZhCAX2vn/ff+0nSEjHbQ+mavsbG2KNngaxKYgmuI0OoMKo9MrBvqsA7pbbyuYAhEeJROxeF/yGsYzJvrkTdLdzK4xfC2nwsSvlgOTyRiocbbenm7CtwKqDOMH4qt3rCfZAzzzTM3d6zuiiQCZ1gG31asymC2LHdawCgrPYmwNNAUSXgF4igvGB3jZKDlDQ6i5XRnrwbm3B7Oj2FSA906OpHs8C6OGwEcp22rpNg6kCG+mFd4JvThyjqqesYPL7SBPGi73h8Yb284qYeTBqDQPajXANjy4Lfqj6s16JPbC5OgDvbS4HyGCqcGBzE6zuIKeULC0Z6/5DN8NmejzqnIPzcEPoH1W2wNA+KBQqXRmy+u+AS1NHUAZbviTd2TszNtWr+8xOsPaiexVHLbDpddPdea/CQjHdaN77/As=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Participant ID" = _t, #"Course ID" = _t, #"Course Completion" = _t, Decision = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Participant ID", Int64.Type}, {"Course ID", Int64.Type}, {"Course Completion", type date}, {"Decision", type text}},"en-GB"),
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Flag", (k)=> Table.RowCount( Table.SelectRows(#"Changed Type",each [Participant ID]=k[Participant ID] and [Course ID]=k[Course ID] and [Course Completion]<k[Course Completion] and [Course Completion]>k[Course Completion]-#duration(21,0,0,0))))
in
    #"Added Custom"

How to use this code: Create a new Blank Query. Click on "Advanced Editor". Replace the code in the window with the code provided here. Click "Done".

Thanks very much for this:-)

 

I've done a test with the full data set and discovered the code only gerenates a small subset of the data. The full data will expand in each month's report. Wondering how this might be achieved?

 

Andrew

Please provide sanitized sample data that fully covers your issue.

Apologies for the delay. Here's the link to the full set of anaymised data https://correctionsnz.sharefile.com/d-s5fd9a416deda4f49b12e324509ca082a. As I noted, the data set will expand monthly and is always in a table named "Table61617". Thanks for your help.

Here is a preliminary version. It is not very fast but I think I can improve on it.

 

let
    Source = Excel.Workbook(Binary.Buffer(File.Contents("C:\Users\xxx\Downloads\Anonymised Data 230523.xlsx")), null, true),
    Table61617_2_Table = Source{[Item="Table61617_2",Kind="Table"]}[Data],
    #"Removed Other Columns" = Table.SelectColumns(Table61617_2_Table,{"Participant ID", "Course ID", "Course Completion Date"}),
    #"Changed Type" = Table.Buffer(Table.TransformColumnTypes(#"Removed Other Columns",{{"Course Completion Date", type date}})),
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Flag", (k)=> Table.RowCount( Table.SelectRows(#"Changed Type",each [Participant ID]=k[Participant ID] and [Course ID]=k[Course ID] and [Course Completion Date]<k[Course Completion Date] and [Course Completion Date]>k[Course Completion Date]-#duration(21,0,0,0))))
in
    #"Added Custom"

This version performs better over larger datasets.

 

let
    Source = Excel.Workbook(File.Contents("C:\Users\xxx\Downloads\Anonymised Data 230523.xlsx"), null, true),
    Table61617_2_Table = Source{[Item="Table61617_2",Kind="Table"]}[Data],
    #"Removed Other Columns" = Table.SelectColumns(Table61617_2_Table,{"Participant ID", "Course ID", "Course Completion Date"}),
    #"Changed Type" = Table.Buffer(Table.TransformColumnTypes(#"Removed Other Columns",{{"Course Completion Date", type date}})),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"Participant ID", "Course ID"}, {{"Rows", each _, type table [Course Completion Date=nullable date]}}),
    Check = (tbl)=> Table.AddColumn(tbl,"Flag",(k)=> Table.RowCount(Table.SelectRows(tbl,each [Course Completion Date]<k[Course Completion Date] and [Course Completion Date]>k[Course Completion Date]-#duration(21,0,0,0)))),
    #"Added Custom1" = Table.AddColumn(#"Grouped Rows", "FlagNew", each if Table.RowCount([Rows])=1 then Table.AddColumn([Rows],"Flag",each 0) else Check([Rows])),
    #"Removed Other Columns1" = Table.SelectColumns(#"Added Custom1",{"Participant ID", "Course ID", "FlagNew"}),
    #"Expanded FlagNew" = Table.ExpandTableColumn(#"Removed Other Columns1", "FlagNew", {"Course Completion Date", "Flag"}, {"Course Completion Date", "Flag"})
in
    #"Expanded FlagNew"

I've modified the Source based on what the editor cmoes up with then I load the Advanced editor.

let
Source = Excel.CurrentWorkbook(){[Name="Table61617_2"]}[Content],
Table61617_2_Table = Source{[Item="Table61617_2",Kind="Table"]}[Data],
#"Removed Other Columns" = Table.SelectColumns(Table61617_2_Table,{"Participant ID", "Course ID", "Course Completion Date"}),
#"Changed Type" = Table.Buffer(Table.TransformColumnTypes(#"Removed Other Columns",{{"Course Completion Date", type date}})),
#"Grouped Rows" = Table.Group(#"Changed Type", {"Participant ID", "Course ID"}, {{"Rows", each _, type table [Course Completion Date=nullable date]}}),
Check = (tbl)=> Table.AddColumn(tbl,"Flag",(k)=> Table.RowCount(Table.SelectRows(tbl,each [Course Completion Date]<k[Course Completion Date] and [Course Completion Date]>k[Course Completion Date]-#duration(21,0,0,0)))),
#"Added Custom1" = Table.AddColumn(#"Grouped Rows", "FlagNew", each if Table.RowCount([Rows])=1 then Table.AddColumn([Rows],"Flag",each 0) else Check([Rows])),
#"Removed Other Columns1" = Table.SelectColumns(#"Added Custom1",{"Participant ID", "Course ID", "FlagNew"}),
#"Expanded FlagNew" = Table.ExpandTableColumn(#"Removed Other Columns1", "FlagNew", {"Course Completion Date", "Flag"}, {"Course Completion Date", "Flag"})
in
#"Expanded FlagNew"

 Getting the following error. Possibly due to the Source code?

dinsyp_0-1684975992639.png

Thanks for your support.

 

 

Wait what, are you running this in Excel?

 

let
    Source = Excel.CurrentWorkbook(),
    Table61617_2 = Source{[Name="Table61617_2"]}[Content],
    #"Removed Other Columns" = Table.SelectColumns(Table61617_2,{"Participant ID", "Course ID", "Course Completion Date"}),
    #"Changed Type" = Table.Buffer(Table.TransformColumnTypes(#"Removed Other Columns",{{"Course Completion Date", type date}})),
    #"Grouped Rows" = Table.Group(#"Changed Type", {"Participant ID", "Course ID"}, {{"Rows", each _, type table [Course Completion Date=nullable date]}}),
    Check = (tbl)=> Table.AddColumn(tbl,"Flag",(k)=> Table.RowCount(Table.SelectRows(tbl,each [Course Completion Date]<k[Course Completion Date] and [Course Completion Date]>k[Course Completion Date]-#duration(21,0,0,0)))),
    #"Added Custom1" = Table.AddColumn(#"Grouped Rows", "FlagNew", each if Table.RowCount([Rows])=1 then Table.AddColumn([Rows],"Flag",each 0) else Check([Rows])),
    #"Removed Other Columns1" = Table.SelectColumns(#"Added Custom1",{"Participant ID", "Course ID", "FlagNew"}),
    #"Expanded FlagNew" = Table.ExpandTableColumn(#"Removed Other Columns1", "FlagNew", {"Course Completion Date", "Flag"}, {"Course Completion Date", "Flag"})
in
    #"Expanded FlagNew"

Marvellous. Thanks so much for your knowledge and hard work. So much appreciated.

Yes. Sorry I didn't note this.

see message 17

Thanks. Unfortunately, the current data set is 47798 rows of data with 19 columns so will be too large to paste into the forum. The row numbers will increase at least 1000 in the following month. Any other way of providing the sanitised data?

Thanks. I missed the bit about uploading data with a link. Apologies. In process of anonymising the data which might take a while so aim to post a link tomorrow. Thanks again for your help.

lbendlin
Super User
Super User

Please provide sample data that covers your issue or question completely, in a usable format (not as a screenshot).
https://community.powerbi.com/t5/Community-Blog/How-to-provide-sample-data-in-the-Power-BI-Forum/ba-...
Please show the expected outcome based on the sample data you provided.

https://community.powerbi.com/t5/Desktop/How-to-Get-Your-Question-Answered-Quickly/m-p/1447523

Thanks. Below is a small snippet of the data with the first 3 columns being the criteria for the delete decision. The Action column contains the main reason for keeping or deleting, while the Reason for delete expands on that.

 

Participant IDCourse IDCourse Completion DateActionReason for Delete
0010973523930/05/2017Keep 
0010973523911/08/2017> 21 Days 
0010973524710/08/2017  
0010973524808/03/2019  
0011072719917/10/2017Keep 
0011072719922/07/2021> 21 Days 
0011072720129/11/2019  
0011072722115/11/2017  
0011072723506/05/2020Keep 
0011072723512/02/2021> 21 Days 
0010717222209/01/2018Keep 
0010717222213/12/2019> 21 Days 
0007012920117/01/2018Keep 
0007012920108/10/2019> 21 Days 
0007012924409/01/2018Keep 
0007012924419/06/2018> 21 Days 
0007012930710/10/2019Keep 
0007012930714/01/2020> 21 Days 
0009361519906/05/2020  
0009361520131/05/2019Keep 
0009361520105/05/2020> 21 Days 
0009361524325/07/2017Keep 
0009361524327/02/2018> 21 Days 
0009361524426/07/2017Keep 
0009361524427/10/2018> 21 Days 
0010946420102/08/2017Keep 
0010946420119/12/2017> 21 Days 
0010946423709/05/2017Keep 
0010946423724/11/2017> 21 Days 
0010946423719/12/2017> 21 Days 
0010946430008/05/2018Keep 
0010946430010/05/2018Delete< 21 days
0010921223530/06/2017Keep 
0010921223530/06/2017DeleteSame day
0010400120918/08/2017Keep 
0010400120927/10/2017> 21 Days 
0010400120907/11/2017> 21 Days 
0010400121018/08/2017Keep 
0010400121007/11/2017> 21 Days 
0009811819929/04/2017Keep 
0009811819903/07/2019> 21 Days 
0009811820125/06/2021Keep 
0009811820128/06/2021Delete<21 days
0009811824725/06/2021Keep 
0009811824728/06/2021Delete<21 days
0009811825725/06/2021Keep 
0009811825728/06/2021Delete<21 days
0009754819905/10/2016Keep 
0009754819906/12/2016> 21 Days 
0009754819926/06/2017> 21 Days 
0009754822926/06/2017Keep 
0009754822903/11/2017> 21 Days 
0009754823026/06/2017  
0009754823121/06/2017Keep 
0009754823126/06/2017Delete<21 days
0009754823221/06/2017Keep 
0009754823226/06/2017Delete<21 days
0011019030022/08/2018Keep 
0011019030023/08/2018Delete<21 days
0011019030304/03/2020  
0011019030714/02/2020  
0011019030910/11/2020Keep 
0011019030915/07/2021> 21 Days 
0011019030915/07/2021DeleteSame day
0011097519930/03/2017Keep 
0011097519927/10/2017> 21 Days 
0011097519917/11/2017> 21 Days 
0011097524831/08/2017Keep 
0011097524831/08/2017DeleteSame day
0007339919924/05/2017Keep 
0007339919924/05/2017DeleteSame day
0007339919901/03/2019> 21 Days 
0011091919930/03/2017Keep 
0011091919916/11/2017> 21 Days 
0011013419903/03/2017> 21 Days 
0011013419903/03/2017DeleteSame day
0011013422222/11/2017  
0011013422624/06/2017Keep 
0011013422624/06/2017DeleteSame day
0009793320618/04/2019Keep 
0009793320630/07/2020> 21 Days 

Hope this works. Was getting an error that the message was too long.

 

Andrew

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors
Top Kudoed Authors