The ultimate Fabric, Power BI, SQL, and AI community-led learning event. Save €200 with code FABCOMM.
Get registeredCompete to become Power BI Data Viz World Champion! First round ends August 18th. Get started.
How can I filter rows based on comparisons to other rows?
My desired filter is "Remove any row that has another row within X rows whose timestamp is more than Y seconds away from the timestamp of the next row"
My data is a series of about 600 CSV files with around 7000 rows and <500kB each, totalling less than 400MB. If it makes a difference, each file is in order by the Time column. Sample data:
Time (s) | Data |
0 | A |
1.012 | B |
2.022 | C |
3.032 | D |
4.041 | E |
5.05 | F |
5.906 | G |
5.917 | H |
5.937 | I |
5.948 | J |
5.959 | K |
5.97 | L |
5.98 | M |
5.991 | N |
The desired result is to remove A, B, C, D, E, F, G, H, I, J for an X rows of 3 and Y seconds of 0.025
I tried adding an index column and creating a custom column using List.Range to calculate the time difference to the previous row
try [#"Time (s)"] - List.Range(#"Reordered Columns"[#"Time (s)"], [Index]-1, 1){0} otherwise #"Max Timestep"+1
Followed by creating a second column again using List.Range to check if any values nearby exceed Y
try List.Max( List.Range(#"Reordered Columns1"[Time to Previous], [Index]-#"Filter Range", (#"Filter Range"*2)+1) ) < #"Max Timestep" otherwise false
Then just filtered out any rows that had false in that column.
This worked when testing on a single file, but encountered issues when I tried it on the whole folder. Loading has still not completed after about 3 hours at the time of writing (Imports of similar datasets without the filtering have completed in minutes in the past), and I am seeing insane file sizes in the load dialog - 100+GB from individual files in some cases.
What is a better way to do this kind of filtering?