Hello
I have a dataset that is too big to fit in Excel, and I would like to filter the data according to the following logic:
The dataset has two columns: a datetime column and a value column.
The logic for which rows are retained is as follows:
* The 1st record is always kept.
* For every record after the first: if |value(i) - value(previously kept)| >= 4, keep the record; otherwise discard (remove) it.
This way I retain only the records (datetime/value pairs) where the value has changed by 4 or more relative to the previously kept record. For example, with a threshold of 4 and values 10, 12, 15, 16, 20, the kept values would be 10, 15 and 20 (15 is compared against 10, because 12 was discarded).
I managed to implement this logic in Excel by adding a helper column that, for each row, carries either the new value or the previous one using a simple IF formula.
Then I filtered that helper column to get rid of the repeating values. I tried to explain the details in the picture below and in the attached Excel file.
I would like to perform the same logic in Power Query, so that I can drop the data I do not need and end up with fewer rows that will fit in Excel.
I would be very glad if someone could help me with this.
Best regards
See the working here: open a blank query, go to Home > Advanced Editor, remove everything there, and paste the code below to test. (Later, when you use the query on your own dataset, you will have to change the Source step appropriately.)
My first column has date-times in sequence, so it may not match yours, but the second column will.
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("fdJBDoQgEETRqxjWJtNVoIBXMd7/GlKztjph99Kh+eG+S+C3DoPcIq7/KXsByrN/ITKkMAzWhdNY0yANHsLD4ClsBnu2z1jYjc1l57chElOdYYz+PqgNzaJQHJrkUByaclAcVoM9mxwZzmQhRvIUIsPs91CFqkMVquadVKG2Jp8X", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [datetime = _t, value = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"datetime", type datetime}, {"value", Int64.Type}}),
#"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 1, 1, Int64.Type),
#"Added Custom" = Table.AddColumn(#"Added Index", "Custom", each List.Accumulate(List.FirstN(#"Changed Type"[value],[Index]),0,(s,c)=> if Number.Abs(c-s)<4 then s else c)),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each try if #"Added Custom"[Custom]{[Index]-2}=[Custom] then 2 else 1 otherwise 1),
#"Filtered Rows" = Table.SelectRows(#"Added Custom1", each ([Custom.1] = 1)),
#"Removed Columns" = Table.RemoveColumns(#"Filtered Rows",{"Index", "Custom.1", "Custom"})
in
#"Removed Columns"
Thank you very much for your help. It worked, but it is very slow for very big data, and I suspect there is not much that can be done to make it more efficient.
Best regards
I suspect the step in @Vijay_A_Verma's solution that causes the slowdown is #"Added Custom1", because it looks back into the whole Custom column for every row.
See if this is faster for bigger data:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("fdJBDoQgEETRqxjWJtNVoIBXMd7/GlKztjph99Kh+eG+S+C3DoPcIq7/KXsByrN/ITKkMAzWhdNY0yANHsLD4ClsBnu2z1jYjc1l57chElOdYYz+PqgNzaJQHJrkUByaclAcVoM9mxwZzmQhRvIUIsPs91CFqkMVquadVKG2Jp8X", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [datetime = _t, value = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"datetime", type datetime}, {"value", Int64.Type}}),
#"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 1, 1, Int64.Type),
// buffer the value column in memory so List.Generate can index it cheaply
Vals = List.Buffer(#"Added Index"[value]),
// walk the buffered values once: i = row index (matching the Index column), C = last kept value,
// D = 1 when the value changed by 4 or more (row kept), 2 otherwise (row discarded)
ListGen = List.Generate(
() => [i = 1, C = Vals{0}, D = 1],   // state for the 1st row, which is always kept
each [i] <= List.Count(Vals),
each [
i = [i] + 1,
// carry the last kept value forward, or adopt the new value when it moved by 4 or more
C = if Number.Abs(Vals{[i]} - [C]) < 4 then [C] else Vals{[i]},
// compare the new kept value with the previous one: 2 = unchanged (discard), 1 = changed (keep)
D = if C = [C] then 2 else 1
]
),
ToTable = Table.FromRecords(ListGen),
#"Merged Queries" = Table.NestedJoin(#"Added Index", {"Index"}, ToTable, {"i"}, "ToTable", JoinKind.LeftOuter),
#"Expanded ToTable" = Table.ExpandTableColumn(#"Merged Queries", "ToTable", {"C", "D"}, {"new value", "is there a change"}),
#"Filtered Rows" = Table.SelectRows(#"Expanded ToTable", each ([is there a change] = 1)),
#"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"datetime", "value"})
in
#"Removed Other Columns"
Thanks a lot for the help. Your method is faster. I was running into trouble with @Vijay_A_Verma's solution at 2 million rows, whereas with your solution I could process 2 million rows easily. For more than 4 million rows, however, the computer got stuck and would not respond. I then switched to a more powerful PC (an i9-11900K with 64 GB RAM; the previous one was an i7-7500U with 8 GB RAM) and the difference is tremendous: I can now process 8 million rows in around 4 minutes and produce a smaller file with 800k rows.