Earn a 50% discount on the DP-600 certification exam by completing the Fabric 30 Days to Learn It challenge.
I have a dataset which is intrinsically problematic but have no plan to correct the data source (due to some reasons). Therefore, I need to correct the dataset in PowerQuery. I have done a python script to do so, but found that running a python script in PowerQuery is not efficient. I am seeking a method in PowerQuery to replicate the result.
Here is an example table.
Car ID | Task ID | Type | Start time | Finish time |
Car A | 123 | S | 14/7/2022 10:00:00 | 14/7/2022 10:00:05 |
Car A | 124 | T | 14/7/2022 10:00:01 | 14/7/2022 10:01:00 |
Car A | 124 | T | 14/7/2022 10:00:02 | 14/7/2022 10:01:05 |
Car A | 125 | S | 14/7/2022 10:01:07 | 14/7/2022 10:01:10 |
Car A | 126 | S | 14/7/2022 10:01:13 | 14/7/2022 10:01:13 |
Car B | 234 | T | 14/7/2022 09:02:00 | 14/7/2022 09:02:10 |
Car B | 234 | T | 14/7/2022 09:01:59 | 14/7/2022 09:02:10 |
Car B | 235 | S | 14/7/2022 09:02:06 | 14/7/2022 09:02:20 |
Car B | 236 | S | 14/7/2022 09:02:20 | 14/7/2022 09:02:30 |
Two cars, Car A and Car B. Assume the row order is correct, and no change will be on "Car ID", "Task ID", "Type", and "Finish time".
In "Type", "S" means single, so have only one record for its "Task ID"; "T" means twin, so have two records for its "Task ID".
Here are some rules:
The problematic column is "Start time".
So the resulting table after transformation will be as below. (those corrected records are highlighted in blue)
Car ID | Task ID | Type | Start time | Finish time |
Car A | 123 | S | 14/7/2022 10:00:00 | 14/7/2022 10:00:05 |
Car A | 124 | T | 14/7/2022 10:00:05 | 14/7/2022 10:01:05 |
Car A | 124 | T | 14/7/2022 10:00:05 | 14/7/2022 10:01:05 |
Car A | 125 | S | 14/7/2022 10:01:07 | 14/7/2022 10:01:10 |
Car A | 126 | S | 14/7/2022 10:01:10 | 14/7/2022 10:01:13 |
Car B | 234 | T | 14/7/2022 09:01:59 | 14/7/2022 09:02:10 |
Car B | 234 | T | 14/7/2022 09:01:59 | 14/7/2022 09:02:10 |
Car B | 235 | S | 14/7/2022 09:02:10 | 14/7/2022 09:02:20 |
Car B | 236 | S | 14/7/2022 09:02:20 | 14/7/2022 09:02:30 |
It is a bit long, and I have no idea whether it is feasible in PowerQuery or not. Hope someone can help and give some advice or hints.
Thank you so much!
Solved! Go to Solution.
Hi @Anonymous ,
If I understood well, this should work:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("jZDPCoAgDIdfRTwHupmJ3apHqFt06BV6/0OT/hDbikBEPvf525xnO6yb6WxlAQPtYznVLjn0iAZ868vSYLRL9bRr2ietEASE8uRPGzWbZ0e1cypMCgSe3bzYEHR42T1dY5Cd+0xt81874CP7y4Y25j+2nPvMbhSI3JZz34USBrKXHQ==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Car ID" = _t, #"Task ID" = _t, Type = _t, #"Start time1" = _t, #"Finish time1" = _t]),
#"Changed Type" = Table.Buffer(Table.TransformColumnTypes(Source,{{"Start time1", type datetime}, {"Finish time1", type datetime}})),
#"Grouped Rows" = Table.Group(#"Changed Type", {"Task ID", "Car ID"}, {{"Rows", each _, type table [Car ID=nullable text, Task ID=nullable text, Type=nullable text, Start time1=nullable text, Finish time1=nullable text]}}),
#"Align StartDate for Twin" = Table.TransformColumns(
#"Grouped Rows",
{
{"Rows", (OuterTable)=> Table.AddColumn(OuterTable,"Start time2", (Innertable)=> List.Min(OuterTable[Start time1]), type datetime)}
}),
#"Align EndDate for twin" = Table.TransformColumns(
#"Align StartDate for Twin",
{
{"Rows", (OuterTable)=> Table.AddColumn(OuterTable,"Finish time", (Innertable)=> List.Max(OuterTable[Finish time1]), type datetime)}
}),
Custom1 = Table.Combine(#"Align EndDate for twin"[Rows]),
#"Removed Other Columns" = Table.SelectColumns(Custom1,{"Car ID", "Task ID", "Type", "Start time2", "Finish time"}),
#"Grouped Rows1" = Table.Group(#"Removed Other Columns", {"Car ID"}, {{"Rows", each _, type table [Car ID=nullable text, Task ID=nullable text, Type=nullable text, Start time1=nullable datetime, Finish time1=nullable datetime, Start time=datetime]}}),
#"Added Index Column" = Table.TransformColumns(#"Grouped Rows1", {{"Rows", each Table.AddIndexColumn(_, "Index",0)}}),
#"Added Start time" = Table.TransformColumns(#"Added Index Column",
{
{"Rows", each Table.AddColumn(_,
"Start time",
(a)=>
try
if a[Task ID] = [Task ID]{a[Index]-1}
then null
else
if a[Start time2] > [Finish time]{a[Index]-1}
then
if a[Start time2] < a[Finish time]
then a[Start time2]
else [Finish time]{a[Index]-1}
else [Finish time]{a[Index]-1}
otherwise
a[Start time2])
}
}),
#"Expanded Rows" = Table.ExpandTableColumn(#"Added Start time", "Rows", {"Task ID", "Type", "Start time2", "Finish time", "Start time"}, {"Task ID", "Type", "Start time2", "Finish time", "Start time"}),
#"Removed Other Columns1" = Table.SelectColumns(#"Expanded Rows",{"Car ID", "Task ID", "Type", "Finish time", "Start time"}),
#"Reordered Columns" = Table.ReorderColumns(#"Removed Other Columns1",{"Car ID", "Task ID", "Type", "Start time", "Finish time"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Reordered Columns",{{"Car ID", type text}, {"Task ID", Int64.Type}, {"Type", type text}, {"Start time", type datetime}, {"Finish time", type datetime}}),
#"Filled Down" = Table.FillDown(#"Changed Type1",{"Start time"})
in
#"Filled Down"
(copy and paste in a blank query)
Here is some updated DAX for the two columns.
NewStartTime =
VAR vThisStart = CarTasks[Start time]
VAR vThisTaskID = CarTasks[Task ID]
VAR vMinStartThisTask =
CALCULATE (
MIN ( CarTasks[Start time] ),
ALLEXCEPT ( CarTasks, CarTasks[Car ID], CarTasks[Task ID] )
)
VAR vMaxPrevTaskFinish =
CALCULATE (
MAX ( CarTasks[Finish time] ),
ALLEXCEPT ( CarTasks, CarTasks[Car ID] ),
CarTasks[Task ID] < vThisTaskID
)
VAR vMinStartOverall =
MIN ( vThisStart, vMinStartThisTask )
VAR vResult =
IF (
CarTasks[Start time] = CarTasks[Finish time],
vMaxPrevTaskFinish,
MAX ( vMaxPrevTaskFinish, vMinStartOverall )
)
RETURN
vResult
NewFinishTime =
IF (
CarTasks[Type] = "T",
CALCULATE (
MAX ( CarTasks[Finish time] ),
ALLEXCEPT ( CarTasks, CarTasks[Car ID], CarTasks[Task ID] )
),
CarTasks[Finish time]
)
Pat
Hi @Anonymous ,
If I understood well, this should work:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("jZDPCoAgDIdfRTwHupmJ3apHqFt06BV6/0OT/hDbikBEPvf525xnO6yb6WxlAQPtYznVLjn0iAZ868vSYLRL9bRr2ietEASE8uRPGzWbZ0e1cypMCgSe3bzYEHR42T1dY5Cd+0xt81874CP7y4Y25j+2nPvMbhSI3JZz34USBrKXHQ==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Car ID" = _t, #"Task ID" = _t, Type = _t, #"Start time1" = _t, #"Finish time1" = _t]),
#"Changed Type" = Table.Buffer(Table.TransformColumnTypes(Source,{{"Start time1", type datetime}, {"Finish time1", type datetime}})),
#"Grouped Rows" = Table.Group(#"Changed Type", {"Task ID", "Car ID"}, {{"Rows", each _, type table [Car ID=nullable text, Task ID=nullable text, Type=nullable text, Start time1=nullable text, Finish time1=nullable text]}}),
#"Align StartDate for Twin" = Table.TransformColumns(
#"Grouped Rows",
{
{"Rows", (OuterTable)=> Table.AddColumn(OuterTable,"Start time2", (Innertable)=> List.Min(OuterTable[Start time1]), type datetime)}
}),
#"Align EndDate for twin" = Table.TransformColumns(
#"Align StartDate for Twin",
{
{"Rows", (OuterTable)=> Table.AddColumn(OuterTable,"Finish time", (Innertable)=> List.Max(OuterTable[Finish time1]), type datetime)}
}),
Custom1 = Table.Combine(#"Align EndDate for twin"[Rows]),
#"Removed Other Columns" = Table.SelectColumns(Custom1,{"Car ID", "Task ID", "Type", "Start time2", "Finish time"}),
#"Grouped Rows1" = Table.Group(#"Removed Other Columns", {"Car ID"}, {{"Rows", each _, type table [Car ID=nullable text, Task ID=nullable text, Type=nullable text, Start time1=nullable datetime, Finish time1=nullable datetime, Start time=datetime]}}),
#"Added Index Column" = Table.TransformColumns(#"Grouped Rows1", {{"Rows", each Table.AddIndexColumn(_, "Index",0)}}),
#"Added Start time" = Table.TransformColumns(#"Added Index Column",
{
{"Rows", each Table.AddColumn(_,
"Start time",
(a)=>
try
if a[Task ID] = [Task ID]{a[Index]-1}
then null
else
if a[Start time2] > [Finish time]{a[Index]-1}
then
if a[Start time2] < a[Finish time]
then a[Start time2]
else [Finish time]{a[Index]-1}
else [Finish time]{a[Index]-1}
otherwise
a[Start time2])
}
}),
#"Expanded Rows" = Table.ExpandTableColumn(#"Added Start time", "Rows", {"Task ID", "Type", "Start time2", "Finish time", "Start time"}, {"Task ID", "Type", "Start time2", "Finish time", "Start time"}),
#"Removed Other Columns1" = Table.SelectColumns(#"Expanded Rows",{"Car ID", "Task ID", "Type", "Finish time", "Start time"}),
#"Reordered Columns" = Table.ReorderColumns(#"Removed Other Columns1",{"Car ID", "Task ID", "Type", "Start time", "Finish time"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Reordered Columns",{{"Car ID", type text}, {"Task ID", Int64.Type}, {"Type", type text}, {"Start time", type datetime}, {"Finish time", type datetime}}),
#"Filled Down" = Table.FillDown(#"Changed Type1",{"Start time"})
in
#"Filled Down"
(copy and paste in a blank query)
It is remarkable!! It works!!!
@latimeria Thank you so much. Please let me have a look first, will try in a couple of hours and have to get familiar with m-language first
While it is possible to do this in the query editor, it would be easier to do it in DAX calculated columns. Below is an example. It works to generate your desired start times for all but task 126 (I don't understand why that one needs to be updated). Hopefully you can adapt the logic to get what you need. A simpler expression can be made for the finished time.
NewStartTime =
VAR vThisStart = CarTasks[Start time]
VAR vThisTaskID = CarTasks[Task ID]
VAR vMinStartThisTask =
CALCULATE (
MIN ( CarTasks[Start time] ),
ALLEXCEPT ( CarTasks, CarTasks[Car ID], CarTasks[Task ID] )
)
VAR vMaxPrevTaskFinish =
CALCULATE (
MAX ( CarTasks[Finish time] ),
ALLEXCEPT ( CarTasks, CarTasks[Car ID] ),
CarTasks[Task ID] < vThisTaskID
)
VAR vMinStartOverall =
MIN ( vThisStart, vMinStartThisTask )
VAR vResult =
MAX ( vMaxPrevTaskFinish, vMinStartOverall )
RETURN
vResult
Pat
@ppm1 wrote:
VAR vThisTaskID = CarTasks[Task ID]
VAR vMaxPrevTaskFinish =
CALCULATE (
MAX ( CarTasks[Finish time] ),
ALLEXCEPT ( CarTasks, CarTasks[Car ID] ),
CarTasks[Task ID] < vThisTaskID
)
It is brilliant, I have never thought that DAX is so powerful till you showed me this simple and clear code!
But I am wondering why have to include "CarTasks[Task ID] < vThisTask"? As from the "VAR vThisTaskID = CarTasks[Task ID]", "CarTasks[Task ID] < vThisTask" will become "CarTasks[Task ID] < CarTasks[Task ID]".
And for the task 126, if it is not allowed to have "Start time" = "Finish time", otherwise, current "Start time" should be replaced by the last "Finish time". What should I add to the logic?
Many Thanks!
Here is some updated DAX for the two columns.
NewStartTime =
VAR vThisStart = CarTasks[Start time]
VAR vThisTaskID = CarTasks[Task ID]
VAR vMinStartThisTask =
CALCULATE (
MIN ( CarTasks[Start time] ),
ALLEXCEPT ( CarTasks, CarTasks[Car ID], CarTasks[Task ID] )
)
VAR vMaxPrevTaskFinish =
CALCULATE (
MAX ( CarTasks[Finish time] ),
ALLEXCEPT ( CarTasks, CarTasks[Car ID] ),
CarTasks[Task ID] < vThisTaskID
)
VAR vMinStartOverall =
MIN ( vThisStart, vMinStartThisTask )
VAR vResult =
IF (
CarTasks[Start time] = CarTasks[Finish time],
vMaxPrevTaskFinish,
MAX ( vMaxPrevTaskFinish, vMinStartOverall )
)
RETURN
vResult
NewFinishTime =
IF (
CarTasks[Type] = "T",
CALCULATE (
MAX ( CarTasks[Finish time] ),
ALLEXCEPT ( CarTasks, CarTasks[Car ID], CarTasks[Task ID] )
),
CarTasks[Finish time]
)
Pat
Thank you so much!!! And thanks for adding an additional "Finish time" column.
@ppm1 Thanks so much. Let me implement this logic tonight, in a few hours. I need to get familiar with the DAX first.
To your question about task 126, it is because "start time" can't be the same as "finish time"...... probably I didn't make it clear in the beginning.
Assume the row order is correct
Can't do that. Add an index column at the earliest possible opportunity.
To fix it, it will take the "Finish time" of the last "Task ID" and replace the current "Start time".
Please explain the rationale behind this rule. Seems to be slightly weird. It also doesn't seem to match your indicated expected result.
@lbendlin wrote:
Please explain the rationale behind this rule. Seems to be slightly weird. It also doesn't seem to match your indicated expected result.
Because the source is problematic at only the "Start time", but no fault at the "Finish time", provided that every task is in sequence, meaning no time overlapping between each other.
So once I spot a fault in "Start time", the remedy is to modify the "Start time" by taking out the overlapping session. For those records without overlapping, their "Start time" would be remained unchanged.
Is it possible to implement this idea in PowerBI PowerQuery without using python?
Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City
Check out the April 2024 Power BI update to learn about new features.
User | Count |
---|---|
109 | |
102 | |
86 | |
77 | |
70 |
User | Count |
---|---|
120 | |
108 | |
98 | |
83 | |
77 |