An in depth look at the Automated ML entities in Power BI dataflows

marekr · 04-01-2019

Automated ML integration with Power BI dataflows allows training and applying Binary Prediction, General Classification and Regression models. The ML models are internally represented as specially marked dataflow entities. I’ll describe how the ML-related entities are defined in the M language and how they can be edited using the Power Query editor.

The following diagram illustrates the entities generated during the ML process and the dependencies between them:

As the diagram shows, training an ML model creates two additional entities besides the model itself: a training data entity and a testing data entity.

Training data entity

Let's take an example of training a Binary Prediction model on an entity named “Customers” with the selected columns “City”, “State”, “CreditCardBalance”, “WebEngagementScore” and “CustomerStatus”. The training data entity is created with the following M definition:

let
  Source = Customers,
  #"Selected columns" = Table.SelectColumns(Source, {"City", "State", "CreditCardBalance", "WebEngagementScore", "CustomerStatus"}),
  #"Removed nulls" = Table.SelectRows(#"Selected columns", each [CustomerStatus] <> null),
  #"Sampled input" = AI.SampleStratifiedWithHoldout("CustomerStatus", Table.RowCount(#"Removed nulls"), #"Removed nulls")
in
  #"Sampled input"

Note that the training data entity does two things.

First, it removes all rows that have nulls in the “historical outcome field” selected during model creation in the UI (here “CustomerStatus”). A supervised machine learning model can only learn from rows in which the outcome field is set.

Second, it samples the records of the entity using the AI.SampleStratifiedWithHoldout function, which takes the following parameters:
1. Historical outcome field name (text)
2. Number of records in the input table (number)
3. The input table (table)
4. Desired sample size (number), optional
5. Desired ratio between training and test holdout (number), optional

It returns a table with the same columns as the input table plus a Boolean “__IsTraining__” column indicating whether a given record should be used for training (true) or testing (false).
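To make the role of the “__IsTraining__” column concrete, here is a minimal sketch that splits a sampled table into its training and testing parts and counts the rows in each. The table is a hand-built stand-in, so the rows, the reduced column set and the step names are all hypothetical; a real training data entity would come from AI.SampleStratifiedWithHoldout.

```powerquery
let
    // Hypothetical stand-in for the output of AI.SampleStratifiedWithHoldout; the rows are made up.
    Sampled = #table(
        type table [City = text, CustomerStatus = logical, __IsTraining__ = logical],
        {
            {"Seattle", true, true},
            {"Redmond", false, true},
            {"Tacoma", true, false},
            {"Spokane", false, true}
        }
    ),
    // Rows flagged true are meant for training, rows flagged false for testing.
    TrainingRows = Table.SelectRows(Sampled, each [__IsTraining__] = true),
    TestingRows = Table.SelectRows(Sampled, each [__IsTraining__] = false),
    Summary = [Training = Table.RowCount(TrainingRows), Testing = Table.RowCount(TestingRows)]
in
    Summary
```

For this made-up table the result is the record [Training = 3, Testing = 1].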

ML model entity

The model entity is created with the following M definition:

let
  Source = #"CustomerChurnModel Training Data",
  #"Invoked TrainPrediction" = AIInsights.Contents(){[Key = "AI.Execute"]}[Data]("AI.TrainPrediction", "Regular", [labelColumnName = "CustomerStatus", data = Source]),
  #"Selected training schema columns" = Table.SelectColumns(#"Invoked TrainPrediction", {"TrainingId", "Model", "Stats", "GlobalExplanation", "TrainingSchema", "TrainingAUC", "LabelColumn"})
in
  #"Selected training schema columns"

It uses the training data entity as input and invokes the “AI.TrainPrediction” transform on the AI workload of your premium capacity. Note that the invocation could be written in a simpler form:

#"Invoked TrainPrediction" = AIInsights.Contents(){[Key = "AI.TrainPredictiontexttable"]}[Data]("CustomerStatus", Source)

AIInsights.Contents() returns a table of the AI transforms supported by the AI workload, with columns such as Name, Data (the actual M function invoking the transform) and Key (the unique identifier of the transform). That table includes transforms generated for the Azure ML services you have access to and for Cognitive Services. The “AI.Execute” transform used in the query generated by the “Add ML model” wizard is a wrapper that calls other transforms by passing their parameters in a record. Because the parameters are passed by name, new optional parameters can be added to existing transforms without breaking existing models whose generated queries do not pass values for them.
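If you want to see which transforms are available to you, a query along the following lines should list them. This is a sketch that assumes it runs in a dataflow on a capacity with the AI workload enabled, and that the “Name” and “Key” columns mentioned above are present; the step names are my own.

```powerquery
let
    // Table of AI transforms exposed by the AI workload.
    Transforms = AIInsights.Contents(),
    // Keep only the descriptive columns; "Data" holds the M function itself.
    NamesAndKeys = Table.SelectColumns(Transforms, {"Name", "Key"})
in
    NamesAndKeys
```

A single transform can then be picked by its key, as the generated queries do with {[Key = "AI.Execute"]}[Data].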

The model entity created by the AI.TrainPrediction transform is currently a table with a single record. This table also contains the “Global explanations” for the model, which the report is built on. In the future there may be multiple records corresponding to versions of the model.
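Since the model entity is currently a single-record table, its fields can be read with an ordinary row lookup. The sketch below assumes the model entity is named CustomerChurnModel, as in this example, and that the dataflow has been refreshed so the entity is populated. The field names come from the column list selected in the model entity definition; the step names are my own.

```powerquery
let
    // First (and currently only) record of the model entity.
    ModelRecord = CustomerChurnModel{0},
    // Pick a few of the columns listed in the model entity definition.
    Summary = Record.SelectFields(ModelRecord, {"TrainingId", "TrainingAUC", "LabelColumn"})
in
    Summary
```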

Sample output of AI.TrainPrediction:

Testing data entity

The testing data entity allows you to explore the records used by AutoML for model evaluation and hyperparameter tuning. For the above scenario its generated M definition would be:

let
  Source = #"CustomerChurnModel Training Data",
  #"Filtered rows" = Table.SelectRows(Source, each ([__IsTraining__] = false)),
  #"Invoked Scoring" = CustomerChurnModel.Score(#"Filtered rows", "CustomerChurnModelOutput", 0.5)
in
  #"Invoked Scoring"

It uses the training data entity as input and selects the rows where the “__IsTraining__” column added during sampling is set to false. It then applies the scoring function, which is defined as a separate query:

let
  ApplyScoringFunction = (inputQuery as table, newColumn as text, decisionThreshold as number) => let
  MlModel = CustomerChurnModel,
  MlModelJson = try Text.FromBinary(Json.FromValue(MlModel{0})) otherwise "InvalidModel",
  Source = inputQuery,
  SelectedBaseEntityColumns = {"City", "State", "CreditCardBalance", "WebEngagementScore", "CustomerStatus"},
  InputRowCount = Table.RowCount(Source),
  InputTableType = Value.Type(Source),
  SelectedColumnsTypes = List.Transform(SelectedBaseEntityColumns, each Type.TableColumn(InputTableType, _)),
  ScoringFunction = 
        let
            ScoringFunctionScalarType = type function (row as record) as any,
            VectorizedScoringFunction = (input as table) =>
              let
                ExpandedColumns = Table.ExpandRecordColumn(input, "row", SelectedBaseEntityColumns),
                ExpandedColumnsWithTypes = Table.TransformColumnTypes(ExpandedColumns, List.Zip({SelectedBaseEntityColumns, SelectedColumnsTypes})),
                ErrorList = List.Repeat({[Output = null]}, InputRowCount),
                Result = if MlModelJson <> "InvalidModel" then (try Table.ToRecords(AIInsights.Contents(){[Key = "AI.Execute"]}[Data]("AI.ScorePrediction", "Vectorized", [data = ExpandedColumns, scoreParameters = MlModelJson])) otherwise ErrorList) else ErrorList
              in
                Result,
            ScalarVectorScoringFunction = Function.ScalarVector(ScoringFunctionScalarType, VectorizedScoringFunction)
        in
            ScalarVectorScoringFunction,
  AddScoringColumn = Table.AddColumn(Source, newColumn, each ScoringFunction(_)),
  ExpandResultColumns = Table.ExpandRecordColumn(AddScoringColumn, newColumn, {"PredictionScore", "PredictionExplanation"}, {Text.Combine({newColumn, "PredictionScore"}, "."), Text.Combine({newColumn, "PredictionExplanation"}, ".")}),
  LabeledOutput = Table.AddColumn(ExpandResultColumns, Text.Combine({newColumn, "Outcome"}, "."), each Record.Field(_, Text.Combine({newColumn, "PredictionScore"}, ".")) >= decisionThreshold * 100),
  ReplacedErrors = Table.ReplaceErrorValues(LabeledOutput, {{Text.Combine({newColumn, "Outcome"}, "."), null}, {Text.Combine({newColumn, "PredictionScore"}, "."), null}, {Text.Combine({newColumn, "PredictionExplanation"}, "."), null}}),
  TransformTypes = Table.TransformColumnTypes(ReplacedErrors, {{Text.Combine({newColumn, "Outcome"}, "."), type logical}, {Text.Combine({newColumn, "PredictionScore"}, "."), type text}, {Text.Combine({newColumn, "PredictionExplanation"}, "."), type text}})
in
  TransformTypes
in
  ApplyScoringFunction

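Note that the decision threshold passed by the generated testing data entity is 0.5. The scoring function invokes the “AI.ScorePrediction” transform on the AI workload of your premium capacity, passing the ML model and the input table. Three new columns are added to the output, each prefixed with the value of newColumn and a dot (here “CustomerChurnModelOutput.”): Outcome (Boolean), PredictionScore and PredictionExplanation. Outcome is true when PredictionScore is at least the decision threshold multiplied by 100.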
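The thresholding step can be tried in isolation. The sketch below reproduces only the comparison from the LabeledOutput step of the scoring function on a few made-up scores. It assumes scores lie on the 0 to 100 scale that the “decisionThreshold * 100” expression implies; the table, its column names and its values are hypothetical.

```powerquery
let
    decisionThreshold = 0.5,
    // Hypothetical scores; real ones come from AI.ScorePrediction.
    Scores = #table(
        type table [Customer = text, PredictionScore = number],
        {{"A", 12}, {"B", 50}, {"C", 87}}
    ),
    // Same comparison as in the generated scoring function.
    WithOutcome = Table.AddColumn(
        Scores,
        "Outcome",
        each [PredictionScore] >= decisionThreshold * 100,
        type logical
    )
in
    WithOutcome
```

With a threshold of 0.5, customer A gets Outcome false, while B (exactly 50) and C get true.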

Training occurs during dataflow refresh

The “Add ML model” wizard creates the entities described above in the dataflow, but training actually occurs only during refresh, when the resulting model is materialized in the model entity in the dataflow store (Azure Data Lake). After training, the model can be applied to another entity with a matching schema.

Model application

Applying the trained ML model to an entity, say “NewCustomers”, creates another entity named after the source entity with the word “enriched” and the ML model name appended, e.g. “NewCustomers enriched CustomerChurnModel”. The definition of such an enriched entity would be:

let
  Source = NewCustomers,
  #"Invoked CustomerChurnModel.Score" = CustomerChurnModel.Score(Source, "CustomerChurnModel", 0.5)
in
  #"Invoked CustomerChurnModel.Score"

The enriched entity definition uses the same scoring function as the testing data entity.
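Because the decision threshold is an ordinary argument of the scoring function, it should be possible to change it by editing the enriched entity in the Power Query editor. The variant below is a sketch of such an edit; the value 0.7 is an arbitrary illustration, not a recommendation.

```powerquery
let
    Source = NewCustomers,
    // 0.7 is an illustrative value; the generated query passes 0.5.
    #"Invoked CustomerChurnModel.Score" = CustomerChurnModel.Score(Source, "CustomerChurnModel", 0.7)
in
    #"Invoked CustomerChurnModel.Score"
```

Given the comparison in the scoring function, this would mark a record's Outcome as true only when its PredictionScore is at least 70.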

Please note: The application of the model to an entity also occurs during dataflow refresh when the enriched entity is materialized in the dataflow store.

Marek Rycharski | Principal Software Engineer at Microsoft Power BI (Artificial Intelligence) team
