Probabilizer Training Procedure

This procedure trains a probabilizer model and stores the model file to disk.

The probabilizer plays the role of transforming the output of a classifier, which may have any range, into a calibrated pseudo-probability that can be used for value calculations. It is used for five main purposes:

  1. When there are multiple classifiers, or multiple types of classifiers, whose outputs need to be comparable. For example, if you want to compare the output of a decision tree with probabilistic buckets with the output of a neural network with a softmax activation function, you need to transform them into comparable quantities (i.e., probabilities). The probabilizer allows you to do this.
  2. When you have trained a classifier that outputs something other than a probability, and you need to turn this into a probability.
  3. When you need to set thresholds that are independent of the type of classifier used. For example, if you are implementing business logic that says that only applications with at least a 70% probability of being fraudulent may be skipped, you need a way of turning the "0.1033" from the classifier into "71.2% probability of being fraudulent".
  4. When you have trained a classifier on biased data, for example by sampling or weighting the positive and negative examples differently, and you need to correct for this bias. As an example, imagine that you have 1 million examples of browsing sessions that didn't result in the purchase of a product, and 10,000 that did. You may choose to train a classifier on the 10,000 positive examples but sample 20,000 of the negative examples. In this dataset, the purchase prior is 33% but in the real data, it's around 1%. The probabilizer can help correct for the bias.
  5. When a classifier is frequently retrained, but the output from one training to the next needs to be consistent (for example, it feeds into the same business logic or a subsequent processing step that is retrained less frequently).
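
The sampling bias in purpose 4 can be made concrete with a little arithmetic. The sketch below uses the counts from that example and a standard odds-rescaling correction for down-sampled negatives. It is an illustration of the idea only, not a description of what the probabilizer computes internally, and the function name `correct_for_sampling` is hypothetical.

```python
# Counts taken from the browsing-session example in purpose 4.
real_pos, real_neg = 10_000, 1_000_000
sampled_pos, sampled_neg = 10_000, 20_000

# Purchase prior in the sampled training set and in the real data.
sampled_prior = sampled_pos / (sampled_pos + sampled_neg)
real_prior = real_pos / (real_pos + real_neg)

# Each negative kept in the sample stands in for this many real negatives.
neg_weight = real_neg / sampled_neg


def correct_for_sampling(p_sampled, neg_weight):
    """Map a probability that is valid on the sampled data back to the
    real data, assuming only the negatives were down-sampled and each
    kept negative represents neg_weight real negatives."""
    odds = p_sampled / (1.0 - p_sampled)
    return odds / (odds + neg_weight)


print(round(sampled_prior, 4), round(real_prior, 4))
print(correct_for_sampling(sampled_prior, neg_weight))
```

The same factor (`neg_weight`) is what one would pass as the weight of negative rows if the correction were done through example weights rather than after the fact.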

Algorithm

The probabilizer training uses a generalized linear model to learn a monotonic transformation of the output of the classifier onto a probability space.
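
To give a feel for what fitting such a model involves, here is a minimal pure-Python sketch that fits p = sigmoid(a * score + b) with Newton steps, which corresponds to a generalized linear model with a logit link and a single input. It is an assumption-laden illustration: MLDB's actual fitting code, regularization and numerical safeguards are not described in this document, and the names `fit_logit_probabilizer` and `probabilize` are hypothetical.

```python
import math


def fit_logit_probabilizer(scores, labels, weights=None, iterations=25):
    """Fit p = sigmoid(a * score + b) by Newton's method on the weighted
    negative log-likelihood. Returns the pair (a, b)."""
    if weights is None:
        weights = [1.0] * len(scores)
    a, b = 0.0, 0.0
    for _ in range(iterations):
        # Accumulate the gradient (g_*) and Hessian (h_*) entries.
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y, w in zip(scores, labels, weights):
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))
            g_a += w * (p - y) * x
            g_b += w * (p - y)
            v = w * p * (1.0 - p)
            h_aa += v * x * x
            h_ab += v * x
            h_bb += v
        det = h_aa * h_bb - h_ab * h_ab
        if abs(det) < 1e-12:
            break  # Hessian is (nearly) singular; stop updating.
        # Solve the 2x2 system H * step = g and take the Newton step.
        a -= (h_bb * g_a - h_ab * g_b) / det
        b -= (h_aa * g_b - h_ab * g_a) / det
    return a, b


def probabilize(score, a, b):
    """Apply the fitted transformation to a raw classifier score."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Toy, non-separable data: raw classifier scores and boolean labels.
scores = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_logit_probabilizer(scores, labels)
print([round(probabilize(s, a, b), 3) for s in scores])
```

Because the fitted slope `a` is positive on this data, the mapping from score to probability is monotonically increasing, so the ranking produced by the classifier is preserved.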

Configuration

A new procedure of type probabilizer.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "probabilizer.train",
    "params": {
        "trainingData": <InputQuery>,
        "link": <ML::Link_Function>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

trainingData
InputQuery

SQL query that specifies the scores, labels and optional weights for the probabilizer training procedure. The query should be of the form select x as score, y as label from ds.

The select expression must contain the score and label columns, and may contain an optional weight column:

  • score: output of a classifier function applied to that row.
  • label: a boolean (0 or 1). Training a probabilizer therefore only works for a classifier trained in boolean mode. Rows with null labels will be ignored.
  • weight (optional): relative importance of each example. It must be a real number. A weight of 2.0 is equivalent to including the identical row twice in the training dataset. If the weight column is not specified, each row will have a weight of 1. Rows with a null weight will cause a training error. Weights can be used to counteract the effect of sampling or weighting in the dataset that the probabilizer is trained on.

The query must not contain GROUP BY or HAVING clauses and, unlike most select expressions, this one can only select whole columns, not expressions involving columns. So X will work, but not X + 1. If you need derived values in the query, create a dataset with the derived columns as a previous step and use a query on that dataset instead.

link
ML::Link_Function
"LOGIT"

Link function to use.

modelFileUrl
Url

URL where the model file (with extension '.prb') should be saved. This file can be loaded by the probabilizer function type. This parameter is optional unless the functionName parameter is used.

functionName
string

If specified, an instance of the probabilizer function type of this name will be created using the trained model. Note that to use this parameter, the modelFileUrl must also be provided.

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
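
As a concrete illustration of the parameters described above, the following sketch builds a filled-in configuration. The procedure id, dataset name (`held_out_scores`), column names (`raw_score`, `is_fraud`), model path and function name are all hypothetical placeholders. The final call is left commented out because it assumes an MLDB connection object named `mldb`, as in the template above.

```python
import json

procedure_id = "fraud_probabilizer_trainer"  # hypothetical id

config = {
    "type": "probabilizer.train",
    "params": {
        # Whole columns only, renamed to the expected score and label names.
        "trainingData": "select raw_score as score, is_fraud as label "
                        "from held_out_scores",
        "link": "LOGIT",
        # Required here because functionName is also given.
        "modelFileUrl": "file://models/fraud_probabilizer.prb",
        "functionName": "fraud_probabilizer",
        "runOnCreation": True,
    },
}

print(json.dumps(config, indent=4))

# Assuming an MLDB connection object named mldb is available:
# mldb.put("/v1/procedures/" + procedure_id, config)
```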

Enumeration ML::Link_Function

Value, Description
LOGIT

Logit; a good generic link for probabilistic outputs

PROBIT

Probit, advanced usage

COMP_LOG_LOG

Also good for probabilistic

LINEAR

Linear; makes it solve linear least squares (identity)

LOG

Logarithm; good for transforming the output of boosting
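
For reference, the sketch below writes out the textbook generalized-linear-model definitions that these link names conventionally refer to, together with their inverses. This is an assumption about the usual meaning of the names. The exact formulas and numerical handling in MLDB's implementation are not specified in this document.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the probit link

# Link functions: map a mean value p to the linear predictor z.
LINKS = {
    "LOGIT": lambda p: math.log(p / (1.0 - p)),
    "PROBIT": lambda p: _STD_NORMAL.inv_cdf(p),
    "COMP_LOG_LOG": lambda p: math.log(-math.log(1.0 - p)),
    "LINEAR": lambda p: p,
    "LOG": lambda p: math.log(p),
}

# Inverse links: map the linear predictor z back to a mean value.
INVERSE_LINKS = {
    "LOGIT": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "PROBIT": lambda z: _STD_NORMAL.cdf(z),
    "COMP_LOG_LOG": lambda z: 1.0 - math.exp(-math.exp(z)),
    "LINEAR": lambda z: z,
    "LOG": lambda z: math.exp(z),
}

for name in LINKS:
    z = LINKS[name](0.25)
    print(name, round(z, 4), round(INVERSE_LINKS[name](z), 4))
```

Note that only the logit, probit and complementary log-log inverses are guaranteed to return values between 0 and 1. The linear and log inverses can produce values outside that range.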

Important note

A probabilizer is trained on the output of a classifier applied over a dataset. The dataset should not have been used to train the classifier, as this will result in a biased probabilizer.
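
One way to respect this is to split the rows deterministically before training anything, so that the classifier and the probabilizer never see the same rows. The helper below is a minimal sketch in plain Python. The function name `split_rows` and the use of a CRC32 hash of the row name are illustrative choices, not part of MLDB.

```python
import zlib


def split_rows(row_names, holdout_fraction=0.5):
    """Deterministically split row names into a classifier-training set
    and a held-out set reserved for training the probabilizer."""
    classifier_rows, probabilizer_rows = [], []
    for name in row_names:
        # A stable hash keeps each row in the same set across runs.
        bucket = zlib.crc32(name.encode("utf-8")) % 1000
        if bucket < holdout_fraction * 1000:
            probabilizer_rows.append(name)
        else:
            classifier_rows.append(name)
    return classifier_rows, probabilizer_rows


rows = ["row%d" % i for i in range(1000)]
classifier_rows, probabilizer_rows = split_rows(rows)
print(len(classifier_rows), len(probabilizer_rows))
```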

See also