Probabilizer Training Procedure

This procedure trains a probabilizer model and stores the model file to disk.

The probabilizer transforms the output of a classifier, which may have any range, into a calibrated pseudo-probability that can be used for value calculations. It is used for five main purposes:

When there are multiple classifiers, or multiple types of classifiers, whose outputs need to be comparable. For example, if you want to compare the output of a decision tree with probabilistic buckets with the output of a neural network with a softmax activation function, you need to transform them into comparable quantities (i.e., probabilities). The probabilizer allows you to do this.
When you have trained a classifier that outputs something other than a probability, and you need to turn this into a probability.
When you need to set thresholds that are independent of the type of classifier used. For example, if you are implementing business logic that says that only applications with a 70% probability of being fraudulent may be skipped, you need a way of turning the "0.1033" from the classifier into "71.2% probability of being fraudulent".
When you have trained a classifier on biased data, for example by sampling or weighting the positive and negative examples differently, and you need to correct for this bias. As an example, imagine that you have 1 million examples of browsing sessions that didn't result in the purchase of a product, and 10,000 that did. You may choose to train a classifier on the 10,000 positive examples but sample 20,000 of the negative examples. In this dataset, the purchase prior is 33% but in the real data, it's around 1%. The probabilizer can help correct for the bias.
When a classifier is frequently retrained, but the output from one training to the next needs to be consistent (for example, it feeds into the same business logic or a subsequent processing step that is retrained less frequently).
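
The sampling example above can be worked through numerically. The following is a minimal sketch, assuming the probabilizer's training rows were sampled the same way as in that example; the variable names are illustrative only. It computes per-class weights that could be supplied through the `weight` column of `trainingData` so that the weighted prior matches the real one.

```python
# Counts from the browsing-session example above.
n_pos_total, n_neg_total = 10_000, 1_000_000
n_pos_sampled, n_neg_sampled = 10_000, 20_000

# Prior of the positive class in the sampled data and in the full data.
prior_sampled = n_pos_sampled / (n_pos_sampled + n_neg_sampled)
prior_real = n_pos_total / (n_pos_total + n_neg_total)

# Each sampled row is weighted by the number of original rows it stands for.
weight_pos = n_pos_total / n_pos_sampled   # every positive was kept
weight_neg = n_neg_total / n_neg_sampled   # 1 in 50 negatives was kept

# With these weights, the weighted prior equals the real prior.
weighted_pos = n_pos_sampled * weight_pos
weighted_neg = n_neg_sampled * weight_neg
prior_reweighted = weighted_pos / (weighted_pos + weighted_neg)

print(f"sampled prior:    {prior_sampled:.4f}")
print(f"real prior:       {prior_real:.4f}")
print(f"reweighted prior: {prior_reweighted:.4f}")
```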

Algorithm

The probabilizer training uses a generalized linear model to learn a monotonic transformation of the output of the classifier onto a probability space.
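
To make the idea concrete, here is a minimal sketch of a one-feature generalized linear model with a logit link, fitted by Newton-Raphson on (score, label, weight) triples. It is an illustration of the technique, not MLDB's implementation: the function names, the solver, and the toy data are all assumptions made for this example.

```python
import math

def fit_logit_calibrator(scores, labels, weights=None, iters=25):
    """Fit p = sigmoid(a * score + b) by weighted Newton-Raphson."""
    if weights is None:
        weights = [1.0] * len(scores)
    a, b = 0.0, 0.0
    for _ in range(iters):
        # Gradient (g_*) and Hessian (h_*) of the weighted negative log-likelihood.
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for s, y, w in zip(scores, labels, weights):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            g_a += w * (p - y) * s
            g_b += w * (p - y)
            v = w * p * (1.0 - p)
            h_aa += v * s * s
            h_ab += v * s
            h_bb += v
        det = h_aa * h_bb - h_ab * h_ab
        if abs(det) < 1e-12:
            break  # Hessian is numerically singular; stop updating.
        # Newton step: solve the 2x2 system H * delta = g.
        a -= (h_bb * g_a - h_ab * g_b) / det
        b -= (h_aa * g_b - h_ab * g_a) / det
    return a, b

def calibrate(score, a, b):
    """Map a raw classifier score to a pseudo-probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Toy held-out scores with overlapping classes (hypothetical data).
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, -1.0, 1.0, 0.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1]
a, b = fit_logit_calibrator(scores, labels)
print(a, b, calibrate(0.1033, a, b))
```

Because the model has a single slope and an intercept, the learned mapping is monotonic in the score: with a positive slope, a higher classifier output always maps to a higher probability.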

Configuration

A new procedure of type probabilizer.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "probabilizer.train",
    "params": {
        "trainingData": <InputQuery>,
        "link": <ML::Link_Function>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default	Description
trainingData InputQuery	SQL query that specifies the scores, labels and optional weights for the probabilizer training procedure. The query should be of the form `select x as score, y as label from ds`. The select expression must contain the `score` and `label` columns and may contain a `weight` column:
`score`: output of a classifier function applied to that row.
`label`: one boolean (0 or 1), so training a probabilizer only works for a classifier trained with mode `boolean`. Rows with null labels will be ignored.
`weight` (optional): relative importance of examples. It must be a real number. A weight of 2.0 is equivalent to including the identical row twice in the training dataset. If `weight` is not specified, each row will have a weight of 1. Rows with a null weight will cause a training error. Weights can be used to counteract the effect of sampling or weighting over the dataset that the probabilizer is trained on.
The query must not contain `GROUP BY` or `HAVING` clauses and, unlike most select expressions, this one can only select whole columns, not expressions involving columns. So `X` will work, but not `X + 1`. If you need derived values in the query, create a dataset with the derived columns as a previous step and use a query on that dataset instead.
link ML::Link_Function `"LOGIT"`	Link function to use.
modelFileUrl Url	URL where the model file (with extension '.prb') should be saved. This file can be loaded by the `probabilizer` function type. This parameter is optional unless the `functionName` parameter is used.
functionName string	If specified, an instance of the `probabilizer` function type of this name will be created using the trained model. Note that to use this parameter, the `modelFileUrl` must also be provided.
runOnCreation bool `true`	If true, the procedure will be run immediately. The response will contain an extra field called `firstRun` pointing to the URL of the run.
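
As an illustration of the parameters above, the following sketch builds a filled-in configuration and enforces the documented rule that `functionName` requires `modelFileUrl`. The helper `probabilizer_train_config`, the dataset name `scored_holdout`, the file URL, the function name and the procedure id are all hypothetical, chosen only for this example.

```python
def probabilizer_train_config(training_data, link="LOGIT", model_file_url=None,
                              function_name=None, run_on_creation=True):
    """Build the JSON body for a probabilizer.train procedure (hypothetical helper)."""
    # Documented constraint: functionName can only be used with modelFileUrl.
    if function_name is not None and model_file_url is None:
        raise ValueError("functionName requires modelFileUrl to be provided")
    params = {
        "trainingData": training_data,
        "link": link,
        "runOnCreation": run_on_creation,
    }
    if model_file_url is not None:
        params["modelFileUrl"] = model_file_url
    if function_name is not None:
        params["functionName"] = function_name
    return {"type": "probabilizer.train", "params": params}

config = probabilizer_train_config(
    # Whole columns only: no expressions, GROUP BY or HAVING.
    training_data="select score, label, weight from scored_holdout",
    model_file_url="file://models/fraud.prb",
    function_name="fraud_probabilizer",
)
print(config)

# With an MLDB connection object `mldb`, as in the call shown earlier,
# the procedure would then be created with:
# mldb.put("/v1/procedures/train_fraud_probabilizer", config)
```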

Enumeration `ML::Link_Function`

Value	Description
`LOGIT`	Logit; good generic link for probabilistic outputs
`PROBIT`	Probit; advanced usage
`COMP_LOG_LOG`	Complementary log-log; also good for probabilistic outputs
`LINEAR`	Linear (identity); training solves linear least squares
`LOG`	Logarithm; good for transforming the output of boosting
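
For reference, the sketch below shows the textbook inverse of each link, that is, how a linear predictor `x` is mapped back to the output scale. It assumes that the values of `ML::Link_Function` follow the standard generalized linear model definitions; the function `inverse_link` is a hypothetical name used for illustration and is not part of MLDB.

```python
import math

def inverse_link(link, x):
    """Map a linear predictor x to the output scale (textbook GLM definitions)."""
    if link == "LOGIT":
        return 1.0 / (1.0 + math.exp(-x))                # logistic sigmoid
    if link == "PROBIT":
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))  # standard normal CDF
    if link == "COMP_LOG_LOG":
        return 1.0 - math.exp(-math.exp(x))              # complementary log-log
    if link == "LINEAR":
        return x                                         # identity
    if link == "LOG":
        return math.exp(x)                               # exponential
    raise ValueError("unknown link function: " + link)

for name in ["LOGIT", "PROBIT", "COMP_LOG_LOG", "LINEAR", "LOG"]:
    print(name, inverse_link(name, 0.0))
```

Note that only `LOGIT`, `PROBIT` and `COMP_LOG_LOG` are guaranteed to produce values between 0 and 1 for every input; under these definitions `LINEAR` and `LOG` can produce values outside that range.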

Important note

A probabilizer is trained on the output of a classifier applied over a dataset. The dataset should not have been used to train the classifier, as this will result in a biased probabilizer.
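
One simple way to respect this is to set aside part of the labelled rows before training the classifier and to use only those rows for the probabilizer. The sketch below assumes rows are identified by ids; the helper name and the 20% fraction are arbitrary choices for illustration.

```python
import random

def split_for_calibration(row_ids, calibration_fraction=0.2, seed=0):
    """Split row ids into disjoint classifier-training and probabilizer-training sets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    n_cal = int(len(shuffled) * calibration_fraction)
    # The first n_cal ids are held out for the probabilizer, the rest train the classifier.
    return shuffled[n_cal:], shuffled[:n_cal]

classifier_ids, probabilizer_ids = split_for_calibration(range(100))
print(len(classifier_ids), len(probabilizer_ids))
```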

