Probabilizer Training Procedure
This procedure trains a probabilizer model and stores the model file to disk.
The probabilizer transforms the output of a classifier, which may have
any range, into a calibrated pseudo-probability that can be used for
value calculations. It is used for five main purposes:
- When there are multiple classifiers, or multiple types of classifiers,
whose outputs need to be comparable. For example, if you want to
compare the output of a decision tree with probabilistic buckets with
the output of a neural network with a softmax activation function, you
need to transform them into comparable quantities (i.e., probabilities).
The probabilizer allows you to do this.
- When you have trained a classifier that outputs something other than a
probability, and you need to turn this into a probability.
- When you need to set thresholds that are independent of the type of
classifier used. For example, if you are implementing business logic
that says that only applications with a 70% probability of being
fraudulent may be skipped, you need a way of turning the "0.1033" from
the classifier into "71.2% probability of being fraudulent".
- When you have trained a classifier on biased data, for example by sampling or
weighting the positive and negative examples differently, and you need
to correct for this bias. As an example, imagine that you have 1 million
examples of browsing sessions that didn't result in the purchase of a product,
and 10,000 that did. You may choose to train a classifier on the 10,000
positive examples but sample 20,000 of the negative examples. In this
dataset, the purchase prior is 33% but in the real data, it's around 1%. The
probabilizer can help correct for the bias.
- When a classifier is frequently retrained, but the output from one
training to the next needs to be consistent (for example, it feeds
into the same business logic or a subsequent processing step that
is retrained less frequently).
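The arithmetic behind the sampling example above can be sketched as follows. This is only an illustration of how per-row weights could undo the sampling bias, assuming the weights are then supplied through the weight column of trainingData. The variable names are hypothetical, and the counts are the ones quoted in the example.

```python
# Counts taken from the browsing-session example above.
total_pos, total_neg = 10_000, 1_000_000      # full data
sampled_pos, sampled_neg = 10_000, 20_000     # data actually sampled


def prior(pos, neg):
    """Fraction of positive examples among pos + neg rows."""
    return pos / (pos + neg)


sampled_prior = prior(sampled_pos, sampled_neg)   # roughly one third
true_prior = prior(total_pos, total_neg)          # roughly one percent

# Weight each sampled row by the number of original rows it stands for.
pos_weight = total_pos / sampled_pos
neg_weight = total_neg / sampled_neg

# With these weights, the weighted prior of the sample matches the real prior.
reweighted_prior = (sampled_pos * pos_weight) / (
    sampled_pos * pos_weight + sampled_neg * neg_weight
)

print(sampled_prior, true_prior, pos_weight, neg_weight, reweighted_prior)
```

Each sampled negative stands for 50 original negatives, so giving it a weight of 50 restores the real class balance.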
Algorithm
The probabilizer training uses a generalized linear model to learn a monotonic
transformation of the output of the classifier onto a probability space.
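As a rough illustration of the idea, the sketch below fits a one-feature generalized linear model with a logit link, p = sigmoid(a * score + b), by gradient descent on a weighted log loss. It is not MLDB's implementation: the function names, the optimizer and the toy data are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Inverse of the logit link, written to stay stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logit_calibrator(scores, labels, weights=None, steps=5000, lr=0.1):
    """Fit p = sigmoid(a * score + b) by gradient descent on weighted log loss."""
    if weights is None:
        weights = [1.0] * len(scores)
    total = sum(weights)
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y, w in zip(scores, labels, weights):
            err = sigmoid(a * s + b) - y   # derivative of log loss w.r.t. the linear term
            grad_a += w * err * s
            grad_b += w * err
        a -= lr * grad_a / total
        b -= lr * grad_b / total
    return a, b


def calibrate(score, a, b):
    """Map a raw classifier score to a pseudo-probability."""
    return sigmoid(a * score + b)


# Toy, non-separable data: higher scores are more often labelled 1.
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a, b = fit_logit_calibrator(scores, labels)
for s in (-2.0, 0.0, 2.0):
    print(s, calibrate(s, a, b))
```

Because the fitted slope is positive on this data, the mapping is monotonic: a higher raw score always yields a higher pseudo-probability.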
Configuration
A new procedure of type probabilizer.train named <id> can be created as follows:
mldb.put("/v1/procedures/"+<id>, {
    "type": "probabilizer.train",
    "params": {
        "trainingData": <InputQuery>,
        "link": <ML::Link_Function>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})
with the following key-value definitions for params:
Field, Type, Default | Description |
trainingData InputQuery
| SQL query that specifies the scores, labels and optional weights for the probabilizer training procedure. The query should be of the form select x as score, y as label from ds.
The select expression must contain these two columns:
- score: output of a classifier function applied to that row.
- label: a boolean (0 or 1). Training a probabilizer therefore only works for a classifier trained with mode boolean. Rows with null labels are ignored.
The select expression may also contain an optional third column:
- weight: relative importance of the example, as a real number. A weight of 2.0 is equivalent to including the identical row twice in the training dataset. If no weight column is specified, each row has a weight of 1. Rows with a null weight cause a training error. Weights can be used to counteract the effect of sampling or weighting over the dataset that the probabilizer is trained on.
The query must not contain GROUP BY or HAVING clauses and, unlike most select expressions, this one can only select whole columns, not expressions involving columns. So X will work, but not X + 1. If you need derived values in the query, create a dataset with the derived columns as a previous step and use a query on that dataset instead.
|
link ML::Link_Function "LOGIT" | Link function to use.
|
modelFileUrl Url
| URL where the model file (with extension '.prb') should be saved. This file can be loaded by the probabilizer function type. This parameter is optional unless the functionName parameter is used.
|
functionName string
| If specified, an instance of the probabilizer function type of this name will be created using the trained model. Note that to use this parameter, the modelFileUrl must also be provided.
|
runOnCreation bool true | If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
|
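A filled-in request might look like the sketch below. The helper probabilizer_train_payload is hypothetical, as are the dataset name scored_holdout, the model path, the function name and the procedure id. The sketch also assumes that trainingData accepts the SQL text as a plain string. The field names and the modelFileUrl/functionName dependency come from the table above.

```python
def probabilizer_train_payload(training_query, model_file_url=None,
                               function_name=None, link="LOGIT",
                               run_on_creation=True):
    """Build the request body for a probabilizer.train procedure (hypothetical helper)."""
    # functionName can only be used when modelFileUrl is also provided.
    if function_name is not None and model_file_url is None:
        raise ValueError("functionName requires modelFileUrl")
    params = {
        "trainingData": training_query,
        "link": link,
        "runOnCreation": run_on_creation,
    }
    if model_file_url is not None:
        params["modelFileUrl"] = model_file_url
    if function_name is not None:
        params["functionName"] = function_name
    return {"type": "probabilizer.train", "params": params}


payload = probabilizer_train_payload(
    # Whole columns only: no expressions, GROUP BY or HAVING.
    "select score, label, weight from scored_holdout",
    model_file_url="file://models/fraud.prb",   # hypothetical path
    function_name="fraud_probabilizer",         # hypothetical name
)

# With an MLDB connection object named mldb, the procedure would then be created with:
# mldb.put("/v1/procedures/train_fraud_probabilizer", payload)
print(payload)
```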
Enumeration ML::Link_Function
Value | Description |
LOGIT | Logit; good generic link for probabilistic outputs |
PROBIT | Probit; advanced usage |
COMP_LOG_LOG | Complementary log-log; also good for probabilistic outputs |
LINEAR | Linear; makes it solve linear least squares (identity) |
LOG | Logarithm; good for transforming the output of boosting |
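For orientation, the sketch below lists the standard textbook inverse of each link, that is, the function that maps the linear predictor back to the output scale. Pairing the enumeration names with these formulas is an assumption based on the usual GLM definitions, not a statement about MLDB's internals, and the INVERSE_LINKS name is hypothetical.

```python
import math

# Standard GLM inverse links: linear predictor x -> predicted value.
INVERSE_LINKS = {
    "LOGIT": lambda x: 1.0 / (1.0 + math.exp(-x)),                    # logistic sigmoid
    "PROBIT": lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0))),   # standard normal CDF
    "COMP_LOG_LOG": lambda x: 1.0 - math.exp(-math.exp(x)),           # complementary log-log
    "LINEAR": lambda x: x,                                            # identity
    "LOG": lambda x: math.exp(x),                                     # exponential
}

for name, inverse in INVERSE_LINKS.items():
    print(name, [round(inverse(x), 4) for x in (-1.0, 0.0, 1.0)])
```

Under these definitions, only the first three inverses are guaranteed to stay between 0 and 1. The identity and exponential inverses can produce values outside that range.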
Important note
A probabilizer is trained on the output of a classifier applied over a
dataset. The dataset should not have been used to train the classifier,
as this will result in a biased probabilizer.
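One simple way to keep the two datasets disjoint is a deterministic split on a hash of the row identifier, sketched below. The helper name, the row identifiers and the one-in-five holdout ratio are all hypothetical. Any scheme that keeps the probabilizer's rows out of the classifier's training set serves the same purpose.

```python
import zlib


def split_for_calibration(row_ids, holdout_every=5):
    """Send roughly 1 row in holdout_every to the probabilizer, the rest to the classifier."""
    classifier_rows, probabilizer_rows = [], []
    for row_id in row_ids:
        # crc32 is stable across runs, unlike Python's built-in hash() on strings.
        bucket = zlib.crc32(str(row_id).encode("utf-8")) % holdout_every
        if bucket == 0:
            probabilizer_rows.append(row_id)
        else:
            classifier_rows.append(row_id)
    return classifier_rows, probabilizer_rows


row_ids = ["row%d" % i for i in range(1000)]
classifier_rows, probabilizer_rows = split_for_calibration(row_ids)
print(len(classifier_rows), len(probabilizer_rows))
```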
See also