Classifier Training Procedure

This procedure trains a classification model and stores the model file to disk.

Configuration

A new procedure of type classifier.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "classifier.train",
    "params": {
        "mode": <ClassifierMode>,
        "multilabelStrategy": <MultilabelStrategy>,
        "trainingData": <InputQuery>,
        "algorithm": <string>,
        "configuration": <JSON>,
        "configurationFile": <string>,
        "equalizationFactor": <float>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default	Description
mode ClassifierMode `"boolean"`	Model mode: `boolean`, `regression`, `categorical` or `multilabel`. Controls how the label is interpreted and what the classifier outputs.
multilabelStrategy MultilabelStrategy `"one-vs-all"`	Multilabel strategy: `random`, `decompose` or `one-vs-all`. Controls how examples are prepared to handle multilabel classification.
trainingData InputQuery	SQL query which specifies the features, labels and optional weights for training. The query should be of the form `select {f1, f2} as features, x as label from ds`. The select expression must contain these two columns: `features`, a row expression identifying the features on which to train; and `label`, one expression identifying the row's label(s), whose type must match the classifier mode: a boolean (0 or 1) in `boolean` mode; a real number in `regression` mode; any combination of numbers and strings in `categorical` mode; a row, in which each non-null column is a separate label, in `multilabel` mode. Rows with null labels will be ignored. The select expression can also contain an optional `weight` column, which sets the relative importance of examples. It must be a real number. If `weight` is not specified, each row will have a weight of 1. Rows with a null weight will cause a training error. The query must not contain `GROUP BY` or `HAVING` clauses and, unlike most select expressions, this one can only select whole columns, not expressions involving columns: `X` will work, but not `X + 1`. If you need derived values in the query, first create a dataset with the derived columns and query that dataset instead.
algorithm string	Algorithm to use to train the classifier. This must name an entry in the `configuration` or `configurationFile` parameters. See the classifier configuration documentation for details.
configuration JSON	Configuration object to use for the classifier. Each classifier type has its own parameters. If none is passed, the configuration will be loaded from the `configurationFile` parameter. See the classifier configuration documentation for details.
configurationFile string `"/opt/bin/classifiers.json"`	File to load configuration from. This is a JSON file containing only objects, strings and numbers. If the `configuration` object is non-empty, it is used in preference to this file. See the classifier configuration documentation for details.
equalizationFactor float `0.5`	Amount to adjust weights so that all classes have an equal total weight. A value of 0 will not equalize weights at all. A value of 1 will ensure that the total weight for positive and negative examples is exactly identical. A value in between gives a tradeoff between the two. Typically 0.5 (the default) is a good value to use for unbalanced probabilities. See the classifier configuration documentation for details.
modelFileUrl Url	URL where the model file (with extension '.cls') should be saved. This file can be loaded by the `classifier` function type. This parameter is optional unless the `functionName` parameter is used.
functionName string	If specified, an instance of the `classifier` function type of this name will be created using the trained model. Note that to use this parameter, the `modelFileUrl` must also be provided.
runOnCreation bool `true`	If true, the procedure will be run immediately. The response will contain an extra field called `firstRun` pointing to the URL of the run.
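
As a rough illustration of how these parameters fit together, the sketch below builds a `trainingData` query and a procedure configuration in plain Python, and checks two rules stated in the table: only whole columns may be selected, and `functionName` requires `modelFileUrl`. The helper names, the dataset and column names, the model URL, the function name and the procedure id are all hypothetical. The algorithm name `bbdt` is assumed to be an entry in the configuration in use. Treating a "whole column" as a bare identifier is a simplification made for this sketch, not a rule taken from MLDB.

```python
import json
import re

# Simplifying assumption for this sketch: a "whole column" is a bare identifier.
_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
MODES = {"boolean", "regression", "categorical", "multilabel"}


def training_query(features, label, dataset, weight=None):
    """Build a query of the form `select {f1, f2} as features, x as label from ds`."""
    names = list(features) + [label, dataset]
    if weight is not None:
        names.append(weight)
    for name in names:
        if not _NAME.match(name):
            # Expressions such as "X + 1" are rejected, mirroring the rule above.
            raise ValueError("only whole columns can be selected, got %r" % name)
    query = "select {%s} as features, %s as label" % (", ".join(features), label)
    if weight is not None:
        query += ", %s as weight" % weight
    return query + " from " + dataset


def classifier_train_config(training_data, algorithm, mode="boolean",
                            model_file_url=None, function_name=None):
    """Assemble the body sent to /v1/procedures/<id> (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError("unknown mode %r" % mode)
    if function_name is not None and model_file_url is None:
        raise ValueError("functionName requires modelFileUrl")
    params = {"mode": mode, "trainingData": training_data, "algorithm": algorithm}
    if model_file_url is not None:
        params["modelFileUrl"] = model_file_url
    if function_name is not None:
        params["functionName"] = function_name
    return {"type": "classifier.train", "params": params}


config = classifier_train_config(
    training_query(["age", "fare"], "survived", "titanic_train"),  # hypothetical names
    algorithm="bbdt",                           # assumed to exist in the configuration
    model_file_url="file://models/titanic.cls",  # hypothetical location
    function_name="titanic_scorer",              # hypothetical function name
)
print(json.dumps(config, indent=4))

# With an MLDB connection object named `mldb`, as in the template above:
# mldb.put("/v1/procedures/titanic_train_cls", config)
```

Parameters left out of `params` fall back to the defaults listed in the table.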

Algorithm configuration

This procedure supports many training algorithms. Their configuration is explained on the classifier configuration page.

Status Output

The status of a classifier.train procedure run returns a JSON representation of the trained classifier's model parameters, to allow introspection.

Operation Modes

The mode field controls which mode the classifier will operate in:

boolean mode will use a boolean label, and will predict the probability of the label being true as a single floating point number.
regression mode will use a numeric label, and will predict the value of the label itself.
categorical mode will use a categorical (multi-class) label, and will predict the probability of each of the categories independently. This style therefore produces multiple outputs.
multilabel mode will do multi-label classification by using a set of categorical (multi-class) labels, and will predict the probability of each of the categories independently. This style therefore produces multiple outputs. The multilabelStrategy field controls how multilabel classification is handled.
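
To make the label expectations of each mode concrete, the following sketch checks whether a Python value has the shape that mode expects, following the label types listed for `trainingData`. It only illustrates those rules and is not MLDB's own validation code. The function name is hypothetical, and representing a row as a Python `dict` is an assumption of the sketch.

```python
from numbers import Real


def label_matches_mode(label, mode):
    """Return True if `label` has the shape the given mode expects."""
    if label is None:
        # Rows with null labels are ignored during training.
        return False
    is_number = isinstance(label, Real) and not isinstance(label, bool)
    if mode == "boolean":
        # A boolean label is 0 or 1 (Python booleans are accepted too).
        return isinstance(label, (bool, int, float)) and label in (0, 1)
    if mode == "regression":
        # A real number.
        return is_number
    if mode == "categorical":
        # Numbers and strings may be mixed across rows.
        return is_number or isinstance(label, str)
    if mode == "multilabel":
        # A row; each non-null column is a separate label.
        return isinstance(label, dict)
    raise ValueError("unknown mode %r" % mode)


print(label_matches_mode(1, "boolean"))
print(label_matches_mode("cat", "regression"))
```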

Multilabel classification

In all operation modes except multilabel, the label is a single scalar value. The multilabel mode handles categorical classification problems where each example has a set of labels instead of a single one. To this end, the label input must be a row, in which each column with a non-null value is a label in the example's set. The column name is used to identify the label, while the value itself is disregarded. This makes multi-label classification easy to use with bag of words, for example.
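
The way a label row is read in multilabel mode can be sketched in a few lines of Python. The row is modeled as a `dict` with `None` standing in for null, which is an assumption of this sketch, and both the function name and the example words are hypothetical.

```python
def label_set(label_row):
    """Names of the non-null columns of a multilabel row, sorted for stable output."""
    return sorted(name for name, value in label_row.items() if value is not None)


# A bag-of-words style row: the counts are disregarded, only the column names matter.
row = {"sports": 1, "politics": None, "hockey": 3}
print(label_set(row))  # ['hockey', 'sports']
```

Note that a column holding `0` still counts as a label, since only null values are excluded.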

Examples

The Predicting Titanic Survival demo notebook