Classifier Training Procedure

This procedure trains a classification model and stores the model file to disk.

Configuration

A new procedure of type classifier.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "classifier.train",
    "params": {
        "mode": <ClassifierMode>,
        "multilabelStrategy": <MultilabelStrategy>,
        "trainingData": <InputQuery>,
        "algorithm": <string>,
        "configuration": <JSON>,
        "configurationFile": <string>,
        "equalizationFactor": <float>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

mode
ClassifierMode
"boolean"

Model mode: boolean, regression, categorical or multilabel. Controls how the label is interpreted and what the classifier outputs.

multilabelStrategy
MultilabelStrategy
"one-vs-all"

Multilabel strategy: random, decompose or one-vs-all. Controls how examples are prepared to handle multilabel classification.

trainingData
InputQuery

SQL query which specifies the features, labels and optional weights for training. The query should be of the form select {f1, f2} as features, x as label from ds.

The select expression must contain these two columns:

• features: a row expression to identify the features on which to train, and
• label: one expression to identify the row's label(s), whose type must match that of the classifier mode. Rows with null labels will be ignored. The expected type for each mode is:
    • boolean mode: a boolean (0 or 1)
    • regression mode: a real number
    • categorical mode: any combination of numbers and strings
    • multilabel mode: a row, in which each non-null column is a separate label

The select expression can contain an optional weight column. The weight allows the relative importance of examples to be set. It must be a real number. If the weight is not specified each row will have a weight of 1. Rows with a null weight will cause a training error.

The query must not contain GROUP BY or HAVING clauses and, unlike most select expressions, this one can only select whole columns, not expressions involving columns. So X will work, but not X + 1. If you need derived values in the query, create a dataset with the derived columns as a previous step and use a query on that dataset instead.
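As a minimal sketch of the query shape described above, the snippet below assembles a trainingData query string with features, a label and an optional weight. The helper name build_training_query, the dataset name training_ds and the column names age, income, clicked and importance are hypothetical and chosen only for illustration.

```python
def build_training_query(dataset, feature_columns, label_column, weight_column=None):
    """Assemble a trainingData query that selects whole columns only."""
    # The features are a row expression: {f1, f2, ...}
    features = ", ".join(feature_columns)
    select = "select {%s} as features, %s as label" % (features, label_column)
    # The weight column is optional; without it every row has a weight of 1.
    if weight_column is not None:
        select += ", %s as weight" % weight_column
    return "%s from %s" % (select, dataset)


# Hypothetical dataset and column names.
query = build_training_query(
    "training_ds", ["age", "income"], "clicked", weight_column="importance"
)
print(query)
```

The helper only ever names whole columns, in line with the restriction that expressions such as X + 1 are not accepted in this query.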

algorithm
string

Algorithm with which to train the classifier. This must point to an entry in the configuration or configurationFile parameters. See the classifier configuration documentation for details.

configuration
JSON

Configuration object to use for the classifier. Each algorithm has its own parameters. If none is passed, the configuration will be loaded from the configurationFile parameter. See the classifier configuration documentation for details.

configurationFile
string
"/opt/bin/classifiers.json"

File to load configuration from. This is a JSON file containing only objects, strings and numbers. If the configuration object is non-empty, it takes precedence over this file. See the classifier configuration documentation for details.

equalizationFactor
float
0.5

Amount to adjust weights so that all classes have an equal total weight. A value of 0 will not equalize weights at all. A value of 1 will ensure that the total weight for both positive and negative examples is exactly identical. A value between 0 and 1 gives an intermediate tradeoff. Typically 0.5 (the default) is a good number to use for unbalanced probabilities. See the classifier configuration documentation for details.
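To make the two documented endpoints concrete, here is a rough sketch of one possible equalization rule. The formula (scaling each weight by its class total raised to the power of minus the factor) is an assumption picked because it leaves weights unchanged at 0 and makes class totals identical at 1. It is not taken from the MLDB implementation, and the function name equalize_weights is hypothetical.

```python
from collections import defaultdict


def equalize_weights(labels, weights, factor):
    """Scale each weight by (total weight of its class) ** -factor.

    ASSUMPTION: this formula only illustrates the documented endpoints
    (0 = no equalization, 1 = equal class totals); MLDB may differ.
    """
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    return [w * totals[l] ** -factor for l, w in zip(labels, weights)]


def class_totals(labels, weights):
    """Sum the weights of positive and negative examples separately."""
    pos = sum(w for l, w in zip(labels, weights) if l)
    neg = sum(w for l, w in zip(labels, weights) if not l)
    return pos, neg


# One positive example and three negative examples, all with weight 1.
labels = [True, False, False, False]
weights = [1.0, 1.0, 1.0, 1.0]

for factor in (0.0, 0.5, 1.0):
    adjusted = equalize_weights(labels, weights, factor)
    print(factor, class_totals(labels, adjusted))
```

Under this assumed rule, intermediate factors shrink the gap between the class totals without removing it entirely.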

modelFileUrl
Url

URL where the model file (with extension '.cls') should be saved. This file can be loaded by the classifier function type. This parameter is optional unless the functionName parameter is used.

functionName
string

If specified, an instance of the classifier function type of this name will be created using the trained model. Note that to use this parameter, the modelFileUrl must also be provided.

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
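The parameters above can be combined into a concrete configuration along the following lines. This is a sketch under stated assumptions: the helper build_classifier_train_config, the procedure name train_clicked, the dataset and column names, the model file URL, the function name and the contents of the "dt" configuration entry are all hypothetical. The helper also enforces the documented rule that functionName requires modelFileUrl.

```python
def build_classifier_train_config(training_data, algorithm, configuration,
                                  mode="boolean", model_file_url=None,
                                  function_name=None, run_on_creation=True):
    """Build the JSON body for a classifier.train procedure."""
    # functionName can only be used when modelFileUrl is also provided.
    if function_name is not None and model_file_url is None:
        raise ValueError("functionName requires modelFileUrl")
    params = {
        "mode": mode,
        "trainingData": training_data,
        "algorithm": algorithm,
        "configuration": configuration,
        "runOnCreation": run_on_creation,
    }
    if model_file_url is not None:
        params["modelFileUrl"] = model_file_url
    if function_name is not None:
        params["functionName"] = function_name
    return {"type": "classifier.train", "params": params}


# All concrete values below are illustrative assumptions.
config = build_classifier_train_config(
    training_data="select {age, income} as features, clicked as label from training_ds",
    algorithm="dt",
    configuration={"dt": {"type": "decision_tree"}},
    model_file_url="file://models/clicked.cls",
    function_name="clicked_classifier",
)

# With an MLDB connection available as `mldb`, this would be submitted as:
# mldb.put("/v1/procedures/train_clicked", config)
```

Building the body separately from the mldb.put call makes it possible to inspect or validate the parameters before the procedure is created and, with runOnCreation left at true, immediately run.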

Algorithm configuration

This procedure supports many training algorithms. Their configuration is explained on the classifier configuration page.

Status Output

The status of a classifier training procedure returns a JSON representation of the model parameters of the trained classifier, to allow introspection.

Operation Modes

The mode field controls which mode the classifier will operate in:

• boolean mode will use a boolean label, and will predict the probability of the label being true as a single floating point number.
• regression mode will use a numeric label, and will predict the value of the label itself.
• categorical mode will use a categorical (multi-class) label, and will predict the probability of each of the categories independently. This style therefore produces multiple outputs.
• multilabel mode will do multi-label classification by using a set of categorical (multi-class) labels, and will predict the probability of each of the categories independently. This style therefore produces multiple outputs. The multilabelStrategy field controls how multilabel classification is handled.
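The relationship between mode and label type can be sketched as a small client-side check. This is illustrative only: the function label_matches_mode is hypothetical, it represents a row as a Python dict, and it does not reproduce the validation MLDB itself performs.

```python
import numbers


def label_matches_mode(label, mode):
    """Return True if a non-null label value fits the given mode.

    Illustrative check only; rows are modelled as Python dicts.
    """
    if mode == "boolean":
        # True/False compare equal to 1/0 in Python.
        return label in (0, 1)
    if mode == "regression":
        return isinstance(label, numbers.Real)
    if mode == "categorical":
        return isinstance(label, (numbers.Real, str))
    if mode == "multilabel":
        return isinstance(label, dict)
    raise ValueError("unknown mode: %r" % (mode,))


print(label_matches_mode(1, "boolean"))
print(label_matches_mode("cat", "categorical"))
print(label_matches_mode({"sports": 1}, "multilabel"))
```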

Multilabel classification

In all operation modes but multilabel, the label is a single scalar value. The multilabel mode handles categorical classification problems where each example has a set of labels instead of a single one. To this end, the label input must be a row. In this row, each column with a non-null value will be a label in the example's set. The column name is used to identify the label, while the value itself is disregarded. This makes multi-label classification easy to use with bag of words, for example.
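The row-to-label-set rule can be sketched as follows, modelling a row as a Python dict whose keys are column names. The function names and the example data are hypothetical. The second helper shows why a bag of words fits this format naturally: each word becomes a column name, and the value stored under it does not matter.

```python
def label_set_from_row(row):
    """Return the label set encoded by a multilabel row.

    Every column with a non-null value contributes its name as a label;
    the value itself is disregarded.
    """
    return {name for name, value in row.items() if value is not None}


def row_from_words(text):
    """Turn whitespace-separated words into a bag-of-words style row."""
    return {word: 1 for word in text.split()}


# Hypothetical label row: "politics" is null, so it is not a label.
row = {"sports": 1, "politics": None, "tech": "anything"}
labels = label_set_from_row(row)
print(sorted(labels))

tags = label_set_from_row(row_from_words("python sql python"))
print(sorted(tags))
```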