Classifier Training Procedure

This procedure trains a random forest model for binary classification and saves it to a model file.

This procedure is a variant of the generic bagged decision tree classifier (see classifier.train procedure type) that has been optimized for training a forest of trees for binary classification on dense, tabular data.

Configuration

A new procedure of type randomforest.binary.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "randomforest.binary.train",
    "params": {
        "trainingData": <InputQuery>,
        "modelFileUrl": <Url>,
        "featureVectorSamplings": <int>,
        "featureVectorSamplingProp": <float>,
        "featureSamplings": <int>,
        "featureSamplingProp": <float>,
        "maxDepth": <int>,
        "functionName": <string>,
        "verbosity": <bool>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

trainingData
InputQuery

Specification of the data for input to the classifier procedure. The select expression must contain these two sub-expressions: one row expression to identify the features on which to train and one scalar expression to identify the label. The type of the label expression must be a boolean (0 or 1). Rows whose label is null are skipped. The select statement does not support groupby and having clauses. Also, unlike most select expressions, this one can only select whole columns, not expressions involving columns. So X will work, but not X + 1. If you need derived values in the select expression, create a dataset with the derived columns as a previous step and run the classifier over that dataset instead.

modelFileUrl
Url

URL where the model file (with extension '.cls') should be saved. This file can be loaded by the classifier function type.

featureVectorSamplings
int
5

Number of samplings of feature vectors. The total number of bags will be featureVectorSamplings*featureSamplings.

featureVectorSamplingProp
float
0.30000001192092896

Proportion of feature vectors to select in each sample.

featureSamplings
int
20

Number of samplings of features. The total number of bags will be featureVectorSamplings*featureSamplings.

featureSamplingProp
float
0.30000001192092896

Proportion of features to select in each sample.

maxDepth
int
20

Maximum depth of the trees.

functionName
string

If specified, an instance of the classifier function type of this name will be created using the trained model. Note that to use this parameter, the modelFileUrl must also be provided.

verbosity
bool
false

If true, the procedure produces verbose output for debugging and tuning purposes.

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
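
As an illustration of the parameters above, the following sketch builds a configuration using the documented default values and computes the resulting number of bags. The dataset name `my_training_data`, the column name `label`, the model path, the function name `rf_scorer`, and the helpers `build_config` and `total_bags` are all hypothetical. The `features` and `label` aliases in the query are assumed to follow the convention of the generic classifier.train procedure. The final `mldb.put` call is left commented out because the `mldb` object only exists inside an MLDB Python environment.

```python
def build_config(dataset, label_column, model_url, function_name=None):
    """Build a randomforest.binary.train configuration (hypothetical helper)."""
    # One row expression for the features and one scalar expression for the
    # label. Only whole columns are selected, never expressions like X + 1.
    query = (
        "SELECT {* EXCLUDING (" + label_column + ")} AS features, "
        + label_column + " AS label FROM " + dataset
    )
    params = {
        "trainingData": query,
        "modelFileUrl": model_url,
        "featureVectorSamplings": 5,       # documented default
        "featureVectorSamplingProp": 0.3,  # documented default (as float32)
        "featureSamplings": 20,            # documented default
        "featureSamplingProp": 0.3,        # documented default (as float32)
        "maxDepth": 20,                    # documented default
        "verbosity": False,
        "runOnCreation": True,
    }
    # functionName is optional and requires modelFileUrl to be set as well.
    if function_name is not None:
        params["functionName"] = function_name
    return {"type": "randomforest.binary.train", "params": params}


def total_bags(params):
    """Total number of bags = featureVectorSamplings * featureSamplings."""
    return params["featureVectorSamplings"] * params["featureSamplings"]


config = build_config(
    "my_training_data", "label", "file://models/rf.cls", "rf_scorer"
)
print(total_bags(config["params"]))  # 5 * 20 = 100

# Inside MLDB, the procedure would then be created with:
# mldb.put("/v1/procedures/my_rf_trainer", config)
```

With the defaults, each of the 5 feature vector samplings is combined with 20 feature samplings, so the forest contains 100 bags.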

Input data

This classification procedure will work most efficiently on datasets that have their data well-organized by column, such as the Tabular dataset.

This optimized version only supports dense data: no training sample may contain a null value.
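
A minimal sketch of a pre-training check for the density requirement, assuming the training rows have been exported as plain Python dictionaries. The helper name `find_sparse_rows` and the sample data are hypothetical, and this is not part of the MLDB API. A column missing from a row is treated the same as an explicit null.

```python
def find_sparse_rows(rows, columns):
    """Return the indices of rows that have a null or missing value
    in any of the given columns."""
    return [
        i for i, row in enumerate(rows)
        if any(row.get(col) is None for col in columns)
    ]


rows = [
    {"age": 31, "city": "Montreal", "label": 1},
    {"age": None, "city": "Quebec", "label": 0},  # explicit null feature
    {"age": 45, "label": 1},                      # missing "city" column
]

sparse = find_sparse_rows(rows, ["age", "city"])
print(sparse)  # [1, 2]
```

Rows flagged this way would need to be filtered out or have their values imputed in a prior step before the dataset is passed to this procedure.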

It only supports binary classification. The generic classifier.train procedure supports regression and multi-class classification.

Feature values can be numeric or strings. Strictly numeric features are treated as ordinal, while features that contain only strings or a mix of strings and numeric values are treated as nominal. Other value types (blobs, timestamps, intervals, etc.) are not yet supported.
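
The rule above can be sketched as a small function that inspects the values of one feature column. This only illustrates the rule as stated here and is not MLDB's internal implementation. The name `feature_kind` is hypothetical, and Python `bytes` is used as a stand-in for an unsupported blob value.

```python
def feature_kind(values):
    """Classify a feature column from its values.

    Returns "ordinal" for strictly numeric values, "nominal" for strings
    or a mix of strings and numbers, and "unsupported" otherwise.
    """
    if all(isinstance(v, (int, float)) for v in values):
        return "ordinal"
    if all(isinstance(v, (int, float, str)) for v in values):
        return "nominal"
    return "unsupported"


print(feature_kind([1, 2.5, 3]))       # ordinal
print(feature_kind(["red", "blue"]))   # nominal
print(feature_kind(["red", 3]))        # nominal
print(feature_kind([b"\x00\x01", 1]))  # unsupported
```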

Output model

The resulting model is a .cls classifier model that is compatible with the classifier function and the classifier.test procedure.