This procedure trains a binary random forest classifier model and stores the model file. It is a variant of the generic bagged decision tree classifier (see the classifier.train procedure type) that has been optimized for binary classification on dense, tabular data, using a forest of trees.
A new procedure of type randomforest.binary.train named <id> can be created as follows:

```
mldb.put("/v1/procedures/"+<id>, {
    "type": "randomforest.binary.train",
    "params": {
        "trainingData": <InputQuery>,
        "modelFileUrl": <Url>,
        "featureVectorSamplings": <int>,
        "featureVectorSamplingProp": <float>,
        "featureSamplings": <int>,
        "featureSamplingProp": <float>,
        "maxDepth": <int>,
        "functionName": <string>,
        "verbosity": <bool>,
        "runOnCreation": <bool>
    }
})
```
with the following key-value definitions for `params`:

| Field | Description |
|---|---|
| trainingData | Specification of the data for input to the classifier procedure. The select expression must contain two sub-expressions: one row expression to identify the features on which to train, and one scalar expression to identify the label. The label expression must evaluate to a boolean (0 or 1); rows with a null label are skipped. The select statement does not support GROUP BY or HAVING clauses. Also, unlike most select expressions, this one can only select whole columns, not expressions involving columns, so `x` will work but `x + 1` will not. If you need derived values, create a dataset with the derived columns as a previous step and run the classifier over that dataset instead. |
| modelFileUrl | URL where the model file (with extension `.cls`) should be saved. This file can be loaded by the classifier function. |
| featureVectorSamplings | Number of samplings of feature vectors. The total number of bags will be featureVectorSamplings * featureSamplings. |
| featureVectorSamplingProp | Proportion of feature vectors to select in each sample. |
| featureSamplings | Number of samplings of features. The total number of bags will be featureVectorSamplings * featureSamplings. |
| featureSamplingProp | Proportion of features to select in each sample. |
| maxDepth | Maximum depth of the trees. |
| functionName | If specified, a classifier function of this name will be created from the trained model. |
| verbosity | Should the procedure be verbose for debugging and tuning purposes. |
| runOnCreation | If true, the procedure will be run immediately, and the response will contain an extra field with the output of that first run. |
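
For example, a minimal training call might look like the following sketch. The dataset name `my_data`, the feature columns `x`, `y` and `z`, the boolean `label` column, the output location `file://my_model.cls` and the parameter values are all illustrative assumptions, not part of the API:

```python
mldb.put("/v1/procedures/rf_train", {
    "type": "randomforest.binary.train",
    "params": {
        # Whole columns only: a row of features and a scalar boolean label
        "trainingData": """
            SELECT {x, y, z} AS features, label
            FROM my_data
        """,
        "modelFileUrl": "file://my_model.cls",   # saved with a .cls extension
        "featureVectorSamplings": 5,
        "featureVectorSamplingProp": 0.3,
        "featureSamplings": 20,
        "featureSamplingProp": 0.3,
        "maxDepth": 20,
        "functionName": "rf_scorer",             # also expose the model as a function
        "runOnCreation": True
    }
})
```

With these illustrative values the procedure would train 5 * 20 = 100 bags in total, per the featureVectorSamplings * featureSamplings rule above.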
This classification procedure will work most efficiently on datasets that have their data well-organized by column, such as the Tabular dataset.
This optimized version supports only dense values: no training sample may contain a null value.
It only supports binary classification. The generic classifier.train procedure supports regression and multi-class classification.
Feature values can be numeric or strings. Strictly numeric features will be treated as ordinal, while features that contain only strings, or a mix of strings and numeric values, will be treated as nominal. Other value types (blobs, timestamps, intervals, etc.) are not yet supported.
The resulting model is a .cls classifier model that is compatible with the classifier function and the classifier.test procedure.
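
As a sketch of applying the saved model, the `.cls` file can be loaded into a classifier function and used in a query. The function name, model URL and column names continue the hypothetical example above, and `mldb.query` assumes a pymldb-style connection object; if `functionName` was set during training, the function already exists and the first call is unnecessary:

```python
# Load the saved .cls model into a classifier function (hypothetical name)
mldb.put("/v1/functions/rf_scorer", {
    "type": "classifier",
    "params": {
        "modelFileUrl": "file://my_model.cls"
    }
})

# Apply it to feature vectors; the output is a classification score
scores = mldb.query("""
    SELECT rf_scorer({features: {x, y, z}}) AS *
    FROM my_data
    LIMIT 5
""")
```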
See also:

- The classifier.train procedure type trains a classifier.
- The classifier.test procedure type allows the accuracy of a predictor to be tested against held-out data.
- The classifier function type applies a classifier to a feature vector, producing a classification score.
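
Continuing the hypothetical example, accuracy on held-out data could then be measured with the classifier.test procedure; the dataset name `my_heldout_data` and the function and column names below are again illustrative assumptions:

```python
mldb.put("/v1/procedures/rf_test", {
    "type": "classifier.test",
    "params": {
        "mode": "boolean",
        # Each row must yield the model's score and the true label
        "testingData": """
            SELECT rf_scorer({features: {x, y, z}})[score] AS score,
                   label
            FROM my_heldout_data
        """,
        "runOnCreation": True
    }
})
```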