This procedure trains a binary random forest classifier model and stores the model file. It is a variant of the generic bagged decision tree classifier (see the classifier.train procedure type) that has been optimized for binary classification on dense, tabular data, using a forest of trees.
A new procedure of type randomforest.binary.train named <id> can be created as follows:

```
mldb.put("/v1/procedures/"+<id>, {
    "type": "randomforest.binary.train",
    "params": {
        "trainingData": <InputQuery>,
        "modelFileUrl": <Url>,
        "featureVectorSamplings": <int>,
        "featureVectorSamplingProp": <float>,
        "featureSamplings": <int>,
        "featureSamplingProp": <float>,
        "maxDepth": <int>,
        "functionName": <string>,
        "verbosity": <bool>,
        "runOnCreation": <bool>
    }
})
```
with the following key-value definitions for `params`:

| Field | Description |
|---|---|
| trainingData | Specification of the data for input to the classifier procedure. The select expression must contain two sub-expressions: one row expression to identify the features on which to train, and one scalar expression to identify the label. The label expression must evaluate to a boolean (0 or 1); rows with a null label are skipped. The select statement does not support GROUP BY or HAVING clauses. Also, unlike most select expressions, this one can only select whole columns, not expressions involving columns, so `x` will work but `x + 1` will not. If you need derived values, create a dataset with the derived columns as a previous step and run the classifier over that dataset instead. |
| modelFileUrl | URL where the model file (with extension `.cls`) should be saved. This file can be loaded by the classifier function. |
| featureVectorSamplings | Number of samplings of feature vectors. The total number of bags will be featureVectorSamplings * featureSamplings. |
| featureVectorSamplingProp | Proportion of feature vectors to select in each sample. |
| featureSamplings | Number of samplings of features. The total number of bags will be featureVectorSamplings * featureSamplings. |
| featureSamplingProp | Proportion of features to select in each sample. |
| maxDepth | Maximum depth of the trees. |
| functionName | If specified, a classifier function of this name will be created from the trained model. |
| verbosity | Should the procedure be verbose for debugging and tuning purposes. |
| runOnCreation | If true, the procedure will be run immediately, and the response will contain an extra field with the output of that first run. |
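
For example, a minimal training call might look like the following sketch. The dataset name `my_data`, the feature columns `x`, `y` and `z`, the boolean `label` column, the output location `file://my_model.cls` and the parameter values are all illustrative assumptions, not part of the API:

```python
mldb.put("/v1/procedures/rf_train", {
    "type": "randomforest.binary.train",
    "params": {
        # Whole columns only: a row of features and a scalar boolean label
        "trainingData": """
            SELECT {x, y, z} AS features, label
            FROM my_data
        """,
        "modelFileUrl": "file://my_model.cls",   # saved with a .cls extension
        "featureVectorSamplings": 5,
        "featureVectorSamplingProp": 0.3,
        "featureSamplings": 20,
        "featureSamplingProp": 0.3,
        "maxDepth": 20,
        "functionName": "rf_scorer",             # also expose the model as a function
        "runOnCreation": True
    }
})
```

With these illustrative values the procedure would train 5 * 20 = 100 bags in total, per the featureVectorSamplings * featureSamplings rule above.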
This classification procedure will work most efficiently on datasets that have their data well-organized by column, such as the Tabular dataset.
This optimized version supports only dense values: no training sample may contain a null value.
It only supports binary classification. The generic classifier.train procedure supports regression and multi-class classification.
Feature values can be numeric or strings. Strictly numeric features will be treated as ordinal, while features that contain only strings, or a mix of strings and numeric values, will be treated as nominal. Other value types (blobs, timestamps, intervals, etc.) are not yet supported.
The resulting model is a .cls classifier model that is compatible with the classifier function and the classifier.test procedure.
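
As a sketch of applying the saved model, the `.cls` file can be loaded into a classifier function and used in a query. The function name, model URL and column names continue the hypothetical example above, and `mldb.query` assumes a pymldb-style connection object; if `functionName` was set during training, the function already exists and the first call is unnecessary:

```python
# Load the saved .cls model into a classifier function (hypothetical name)
mldb.put("/v1/functions/rf_scorer", {
    "type": "classifier",
    "params": {
        "modelFileUrl": "file://my_model.cls"
    }
})

# Apply it to feature vectors; the output is a classification score
scores = mldb.query("""
    SELECT rf_scorer({features: {x, y, z}}) AS *
    FROM my_data
    LIMIT 5
""")
```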
See also:

- The classifier.train procedure type trains a classifier.
- The classifier.test procedure type allows the accuracy of a predictor to be tested against held-out data.
- The classifier function type applies a classifier to a feature vector, producing a classification score.
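
Continuing the hypothetical example, accuracy on held-out data could then be measured with the classifier.test procedure; the dataset name `my_heldout_data` and the function and column names below are again illustrative assumptions:

```python
mldb.put("/v1/procedures/rf_test", {
    "type": "classifier.test",
    "params": {
        "mode": "boolean",
        # Each row must yield the model's score and the true label
        "testingData": """
            SELECT rf_scorer({features: {x, y, z}})[score] AS score,
                   label
            FROM my_heldout_data
        """,
        "runOnCreation": True
    }
})
```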