Classifier Experiment Procedure

The classifier experiment procedure is used to train and evaluate a classifier in a single run. It wraps the classifier.train and classifier.test procedure types into a single, easier-to-use procedure. It can also perform k-fold cross-validation by specifying multiple folds over the data to use for training and testing.

The classifier.experiment procedure will run multiple rounds of training and testing, based on the settings of the inputData, kfold, datasetFolds and testingDataOverride parameters, according to the steps laid out below.

Configuration

A new procedure of type classifier.experiment named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "classifier.experiment",
    "params": {
        "experimentName": <string>,
        "mode": <ClassifierMode>,
        "multilabelStrategy": <MultilabelStrategy>,
        "recallOverN": <ARRAY [ int ]>,
        "inputData": <InputQuery>,
        "kfold": <int>,
        "datasetFolds": <ARRAY [ DatasetFoldConfig ]>,
        "testingDataOverride": <InputQuery (Optional)>,
        "algorithm": <string>,
        "configuration": <JSON>,
        "configurationFile": <string>,
        "equalizationFactor": <float>,
        "modelFileUrlPattern": <Url>,
        "keepArtifacts": <bool>,
        "evalTrain": <bool>,
        "outputAccuracyDataset": <bool>,
        "uniqueScoresOnly": <bool>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

experimentName
string

A string without spaces which will be used to name the various datasets, procedures and functions created when this procedure runs.

mode
ClassifierMode
"boolean"

Model mode: boolean, regression, categorical or multilabel. Controls how the label is interpreted and what the classifier outputs.

multilabelStrategy
MultilabelStrategy
"one-vs-all"

Multilabel strategy: random, decompose or one-vs-all. Controls how examples are prepared to handle multilabel classification.

  • random will select a label at random among the example's set
  • decompose will decompose the multilabel combination into single-label examples
  • one-vs-all will train a probabilized binary classifier for each label

This only applies if mode is equal to multilabel.

recallOverN
ARRAY [ int ]

Calculate a recall score over the top scoring labels. Does not apply to boolean or regression modes.

inputData
InputQuery

SQL query which specifies the features, labels and optional weights for the training and testing procedures. This query is used to create a training and testing set according to the steps laid out below.

The query should be of the form select {f1, f2} as features, x as label from ds.

The select expression must contain these two columns:

  • features: a row expression to identify the features on which to train, and
  • label: one expression to identify the row's label(s), and whose type must match that of the classifier mode. Rows with null labels will be ignored.
    • boolean mode: a boolean (0 or 1)
    • regression mode: a real number
    • categorical mode: any combination of numbers and strings
    • multilabel mode: a row, in which each non-null column is a separate label

The select expression can contain an optional weight column. The weight allows the relative importance of examples to be set. It must be a real number. If the weight is not specified, each row will have a weight of 1. Rows with a null weight will cause a training error. An example query including a weight column is sketched below, after the parameter definitions.

The query must not contain WHERE, LIMIT, OFFSET, GROUP BY or HAVING clauses (they can be defined in datasetFolds) and, unlike most select expressions, this one can only select whole columns, not expressions involving columns. So X will work, but not X + 1. If you need derived values in the query, create a dataset with the derived columns as a previous step and use a query on that dataset instead.

kfold
int
0

Do a k-fold cross-validation. This is a helper parameter that generates a DatasetFoldConfig splitting the dataset into k subsamples, each fold testing on one kth of the input data and training on the rest.

A value of 0 (the default) means k-fold cross-validation is not used, and 1 is not a valid value. This parameter cannot be specified at the same time as the datasetFolds parameter and, in general, should not be used at the same time as the testingDataOverride parameter.

datasetFolds
ARRAY [ DatasetFoldConfig ]

DatasetFoldConfig to use. This parameter can be used if the dataset folds required are more complex than a simple k-fold cross-validation. It cannot be specified at the same time as the kfold parameter and in general should not be used at the same time as the testingDataOverride parameter.

testingDataOverride
InputQuery (Optional)

SQL query which overrides the input data for the testing procedure. This optional parameter must be of the same form as the inputData parameter above, and by default takes the same value as inputData.

This query is used to create a test set according to the steps laid out below.

This parameter is useful when it is necessary to test on data contained in a different dataset from the training data, or to calculate accuracy statistics with uneven weighting, for example to counteract the effect of non-uniform sampling in the inputData.

The query must not contain WHERE, LIMIT, OFFSET, GROUP BY or HAVING clauses (they can be defined in datasetFolds) and, unlike most select expressions, this one can only select whole columns, not expressions involving columns. So X will work, but not X + 1. If you need derived values in the query, create a dataset with the derived columns as a previous step and use a query on that dataset instead.

algorithm
string

Algorithm to use to train classifier with. This must point to an entry in the configuration or configurationFile parameters. See the classifier configuration documentation for details.

configuration
JSON

Configuration object to use for the classifier. Each algorithm has its own parameters. If none is passed, then the configuration will be loaded from the configurationFile parameter. See the classifier configuration documentation for details.

configurationFile
string
"/opt/bin/classifiers.json"

File to load configuration from. This is a JSON file containing only objects, strings and numbers. If the configuration object is non-empty, then that will be used preferentially. See the classifier configuration documentation for details.

equalizationFactor
float
0.5

Amount to adjust weights so that all classes have an equal total weight. A value of 0 will not equalize weights at all. A value of 1 will ensure that the total weight for both positive and negative examples is exactly identical. A number in between will choose a balanced tradeoff. Typically the default of 0.5 is a good value to use for unbalanced probabilities. See the classifier configuration documentation for details.

modelFileUrlPattern
Url

URL where the model file (with extension '.cls') should be saved. It should include the string $runid that will be replaced by an identifier for each run, if using multiple dataset folds.

keepArtifacts
bool
false

If true, all procedures and intermediary datasets are kept.

evalTrain
bool
false

Run the evaluation on the training set. If true, the same performance statistics that are returned for the testing set will also be returned for the training set.

outputAccuracyDataset
bool
true

If true, an output dataset for scored examples will be created for each fold.

uniqueScoresOnly
bool
false

If outputAccuracyDataset is set and mode is set to boolean, setting this parameter to true will output a single row per unique score. This is useful if the test set is very large and aggregate statistics for each unique score are sufficient, for instance to generate a ROC curve. This has no effect for other values of mode.

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
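
As an illustration of the expected form for inputData (and testingDataOverride), a query with hypothetical column names x, y, z, purchased and wt on a hypothetical dataset my_dataset could be written as select {x, y, z} as features, purchased as label, wt as weight from my_dataset. Here {x, y, z} is the row of features, purchased is the label and wt is the optional weight of each example; all three are whole columns, as required by the restrictions above.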

Algorithm configuration

This procedure supports many training algorithms. Their configuration is explained on the classifier configuration page.
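
For instance, a classifier can be configured inline through the algorithm and configuration parameters. The fragment below is a minimal sketch only: the entry name my_dt is arbitrary, and the decision_tree parameters shown are placeholders to be checked against the classifier configuration page:

    "algorithm": "my_dt",
    "configuration": {
        "my_dt": {
            "type": "decision_tree",
            "max_depth": 8,
            "update_alg": "prob",
            "verbosity": 3
        }
    }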

Dataset Folds and Cross-validation

The experiment procedure supports k-fold cross-validation in a flexible way. Folds can be specified implicitly using the kfold parameter, or explicitly by passing an array of DatasetFoldConfig objects to the datasetFolds parameter, as follows:

DatasetFoldConfig

Field, Type, Default, Description

trainingWhere
string
"true"

The WHERE clause for which rows to include from the training dataset. This can be any expression involving the columns in the dataset.

testingWhere
string
"true"

The WHERE clause for which rows to include from the testing dataset. This can be any expression involving the columns in the dataset.

trainingOffset
int
0

How many rows to skip before using data.

trainingLimit
int
-1

How many rows of data to use. -1 (the default) means use all of the rest of the rows in the dataset after skipping OFFSET rows.

testingOffset
int
0

How many rows to skip before using data.

testingLimit
int
-1

How many rows of data to use. -1 (the default) means use all of the rest of the rows in the dataset after skipping OFFSET rows.

trainingOrderBy
SqlOrderByExpression
"rowHash()"

How to order the rows. This only has an effect when trainingOffset or trainingLimit are used.

testingOrderBy
SqlOrderByExpression
"rowHash()"

How to order the rows. This only has an effect when testingOffset or testingLimit are used.

Example: 3-fold cross-validation

To perform a 3-fold cross-validation, you can set the kfold parameter to 3 or, equivalently, set the datasetFolds parameter to the following value:

[
    {
        "trainingWhere": "rowHash() % 3 != 0",
        "testingWhere": "rowHash() % 3 = 0"
    },
    {
        "trainingWhere": "rowHash() % 3 != 1",
        "testingWhere": "rowHash() % 3 = 1"
    },
    {
        "trainingWhere": "rowHash() % 3 != 2",
        "testingWhere": "rowHash() % 3 = 2"
    }
]
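
The datasetFolds parameter also accepts splits that kfold cannot express. For example, a single fold that trains on roughly three quarters of the rows and tests on the remaining quarter could be sketched as follows (the modulo idiom mirrors the example above and is illustrative only):

[
    {
        "trainingWhere": "rowHash() % 4 != 0",
        "testingWhere": "rowHash() % 4 = 0"
    }
]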

Training and Testing Set Generation

The classifier.experiment procedure will run multiple rounds of training and testing, based on the settings of the inputData, kfold, datasetFolds and testingDataOverride parameters.

The procedure first assembles a DatasetFoldConfig using the following rules:

  1. if datasetFolds is specified, that is the DatasetFoldConfig
  2. else if kfold is specified, a DatasetFoldConfig is generated as per the example above
  3. else if testingDataOverride is specified, the DatasetFoldConfig is [{"trainingWhere": "true", "testingWhere": "true"}] (i.e. the procedure will train on inputData and test on testingDataOverride; see the sketch after this list)
  4. else (i.e. base case of only inputData specified) the DatasetFoldConfig is [{"trainingWhere": "rowHash() % 2 != 1", "testingWhere": "rowHash() % 2 = 1"}] (i.e. the procedure will train on half of inputData and test on the other half)
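
For instance, rule 3 above corresponds to the common case of evaluating on a held-out dataset separate from the training data. A minimal sketch of the two relevant parameters, with the dataset names train_ds and test_ds used purely as placeholders:

    "inputData": "select {x, y, z} as features, purchased as label from train_ds",
    "testingDataOverride": "select {x, y, z} as features, purchased as label from test_ds"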

The procedure then assembles training and testing sets for each entry in the DatasetFoldConfig:

  1. The training set will be the result of the query built by combining inputData with the trainingWhere, trainingOffset, trainingLimit and trainingOrderBy parameters of the DatasetFoldConfig entry
  2. The testing set will be the result of the query built by combining inputData (or testingDataOverride, if specified) with the testingWhere, testingOffset, testingLimit and testingOrderBy parameters of the DatasetFoldConfig entry. The procedure will automatically use the classifier function generated by the training step and call it with the features in the testing query to generate a score to compare to the label.
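
Conceptually, for the first fold of the 3-fold example above and an inputData of select {f1, f2} as features, x as label from ds, the training set is the result of a query equivalent to select {f1, f2} as features, x as label from ds where rowHash() % 3 != 0, and the testing set is the same query with the clause rowHash() % 3 = 0. The trained classifier function is then applied to the features of each testing row and its score is compared to the label.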

Output

The output will contain performance metrics over each fold that was tested. See the classifier.test procedure type page for a sample output.

An aggregated version of the metrics over all folds is also provided. The aggregated results blob will have the same structure as each fold but every numeric metric will be replaced by an object containing standard statistics (min, max, mean, std). In the example below, the auc metric is used to illustrate the aggregation.

The following example would be for a 2-fold run:

{
    "id" : "<id>",
    "runFinished" : "...",
    "runStarted" : "...",
    "state" : "finished",
    "status" : {
        "aggregatedTest": {
            "auc": {
                "max": x,
                "mean": y,
                "min": z,
                "std": k
            },
            ...
        },
        "folds": [
            {
                "accuracyDataset" : <id of fold 1 accuracy dataset if it was generated>,
                "modelFileUrl" = <path of fold 1 model file>,
                "functionName" = <name of fold 1 scorer function>,
                "resultsTest" : { <classifier.test output for fold 1> },
                "fold": { <datasetFold used for fold 1> }
            },
            {
                "accuracyDataset" : <id of fold 2 accuracy dataset if it was generated>,
                "modelFileUrl" = <path of fold 2 model file>,
                "functionName" = <name of fold 2 scorer function>,
                "resultsTest" : { <classifier.test output for fold 2> },
                "fold": { <datasetFold used for fold 2> }
            }
        ]
    }
}

If the training set is also evaluated, the additional keys aggregatedTrain and resultsTrain will also be returned and will have the same structure as the results for the testing set.
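
The structure above is the run object itself, so these metrics can be read directly from the response when a run is launched. The snippet below is a minimal sketch, assuming the procedure was created under the name my_experiment with runOnCreation set to false, and that mldb is a pymldb-style connection whose responses expose a .json() method; these names and the client library are assumptions, not part of this procedure's interface:

# launch a run of the hypothetical procedure and read the aggregated AUC across folds
result = mldb.post("/v1/procedures/my_experiment/runs", {})
mean_auc = result.json()["status"]["aggregatedTest"]["auc"]["mean"]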

Examples
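
The following is a minimal sketch of a complete 3-fold experiment on a boolean label. The dataset my_dataset, its columns, the model file location and the decision_tree configuration are placeholder assumptions and should be adapted; see the classifier configuration page for the available algorithm parameters:

mldb.put("/v1/procedures/my_experiment", {
    "type": "classifier.experiment",
    "params": {
        "experimentName": "my_experiment",
        "mode": "boolean",
        "inputData": "select {x, y, z} as features, purchased as label from my_dataset",
        "kfold": 3,
        "algorithm": "my_dt",
        "configuration": {
            "my_dt": {
                "type": "decision_tree",
                "max_depth": 8,
                "update_alg": "prob",
                "verbosity": 3
            }
        },
        "modelFileUrlPattern": "file://my_model_$runid.cls",
        "evalTrain": true,
        "runOnCreation": true
    }
})

Because runOnCreation is true, the response will include a firstRun field for the run that was launched, whose results follow the structure described in the Output section above.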

See also