Classifier Experiment Procedure

The classifier experiment procedure is used to train and evaluate a classifier in a single run. It wraps the classifier.train and classifier.test procedure types into a single, easier-to-use procedure. It can also perform k-fold cross-validation by specifying multiple folds over the data to use for training and testing.

The classifier.experiment procedure will run multiple rounds of training and testing, based on the settings of the inputData, kfold, datasetFolds and testingDataOverride parameters, according to the steps laid out below.

Configuration

A new procedure of type classifier.experiment named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "classifier.experiment",
    "params": {
        "experimentName": <string>,
        "mode": <ClassifierMode>,
        "multilabelStrategy": <MultilabelStrategy>,
        "recallOverN": <ARRAY [ int ]>,
        "inputData": <InputQuery>,
        "kfold": <int>,
        "datasetFolds": <ARRAY [ DatasetFoldConfig ]>,
        "testingDataOverride": <InputQuery (Optional)>,
        "algorithm": <string>,
        "configuration": <JSON>,
        "configurationFile": <string>,
        "equalizationFactor": <float>,
        "modelFileUrlPattern": <Url>,
        "keepArtifacts": <bool>,
        "evalTrain": <bool>,
        "outputAccuracyDataset": <bool>,
        "uniqueScoresOnly": <bool>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

experimentName
string

A string without spaces which will be used to name the various datasets, procedures and functions created when this procedure runs.

mode
ClassifierMode
"boolean"

Model mode: boolean, regression, categorical or multilabel. Controls how the label is interpreted and what the classifier outputs.

multilabelStrategy
MultilabelStrategy
"one-vs-all"

Multilabel strategy: random, decompose or one-vs-all. Controls how examples are prepared to handle multilabel classification.

  • random will select a label at random among the example's set
  • decompose will decompose the multilabel combination into single-label examples
  • one-vs-all will train a probabilized binary classifier for each label

This only applies if mode is equal to multilabel.

recallOverN
ARRAY [ int ]

Calculate a recall score over the top scoring labels. Does not apply to boolean or regression modes.

inputData
InputQuery

SQL query which specifies the features, labels and optional weights for the training and testing procedures. This query is used to create a training and testing set according to the steps laid out below.

The query should be of the form select {f1, f2} as features, x as label from ds.

The select expression must contain these two columns:

  • features: a row expression to identify the features on which to train, and
  • label: one expression to identify the row's label(s), and whose type must match that of the classifier mode. Rows with null labels will be ignored.
    • boolean mode: a boolean (0 or 1)
    • regression mode: a real number
    • categorical mode: any combination of numbers and strings
    • multilabel mode: a row, in which each non-null column is a separate label

The select expression can contain an optional weight column. The weight allows the relative importance of examples to be set. It must be a real number. If the weight is not specified, each row will have a weight of 1. Rows with a null weight will cause a training error. An example query including a weight column is sketched below, after the parameter definitions.

The query must not contain WHERE, LIMIT, OFFSET, GROUP BY or HAVING clauses (they can be defined in datasetFolds) and, unlike most select expressions, this one can only select whole columns, not expressions involving columns. So X will work, but not X + 1. If you need derived values in the query, create a dataset with the derived columns as a previous step and use a query on that dataset instead.

kfold
int
0

Do a k-fold cross-validation. This is a helper parameter that generates a DatasetFoldConfig splitting the dataset into k subsamples, each fold testing on one kth of the input data and training on the rest.

A value of 0 (the default) means k-fold cross-validation is not used, and 1 is not a valid value. This parameter cannot be specified at the same time as the datasetFolds parameter and, in general, should not be used at the same time as the testingDataOverride parameter.

datasetFolds
ARRAY [ DatasetFoldConfig ]

DatasetFoldConfig to use. This parameter can be used if the dataset folds required are more complex than a simple k-fold cross-validation. It cannot be specified at the same time as the kfold parameter and in general should not be used at the same time as the testingDataOverride parameter.

testingDataOverride
InputQuery (Optional)

SQL query which overrides the input data for the testing procedure. This optional parameter must be of the same form as the inputData parameter above, and by default takes the same value as inputData.

This query is used to create a test set according to the steps laid out below.

This parameter is useful when it is necessary to test on data contained in a different dataset from the training data, or to calculate accuracy statistics with uneven weighting, for example to counteract the effect of non-uniform sampling in the inputData.

The query must not contain WHERE, LIMIT, OFFSET, GROUP BY or HAVING clauses (they can be defined in datasetFolds) and, unlike most select expressions, this one can only select whole columns, not expressions involving columns. So X will work, but not X + 1. If you need derived values in the query, create a dataset with the derived columns as a previous step and use a query on that dataset instead.

algorithm
string

Algorithm to use to train classifier with. This must point to an entry in the configuration or configurationFile parameters. See the classifier configuration documentation for details.

configuration
JSON

Configuration object to use for the classifier. Each algorithm has its own parameters. If none is passed, then the configuration will be loaded from the configurationFile parameter. See the classifier configuration documentation for details.

configurationFile
string
"/opt/bin/classifiers.json"

File to load configuration from. This is a JSON file containing only objects, strings and numbers. If the configuration object is non-empty, then that will be used preferentially. See the classifier configuration documentation for details.

equalizationFactor
float
0.5

Amount to adjust weights so that all classes have an equal total weight. A value of 0 will not equalize weights at all. A value of 1 will ensure that the total weight for both positive and negative examples is exactly identical. A number in between will choose a balanced tradeoff. Typically the default of 0.5 is a good value to use for unbalanced probabilities. See the classifier configuration documentation for details.

modelFileUrlPattern
Url

URL where the model file (with extension '.cls') should be saved. It should include the string $runid that will be replaced by an identifier for each run, if using multiple dataset folds.

keepArtifacts
bool
false

If true, all procedures and intermediary datasets are kept.

evalTrain
bool
false

Run the evaluation on the training set. If true, the same performance statistics that are returned for the testing set will also be returned for the training set.

outputAccuracyDataset
bool
true

If true, an output dataset for scored examples will be created for each fold.

uniqueScoresOnly
bool
false

If outputAccuracyDataset is set and mode is set to boolean, setting this parameter to true will output a single row per unique score. This is useful if the test set is very large and aggregate statistics for each unique score are sufficient, for instance to generate a ROC curve. This has no effect for other values of mode.

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
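
As an illustration of the expected form for inputData (and testingDataOverride), a query with hypothetical column names x, y, z, purchased and wt on a hypothetical dataset my_dataset could be written as select {x, y, z} as features, purchased as label, wt as weight from my_dataset. Here {x, y, z} is the row of features, purchased is the label and wt is the optional weight of each example; all three are whole columns, as required by the restrictions above.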

Algorithm configuration

This procedure supports many training algorithms. Their configuration is explained on the classifier configuration page.
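
For instance, a classifier can be configured inline through the algorithm and configuration parameters. The fragment below is a minimal sketch only: the entry name my_dt is arbitrary, and the decision_tree parameters shown are placeholders to be checked against the classifier configuration page:

    "algorithm": "my_dt",
    "configuration": {
        "my_dt": {
            "type": "decision_tree",
            "max_depth": 8,
            "update_alg": "prob",
            "verbosity": 3
        }
    }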

Dataset Folds and Cross-validation

The experiment procedure supports k-fold cross-validation in a flexible way. Folds can be specified implicitly using the kfold parameter, or explicitly by passing an array of DatasetFoldConfig objects to the datasetFolds parameter, as follows:

DatasetFoldConfig

Field, Type, Default, Description

trainingWhere
string
"true"

The WHERE clause for which rows to include from the training dataset. This can be any expression involving the columns in the dataset.

testingWhere
string
"true"

The WHERE clause for which rows to include from the testing dataset. This can be any expression involving the columns in the dataset.

trainingOffset
int
0

How many rows to skip before using data.

trainingLimit
int
-1

How many rows of data to use. -1 (the default) means use all of the rest of the rows in the dataset after skipping OFFSET rows.

testingOffset
int
0

How many rows to skip before using data.

testingLimit
int
-1

How many rows of data to use. -1 (the default) means use all of the rest of the rows in the dataset after skipping OFFSET rows.

trainingOrderBy
SqlOrderByExpression
"rowHash()"

How to order the rows. This only has an effect when trainingOffset or trainingLimit are used.

testingOrderBy
SqlOrderByExpression
"rowHash()"

How to order the rows. This only has an effect when testingOffset or testingLimit are used.

Example: 3-fold cross-validation

To perform a 3-fold cross-validation, you can set the kfold parameter to 3 or, equivalently, set the datasetFolds parameter to the following value:

[
    {
        "trainingWhere": "rowHash() % 3 != 0",
        "testingWhere": "rowHash() % 3 = 0"
    },
    {
        "trainingWhere": "rowHash() % 3 != 1",
        "testingWhere": "rowHash() % 3 = 1"
    },
    {
        "trainingWhere": "rowHash() % 3 != 2",
        "testingWhere": "rowHash() % 3 = 2"
    }
]
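
The datasetFolds parameter also accepts splits that kfold cannot express. For example, a single fold that trains on roughly three quarters of the rows and tests on the remaining quarter could be sketched as follows (the modulo idiom mirrors the example above and is illustrative only):

[
    {
        "trainingWhere": "rowHash() % 4 != 0",
        "testingWhere": "rowHash() % 4 = 0"
    }
]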

Training and Testing Set Generation

The classifier.experiment procedure will run multiple rounds of training and testing, based on the settings of the inputData, kfold, datasetFolds and testingDataOverride parameters.

The procedure first assembles a DatasetFoldConfig using the following rules:

  1. if datasetFolds is specified, that is the DatasetFoldConfig
  2. else if kfold is specified, a DatasetFoldConfig is generated as per the example above
  3. else if testingDataOverride is specified, the DatasetFoldConfig is [{"trainingWhere": "true", "testingWhere": "true"}] (i.e. the procedure will train on inputData and test on testingDataOverride; see the sketch after this list)
  4. else (i.e. base case of only inputData specified) the DatasetFoldConfig is [{"trainingWhere": "rowHash() % 2 != 1", "testingWhere": "rowHash() % 2 = 1"}] (i.e. the procedure will train on half of inputData and test on the other half)
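
For instance, rule 3 above corresponds to the common case of evaluating on a held-out dataset separate from the training data. A minimal sketch of the two relevant parameters, with the dataset names train_ds and test_ds used purely as placeholders:

    "inputData": "select {x, y, z} as features, purchased as label from train_ds",
    "testingDataOverride": "select {x, y, z} as features, purchased as label from test_ds"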

The procedure then assembles training and testing sets for each entry in the DatasetFoldConfig:

  1. The training set will be the result of the query built by combining inputData with the trainingWhere, trainingOffset, trainingLimit and trainingOrderBy parameters of the DatasetFoldConfig entry
  2. The testing set will be the result of the query built by combining inputData (or testingDataOverride, if specified) with the testingWhere, testingOffset, testingLimit and testingOrderBy parameters of the DatasetFoldConfig entry. The procedure will automatically use the classifier function generated by the training step and call it with the features in the testing query to generate a score to compare to the label.
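
Conceptually, for the first fold of the 3-fold example above and an inputData of select {f1, f2} as features, x as label from ds, the training set is the result of a query equivalent to select {f1, f2} as features, x as label from ds where rowHash() % 3 != 0, and the testing set is the same query with the clause rowHash() % 3 = 0. The trained classifier function is then applied to the features of each testing row and its score is compared to the label.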

Output

The output will contain performance metrics over each fold that was tested. See the classifier.test procedure type page for a sample output.

An aggregated version of the metrics over all folds is also provided. The aggregated results blob will have the same structure as each fold but every numeric metric will be replaced by an object containing standard statistics (min, max, mean, std). In the example below, the auc metric is used to illustrate the aggregation.

The following example would be for a 2-fold run:

{
    "id" : "<id>",
    "runFinished" : "...",
    "runStarted" : "...",
    "state" : "finished",
    "status" : {
        "aggregatedTest": {
            "auc": {
                "max": x,
                "mean": y,
                "min": z,
                "std": k
            },
            ...
        },
        "folds": [
            {
                "accuracyDataset" : <id of fold 1 accuracy dataset if it was generated>,
                "modelFileUrl" = <path of fold 1 model file>,
                "functionName" = <name of fold 1 scorer function>,
                "resultsTest" : { <classifier.test output for fold 1> },
                "fold": { <datasetFold used for fold 1> }
            },
            {
                "accuracyDataset" : <id of fold 2 accuracy dataset if it was generated>,
                "modelFileUrl" = <path of fold 2 model file>,
                "functionName" = <name of fold 2 scorer function>,
                "resultsTest" : { <classifier.test output for fold 2> },
                "fold": { <datasetFold used for fold 2> }
            }
        ]
    }
}

If the training set is also evaluated, the additional keys aggregatedTrain and resultsTrain will also be returned and will have the same structure as the results for the testing set.
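
The structure above is the run object itself, so these metrics can be read directly from the response when a run is launched. The snippet below is a minimal sketch, assuming the procedure was created under the name my_experiment with runOnCreation set to false, and that mldb is a pymldb-style connection whose responses expose a .json() method; these names and the client library are assumptions, not part of this procedure's interface:

# launch a run of the hypothetical procedure and read the aggregated AUC across folds
result = mldb.post("/v1/procedures/my_experiment/runs", {})
mean_auc = result.json()["status"]["aggregatedTest"]["auc"]["mean"]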

Examples
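
The following is a minimal sketch of a complete 3-fold experiment on a boolean label. The dataset my_dataset, its columns, the model file location and the decision_tree configuration are placeholder assumptions and should be adapted; see the classifier configuration page for the available algorithm parameters:

mldb.put("/v1/procedures/my_experiment", {
    "type": "classifier.experiment",
    "params": {
        "experimentName": "my_experiment",
        "mode": "boolean",
        "inputData": "select {x, y, z} as features, purchased as label from my_dataset",
        "kfold": 3,
        "algorithm": "my_dt",
        "configuration": {
            "my_dt": {
                "type": "decision_tree",
                "max_depth": 8,
                "update_alg": "prob",
                "verbosity": 3
            }
        },
        "modelFileUrlPattern": "file://my_model_$runid.cls",
        "evalTrain": true,
        "runOnCreation": true
    }
})

Because runOnCreation is true, the response will include a firstRun field for the run that was launched, whose results follow the structure described in the Output section above.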

See also