The `classifier.experiment` procedure type is used to train and evaluate a classifier in a single run. It wraps the `classifier.train` and `classifier.test` procedure types into a single, easier-to-use procedure. It can also perform k-fold cross-validation by specifying multiple folds over the data to use for training and testing.

The `classifier.experiment` procedure will run multiple rounds of training and testing, based on the settings of the `inputData`, `kfold`, `datasetFolds` and `testingDataOverride` parameters, according to the steps laid out below.
A new procedure of type `classifier.experiment` named `<id>` can be created as follows:
```
mldb.put("/v1/procedures/"+<id>, {
    "type": "classifier.experiment",
    "params": {
        "experimentName": <string>,
        "mode": <ClassifierMode>,
        "multilabelStrategy": <MultilabelStrategy>,
        "recallOverN": <ARRAY [ int ]>,
        "inputData": <InputQuery>,
        "kfold": <int>,
        "datasetFolds": <ARRAY [ DatasetFoldConfig ]>,
        "testingDataOverride": <InputQuery (Optional)>,
        "algorithm": <string>,
        "configuration": <JSON>,
        "configurationFile": <string>,
        "equalizationFactor": <float>,
        "modelFileUrlPattern": <Url>,
        "keepArtifacts": <bool>,
        "evalTrain": <bool>,
        "outputAccuracyDataset": <bool>,
        "uniqueScoresOnly": <bool>,
        "runOnCreation": <bool>
    }
})
```
with the following key-value definitions for `params`:
Field, Type, Default | Description
---|---
`experimentName` | A string without spaces which will be used to name the various datasets, procedures and functions created when this procedure runs.
`mode` | Model mode.
`multilabelStrategy` | Multilabel strategy.
`recallOverN` | Calculate a recall score over the top-scoring labels. Does not apply to boolean or regression modes.
`inputData` | SQL query which specifies the features, labels and optional weights for the training and testing procedures. This query is used to create the training and testing sets according to the steps laid out below. The select expression must contain a `features` column and a `label` column, and may contain an optional `weight` column. The query must not contain …
`kfold` | Do a k-fold cross-validation. This is a helper parameter that generates a DatasetFoldConfig splitting the dataset into k subsamples, each fold testing on one subsample and training on the others. 0 means it is not used and 1 is an invalid number. It cannot be specified at the same time as the `datasetFolds` parameter.
`datasetFolds` | DatasetFoldConfig to use. This parameter can be used if the dataset folds required are more complex than a simple k-fold cross-validation. It cannot be specified at the same time as the `kfold` parameter.
`testingDataOverride` | SQL query which overrides the input data for the testing procedure. This optional parameter must be of the same form as `inputData`, and is used to create the test set according to the steps laid out below. It is useful when it is necessary to test on data contained in a different dataset from the training data, or to calculate accuracy statistics with uneven weighting, for example to counteract the effect of non-uniform sampling in the … The query must not contain …
`algorithm` | Algorithm to use to train the classifier. This must point to an entry in the `configuration` or `configurationFile` parameters. See the classifier configuration documentation for details.
`configuration` | Configuration object to use for the classifier. Each algorithm has its own parameters. If none is passed, the configuration will be loaded from the `configurationFile` parameter. See the classifier configuration documentation for details.
`configurationFile` | File to load the configuration from. This is a JSON file containing only objects, strings and numbers. If the `configuration` object is non-empty, it will be used preferentially. See the classifier configuration documentation for details.
`equalizationFactor` | Amount to adjust weights so that all classes have an equal total weight. A value of 0 (the default) will not equalize weights at all. A value of 1 will ensure that the total weight for both positive and negative examples is exactly identical. A number between 0 and 1 will choose a balanced tradeoff; typically 0.5 is a good number to use for unbalanced probabilities. See the classifier configuration documentation for details.
`modelFileUrlPattern` | URL where the model file (with extension '.cls') should be saved. It should include the string `$runid`, which will be replaced by an identifier for each run when using multiple dataset folds.
`keepArtifacts` | If true, all procedures and intermediary datasets are kept.
`evalTrain` | Run the evaluation on the training set. If true, the same performance statistics that are returned for the testing set will also be returned for the training set.
`outputAccuracyDataset` | If true, an output dataset of scored examples will be created for each fold.
`uniqueScoresOnly` | If …
`runOnCreation` | If true, the procedure will be run immediately. The response will contain an extra field called …
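As an illustrative sketch only: the dataset `my_dataset`, its `label` column, the decision-tree settings and the output path below are hypothetical placeholders to be adapted to your own data, not defaults of this procedure; see the classifier configuration documentation for the actual algorithm options.

```
mldb.put("/v1/procedures/my_experiment", {
    "type": "classifier.experiment",
    "params": {
        "experimentName": "my_experiment",
        "mode": "boolean",
        "inputData": "SELECT {* EXCLUDING (label)} AS features, label FROM my_dataset",
        "kfold": 3,
        "algorithm": "my_dt",
        "configuration": {
            "my_dt": {
                "type": "decision_tree",
                "max_depth": 8
            }
        },
        "modelFileUrlPattern": "file://my_experiment_$runid.cls"
    }
})
```

Here `algorithm` points at the `my_dt` entry of the `configuration` object, and the `$runid` token in `modelFileUrlPattern` is replaced for each of the three folds, so three model files are produced.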
This procedure supports many training algorithms. Their configuration is explained on the classifier configuration page.

The experiment procedure supports k-fold cross-validation in a flexible way. Folds can be specified implicitly using the `kfold` parameter, or explicitly by passing a DatasetFoldConfig to the `datasetFolds` parameter, with the following fields:
Field, Type, Default | Description
---|---
`trainingWhere` | The WHERE clause for which rows to include from the training dataset. This can be any expression involving the columns in the dataset.
`testingWhere` | The WHERE clause for which rows to include from the testing dataset. This can be any expression involving the columns in the dataset.
`trainingOffset` | How many rows to skip before using data.
`trainingLimit` | How many rows of data to use. -1 (the default) means use all of the rest of the rows in the dataset after skipping OFFSET rows.
`testingOffset` | How many rows to skip before using data.
`testingLimit` | How many rows of data to use. -1 (the default) means use all of the rest of the rows in the dataset after skipping OFFSET rows.
`trainingOrderBy` | How to order the rows. This only has an effect when `trainingOffset` or `trainingLimit` is used.
`testingOrderBy` | How to order the rows. This only has an effect when `testingOffset` or `testingLimit` is used.
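For instance, a single sequential split, training on earlier rows and testing on later ones, could be sketched as the following `datasetFolds` value; the `ts` ordering column and the row counts are hypothetical placeholders:

```
[
    {
        "trainingWhere": "true",
        "trainingOrderBy": "ts",
        "trainingOffset": 0,
        "trainingLimit": 75000,
        "testingWhere": "true",
        "testingOrderBy": "ts",
        "testingOffset": 75000,
        "testingLimit": -1
    }
]
```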
To perform a 3-fold cross-validation, you can set the `kfold` parameter to 3 or, equivalently, set the `datasetFolds` parameter to the following value:
```
[
    {
        "trainingWhere": "rowHash() % 3 != 0",
        "testingWhere": "rowHash() % 3 = 0"
    },
    {
        "trainingWhere": "rowHash() % 3 != 1",
        "testingWhere": "rowHash() % 3 = 1"
    },
    {
        "trainingWhere": "rowHash() % 3 != 2",
        "testingWhere": "rowHash() % 3 = 2"
    }
]
```
The `classifier.experiment` procedure will run multiple rounds of training and testing, based on the settings of the `inputData`, `kfold`, `datasetFolds` and `testingDataOverride` parameters.

The procedure first assembles a DatasetFoldConfig using the following rules:

- If `datasetFolds` is specified, that is the DatasetFoldConfig.
- Otherwise, if `kfold` is specified, a DatasetFoldConfig is generated as per the example above.
- Otherwise, if `testingDataOverride` is specified, the DatasetFoldConfig is `[{"trainingWhere": "true", "testingWhere": "true"}]` (i.e. the procedure will train on `inputData` and test on `testingDataOverride`).
- Otherwise (i.e. only `inputData` is specified), the DatasetFoldConfig is `[{"trainingWhere": "rowHash() % 2 != 1", "testingWhere": "rowHash() % 2 = 1"}]` (i.e. the procedure will train on half of `inputData` and test on the other half).
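As a sketch of the third rule above, the following call trains on one dataset and tests on a separate hold-out dataset by supplying only `inputData` and `testingDataOverride`; the dataset names, algorithm entry and output path are hypothetical placeholders:

```
mldb.put("/v1/procedures/my_holdout_experiment", {
    "type": "classifier.experiment",
    "params": {
        "experimentName": "my_holdout_experiment",
        "mode": "boolean",
        "inputData": "SELECT {* EXCLUDING (label)} AS features, label FROM train_dataset",
        "testingDataOverride": "SELECT {* EXCLUDING (label)} AS features, label FROM holdout_dataset",
        "algorithm": "my_dt",
        "configuration": {
            "my_dt": {
                "type": "decision_tree",
                "max_depth": 8
            }
        },
        "modelFileUrlPattern": "file://my_holdout_experiment_$runid.cls"
    }
})
```

Since neither `kfold` nor `datasetFolds` is given, the resulting DatasetFoldConfig is the single fold `[{"trainingWhere": "true", "testingWhere": "true"}]` described above.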
The procedure then assembles training and testing sets for each entry in the DatasetFoldConfig:

- The training set is drawn from `inputData`, filtered with the `trainingWhere`, `trainingOffset`, `trainingLimit` and `trainingOrderBy` parameters of the DatasetFoldConfig entry.
- The testing set is drawn from `inputData` (or `testingDataOverride` if specified), filtered with the `testingWhere`, `testingOffset`, `testingLimit` and `testingOrderBy` parameters of the DatasetFoldConfig entry. The procedure will automatically use the classifier function generated by the training and call it with the features in the testing query to generate a score to compare to the label.

The output will contain performance metrics over each fold that was tested. See the `classifier.test` procedure type page for a sample output.
An aggregated version of the metrics over all folds is also provided. The aggregated
results blob will have the same structure as each fold but every numeric metric will
be replaced by an object containing standard statistics (min, max, mean, std).
In the example below, the `auc` metric is used to illustrate the aggregation. The following example would be for a 2-fold run:
```
{
    "id" : "<id>",
    "runFinished" : "...",
    "runStarted" : "...",
    "state" : "finished",
    "status" : {
        "aggregatedTest": {
            "auc": {
                "max": x,
                "mean": y,
                "min": z,
                "std": k
            },
            ...
        },
        "folds": [
            {
                "accuracyDataset" : <id of fold 1 accuracy dataset if it was generated>,
                "modelFileUrl" : <path of fold 1 model file>,
                "functionName" : <name of fold 1 scorer function>,
                "resultsTest" : { <classifier.test output for fold 1> },
                "fold": { <datasetFold used for fold 1> }
            },
            {
                "accuracyDataset" : <id of fold 2 accuracy dataset if it was generated>,
                "modelFileUrl" : <path of fold 2 model file>,
                "functionName" : <name of fold 2 scorer function>,
                "resultsTest" : { <classifier.test output for fold 2> },
                "fold": { <datasetFold used for fold 2> }
            }
        ]
    }
}
```
If the training set is also evaluated (`evalTrain` set to true), the additional keys `aggregatedTrain` and `resultsTrain` will also be returned, with the same structure as the results for the testing set.
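As a rough sketch of reading these results back, assuming a pymldb-style `mldb` connection object whose REST calls return responses exposing a `.json()` method, and the hypothetical `my_experiment` procedure created earlier:

```
# trigger a run of the previously created procedure
run = mldb.post("/v1/procedures/my_experiment/runs", {})
status = run.json()["status"]

# aggregated mean AUC across all folds, following the structure shown above
print(status["aggregatedTest"]["auc"]["mean"])

# per-fold test metrics, one entry per DatasetFoldConfig entry
for fold in status["folds"]:
    print(fold["fold"], fold["resultsTest"])
```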