The classifier testing procedure allows the accuracy of a binary classifier, multi-class classifier or regressor to be tested against held-out data. The output of this procedure is a dataset which contains the scores and statistics resulting from the application of a classifier to some input data.
A new procedure of type `classifier.test` named `<id>` can be created as follows:
```python
mldb.put("/v1/procedures/"+<id>, {
    "type": "classifier.test",
    "params": {
        "mode": <ClassifierMode>,
        "testingData": <InputQuery>,
        "outputDataset": <OutputDatasetSpec (Optional)>,
        "uniqueScoresOnly": <bool>,
        "recallOverN": <ARRAY [ int ]>,
        "runOnCreation": <bool>
    }
})
```
with the following key-value definitions for `params`:
| Field | Type | Description |
|---|---|---|
| `mode` | `ClassifierMode` | Model mode: `boolean`, `categorical`, `regression` or `multilabel`. Controls how the label is interpreted and what the procedure outputs. |
| `testingData` | `InputQuery` | SQL query which specifies the scores, labels and optional weights for evaluation. The select expression must contain a `score` column (the score the classifier assigned to the row) and a `label` column (the row's actual label), and may contain an optional `weight` column. The query must not contain `GROUP BY` or `HAVING` clauses. |
| `outputDataset` | `OutputDatasetSpec` (optional) | Output dataset for scored examples. The score for each test example will be written to this dataset. Examples with the same score are grouped together when `mode` is `boolean`. |
| `uniqueScoresOnly` | `bool` | If `outputDataset` is set and `mode` is `boolean`, output a single row per unique score rather than one row per test example. |
| `recallOverN` | `ARRAY [ int ]` | Calculate a recall score over the top-scoring labels. Does not apply to `boolean` or `regression` modes. |
| `runOnCreation` | `bool` | If `true`, the procedure will be run immediately. The response will contain an extra field called `firstRun` pointing to the URL of the run. |
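For example, here is a minimal sketch of testing a previously trained boolean classifier. The function name `my_classifier`, the dataset names `test_data` and `my_test_results`, and the procedure name `test_my_classifier` are illustrative assumptions, not names defined by MLDB, and the exact `testingData` query depends on how the classifier function expects its input (such a function would typically be created by the `classifier.train` procedure).

```python
# A hedged sketch: score each held-out row with an already-trained classifier
# function and compare the score against the true label.
# "my_classifier", "test_data" and "my_test_results" are assumed names.
mldb.put("/v1/procedures/test_my_classifier", {
    "type": "classifier.test",
    "params": {
        "mode": "boolean",
        # Apply the classifier function to every column except the label,
        # and pair its score with the row's true label.
        "testingData": """
            SELECT my_classifier({features: {* EXCLUDING (label)}})[score] AS score,
                   label
            FROM test_data
        """,
        "outputDataset": {"id": "my_test_results", "type": "tabular"},
        "runOnCreation": True
    }
})
```

With `runOnCreation` set to `true`, the procedure runs immediately and its accuracy summary can then be retrieved from the run resource as shown below.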
After this procedure has been run, a summary of the accuracy can be obtained via
```python
mldb.get("/v1/procedures/<id>/runs/<runid>")
```
The `status` field will contain statistics relevant to the model's mode.

In `boolean` mode, the `status` field will contain the Area Under the Curve under the key `auc`, along with the performance statistics (e.g. precision, recall) for the classifier using the thresholds which give the best MCC and the best F-score, under the keys `bestMcc` and `bestF1Score`, respectively.
Here is a sample output:
```json
{
    "status": {
        "bestMcc": {
            "pr": {
                "recall": 0.6712328767123288,
                "precision": 0.8448275862068966,
                "f1Score": 0.7480916030534351,
                "accuracy": 0.8196721311
            },
            "mcc": 0.6203113512927362,
            "gain": 2.117855455833727,
            "threshold": 0.6341791749000549,
            "counts": {
                "falseNegatives": 24.0,
                "truePositives": 49.0,
                "trueNegatives": 101.0,
                "falsePositives": 9.0
            },
            "population": {
                "included": 58.0,
                "excluded": 125.0
            }
        },
        "auc": 0.8176836861768365,
        "bestF1Score": {
            "pr": {
                "recall": 0.6712328767123288,
                "precision": 0.8448275862068966,
                "f1Score": 0.7480916030534351,
                "accuracy": 0.8196721311
            },
            "mcc": 0.6203113512927362,
            "gain": 2.117855455833727,
            "threshold": 0.6341791749000549,
            "counts": {
                "falseNegatives": 24.0,
                "truePositives": 49.0,
                "trueNegatives": 101.0,
                "falsePositives": 9.0
            },
            "population": {
                "included": 58.0,
                "excluded": 125.0
            }
        }
    },
    "state": "finished"
}
```
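To make the relationship between the `counts` block and the reported statistics explicit, the following sketch recomputes precision, recall, F1 score, accuracy and MCC from the confusion-matrix counts in the sample output above. These are the standard definitions of the metrics, not MLDB code:

```python
from math import sqrt

# Confusion-matrix counts from the "counts" block of the sample output above.
tp, fp, tn, fn = 49.0, 9.0, 101.0, 24.0

precision = tp / (tp + fp)                                 # 0.8448...
recall = tp / (tp + fn)                                    # 0.6712...
f1_score = 2 * precision * recall / (precision + recall)   # 0.7480...
accuracy = (tp + tn) / (tp + tn + fp + fn)                 # 0.8196...

# Matthews correlation coefficient, the quantity maximized by "bestMcc".
mcc = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))         # 0.6203...

print(precision, recall, f1_score, accuracy, mcc)
```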
The output dataset created by this procedure in `boolean` mode will contain one row per score, grouping together test set rows with the same score. The dataset will have the following columns:

* `score`: the score the classifier assigned to this row
* `label`: the row's actual label
* `weight`: the row's assigned weight
* `falseNegatives`, `trueNegatives`, `falsePositives`, `truePositives`
* `falsePositiveRate`, `truePositiveRate`, `precision`, `recall`, `accuracy`

Note that rows with the same score get grouped together.
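Because the boolean-mode output keeps one row per score together with rate columns, the points of a ROC curve can be read straight out of it. A hedged sketch, assuming the output dataset was named `my_test_results` and that a pymldb-style `mldb.query` helper returning a pandas DataFrame is available:

```python
# Pull ROC curve points (false positive rate vs. true positive rate) from
# the scored output dataset. "my_test_results" is an assumed name.
roc = mldb.query("""
    SELECT falsePositiveRate, truePositiveRate, score
    FROM my_test_results
    ORDER BY score
""")

# One row per unique score; plotting truePositiveRate against
# falsePositiveRate traces the ROC curve whose area is reported under
# "auc" in the run status.
roc.plot(x="falsePositiveRate", y="truePositiveRate")
```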
In `categorical` mode, the `status` field will contain a sparse confusion matrix along with performance statistics (e.g. precision, recall) for the classifier, where the label with the maximum score is chosen as the prediction for each example.

The value of `support` is the number of occurrences of that label. The `weightedStatistics` are the averages of the per-label statistics, weighted by each label's `support` value; the exception is `support` itself, which is summed rather than averaged.
Here is a sample output:
```json
{
    "status": {
        "labelStatistics": {
            "0": {
                "f1Score": 0.8000000143051146,
                "recall": 1.0,
                "support": 2,
                "precision": 0.6666666865348816,
                "accuracy": 1.0
            },
            "1": {
                "f1Score": 0.0,
                "recall": 0.0,
                "support": 1,
                "precision": 0.0,
                "accuracy": 0.0
            },
            "2": {
                "f1Score": 1.0,
                "recall": 1.0,
                "support": 2,
                "precision": 1.0,
                "accuracy": 1.0
            }
        },
        "weightedStatistics": {
            "f1Score": 0.7200000057220459,
            "recall": 0.8,
            "support": 5,
            "precision": 0.6666666746139527,
            "accuracy": 0.8
        },
        "confusionMatrix": [
            {"predicted": "0", "actual": "1", "count": 1},
            {"predicted": "0", "actual": "0", "count": 2},
            {"predicted": "2", "actual": "2", "count": 2}
        ]
    },
    "state": "finished"
}
```
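The `weightedStatistics` block in the sample above can be reproduced directly from the per-label statistics: each statistic is averaged with the label's `support` as weight, and `support` itself is summed. A short sketch using the numbers from the sample output:

```python
# Per-label statistics copied from the "labelStatistics" block above.
label_stats = {
    "0": {"precision": 0.6666667, "recall": 1.0, "f1Score": 0.8, "accuracy": 1.0, "support": 2},
    "1": {"precision": 0.0,       "recall": 0.0, "f1Score": 0.0, "accuracy": 0.0, "support": 1},
    "2": {"precision": 1.0,       "recall": 1.0, "f1Score": 1.0, "accuracy": 1.0, "support": 2},
}

total_support = sum(s["support"] for s in label_stats.values())   # 5

# Support-weighted average of each per-label statistic.
weighted = {
    stat: sum(s[stat] * s["support"] for s in label_stats.values()) / total_support
    for stat in ("precision", "recall", "f1Score", "accuracy")
}
weighted["support"] = total_support

# precision ~= 0.667, recall = 0.8, f1Score = 0.72, accuracy = 0.8, support = 5,
# matching the "weightedStatistics" block of the sample output.
print(weighted)
```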
The output dataset created by this procedure in `categorical` mode will contain one row per input row, with the same row name as in the input, and the following columns:

* `label`: the row's actual label
* `weight`: the row's assigned weight
* `score.x`: the score the classifier assigned to this row for label `x`
* `maxLabel`: the label with the maximum score

In `regression` mode, the `status` field will contain the following performance statistics: `mse`, `r2` and `quantileErrors`.
Here is a sample output:
```json
{
    "status": {
        "quantileErrors": {
            "0.25": 0.0,
            "0.5": 0.1428571428571428,
            "0.75": 0.1666666666666667,
            "0.9": 0.1666666666666667
        },
        "mse": 0.375,
        "r2": 0.9699681653424412
    },
    "state": "finished"
}
```
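The `mse` and `r2` statistics follow the usual definitions: the mean squared error between predicted and true values, and the coefficient of determination 1 − SS_res / SS_tot. A small sketch with made-up values, only to illustrate the formulas (these are not the data behind the sample run above):

```python
# Illustrative true values and predictions, not data from the sample run.
labels = [2.0, 4.0, 6.0, 8.0]   # actual values
scores = [2.5, 3.5, 6.5, 7.5]   # predicted values

n = len(labels)
mse = sum((y - s) ** 2 for y, s in zip(labels, scores)) / n   # 0.25

mean_label = sum(labels) / n
ss_res = sum((y - s) ** 2 for y, s in zip(labels, scores))
ss_tot = sum((y - mean_label) ** 2 for y in labels)
r2 = 1 - ss_res / ss_tot                                      # 0.95

print(mse, r2)
```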
The output dataset created by this procedure in `regression` mode will contain one row per input row, with the same row name as in the input, and the following columns:

* `label`: the row's actual value
* `score`: the predicted value
* `weight`: the row's assigned weight

In `multilabel` mode, the `status` field will contain the recall-over-top-N performance statistic for the classifier, where the labels with the highest scores are chosen as the predictions for each example. That is, each example registers a positive for each of its labels that is found among the set of highest-scoring labels returned by the classifier. The size of this set is determined by the `recallOverN` parameter; since `recallOverN` is an array, one recall figure is reported for each requested value of N. It follows that a recall of `1.0` cannot be obtained if any example contains more unique labels than the value of the `recallOverN` parameter. The `weightedStatistics` represent the average of the per-label statistics.
Here is a sample output, with `recallOverN` set to `[3, 5]`, to calculate the recall over the top 3 and top 5 labels, respectively:
```json
{
    "status": {
        "weightedStatistics": {
            "recallOverTopN": [0.6666666666666666, 1.0]
        },
        "labelStatistics": {
            "label0": {
                "recallOverTopN": [0.3333333333333333, 1.0]
            },
            "label1": {
                "recallOverTopN": [0.6666666666666666, 1.0]
            },
            "label2": {
                "recallOverTopN": [1.0, 1.0]
            }
        }
    },
    "state": "finished"
}
```
This indicates that, in this example, most true-positive labels are found within the top 3, and all are found within the top 5. For `label2`, they are always found within the top 3.
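To make the recall-over-top-N computation concrete, here is a small sketch on made-up data (not the data behind the sample above): for each example, a true label counts as recalled if it appears among the N highest-scoring labels returned by the classifier.

```python
def recall_over_top_n(examples, n):
    """Fraction of true labels found among each example's top-n scored labels.

    examples: list of (true_labels, scores) pairs, where scores maps each
    candidate label to the score the classifier assigned to it.
    """
    hits, total = 0, 0
    for true_labels, scores in examples:
        top_n = sorted(scores, key=scores.get, reverse=True)[:n]
        hits += sum(1 for label in true_labels if label in top_n)
        total += len(true_labels)
    return hits / total

# One example with two true labels and scores over four candidate labels.
examples = [
    ({"a", "b"}, {"a": 0.9, "b": 0.2, "c": 0.6, "d": 0.5}),
]

print(recall_over_top_n(examples, 3))  # 0.5: only "a" is in the top 3
print(recall_over_top_n(examples, 5))  # 1.0: both labels fall within the top 5
```

With `n = 1`, at most one of the two true labels could ever be recalled, which is why a recall of `1.0` is unreachable when an example contains more unique labels than the chosen value of N.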
The output dataset created by this procedure in `multilabel` mode will contain one row per input row, with the same row name as in the input, and the following columns:

* `label`: the row's actual label
* `weight`: the row's assigned weight
* `score.x`: the score the classifier assigned to this row for label `x`
* `maxLabel`: the label with the maximum score

See also:

* The `classifier.train` procedure type trains a classifier.
* The `classifier.test` procedure type allows the accuracy of a predictor to be tested against held-out data.
* The `probabilizer.train` procedure type trains a probabilizer.
* The `classifier` function type applies a classifier to a feature vector, producing a classification score.
* The `classifier.explain` function type explains how a classifier produced its output.
* The `probabilizer` function type works with `classifier.apply` to convert scores to probabilities.