The classifier testing procedure allows the accuracy of a binary classifier, multi-class classifier or regressor to be tested against held-out data. The output of this procedure is a dataset which contains the scores and statistics resulting from the application of a classifier to some input data.

A new procedure of type `classifier.test` named `<id>` can be created as follows:

```
mldb.put("/v1/procedures/"+<id>, {
    "type": "classifier.test",
    "params": {
        "mode": <ClassifierMode>,
        "testingData": <InputQuery>,
        "outputDataset": <OutputDatasetSpec (Optional)>,
        "uniqueScoresOnly": <bool>,
        "recallOverN": <ARRAY [ int ]>,
        "runOnCreation": <bool>
    }
})
```

with the following key-value definitions for `params`:

| Field | Type | Description |
|---|---|---|
| `mode` | `ClassifierMode` | Model mode: `boolean`, `categorical`, `regression` or `multilabel`. Controls how `label` is interpreted and which statistics are produced. |
| `testingData` | `InputQuery` | SQL query which specifies the scores, labels and optional weights for evaluation. The select expression must contain these two columns: `score`, a scalar expression which evaluates to the score the classifier has assigned to the given row, and `label`, a scalar expression identifying the row's label, whose type must match that of the classifier mode: a boolean (0 or 1) in `boolean` mode, a real number in `regression` mode, and any combination of numbers and strings in `categorical` mode. Rows with null labels will be ignored. The select expression can also contain an optional `weight` column giving the relative importance of each example. The query must not contain `GROUP BY` or `HAVING` clauses. |
| `outputDataset` | `OutputDatasetSpec` (optional) | Output dataset for scored examples. The score for each test example will be written to this dataset. Examples get grouped into a single row when they have the same score and `uniqueScoresOnly` is `true`. |
| `uniqueScoresOnly` | `bool` | If `outputDataset` is set and `mode` is `boolean`, setting this to `true` outputs one row per unique score instead of one row per example. |
| `recallOverN` | `ARRAY [ int ]` | Calculate a recall score over the top scoring labels. Does not apply to `boolean` or `regression` modes. |
| `runOnCreation` | `bool` | If `true`, the procedure will be run immediately upon creation, and the response will contain an extra field describing that first run. |
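As an illustration, here is how such a configuration might be assembled in `boolean` mode. This is a sketch: the dataset name `titanic_test`, the scoring function `classifier_fn`, and the output dataset name are hypothetical; only the procedure type and parameter names come from the specification above.

```python
import json

# Hypothetical configuration: "titanic_test", "classifier_fn" and the
# output dataset name are illustrative, not part of the procedure spec.
config = {
    "type": "classifier.test",
    "params": {
        "mode": "boolean",
        "testingData": """
            SELECT classifier_fn({features: {*}})[score] AS score,
                   survived AS label
            FROM titanic_test
        """,
        "outputDataset": {"id": "titanic_test_scored", "type": "sparse.mutable"},
        "runOnCreation": True
    }
}

# With an MLDB client this would be posted as:
# mldb.put("/v1/procedures/titanic_test_run", config)
print(json.dumps(config, indent=2))
```

Note that `testingData` selects exactly the required `score` and `label` columns, with no `GROUP BY` or `HAVING` clause.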

After this procedure has been run, a summary of the accuracy can be obtained via:

```
mldb.get("/v1/procedures/<id>/runs/<runid>")
```

The `status` field will contain statistics relevant to the model's mode.

In `boolean` mode, the `status` field will contain the Area Under the Curve under the key `auc`, along with the performance statistics (e.g. precision, recall) for the classifier using the thresholds which give the best MCC and the best F-score, under the keys `bestMcc` and `bestF1Score`, respectively.

Here is a sample output:

```
{
    "status": {
        "bestMcc": {
            "pr": {
                "recall": 0.6712328767123288,
                "precision": 0.8448275862068966,
                "f1Score": 0.7480916030534351,
                "accuracy": 0.8196721311
            },
            "mcc": 0.6203113512927362,
            "gain": 2.117855455833727,
            "threshold": 0.6341791749000549,
            "counts": {
                "falseNegatives": 24.0,
                "truePositives": 49.0,
                "trueNegatives": 101.0,
                "falsePositives": 9.0
            },
            "population": {
                "included": 58.0,
                "excluded": 125.0
            }
        },
        "auc": 0.8176836861768365,
        "bestF1Score": {
            "pr": {
                "recall": 0.6712328767123288,
                "precision": 0.8448275862068966,
                "f1Score": 0.7480916030534351,
                "accuracy": 0.8196721311
            },
            "mcc": 0.6203113512927362,
            "gain": 2.117855455833727,
            "threshold": 0.6341791749000549,
            "counts": {
                "falseNegatives": 24.0,
                "truePositives": 49.0,
                "trueNegatives": 101.0,
                "falsePositives": 9.0
            },
            "population": {
                "included": 58.0,
                "excluded": 125.0
            }
        }
    },
    "state": "finished"
}
```
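All of the `pr` statistics and the MCC above are derived from the four entries under `counts`. They can be checked with a few lines of plain Python, independent of MLDB:

```python
from math import sqrt

# Counts reported under status.bestMcc.counts in the sample output
tp, tn, fp, fn = 49.0, 101.0, 9.0, 24.0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Matthews correlation coefficient
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(precision, recall, f1, accuracy, mcc)
```

The values match the `pr` and `mcc` fields of the sample output.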

The `output` dataset created by this procedure in `boolean` mode will contain one row per score, grouping together test set rows with the same score. The dataset will have the following columns:

- `score`: the score the classifier assigned to this row
- `label`: the row's actual label
- `weight`: the row's assigned weight
- classifier attributes if this row's score was used as a binary threshold: `falseNegatives`, `trueNegatives`, `falsePositives`, `truePositives`, `falsePositiveRate`, `truePositiveRate`, `precision`, `recall`, `accuracy`

Note that rows with the same score get grouped together.
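To make the threshold columns concrete, here is a small sketch (plain Python, toy data) that computes one row per unique score, treating each score as the decision threshold. The convention used below, predicting positive when `score >= threshold`, is an assumption and may differ in detail from MLDB's implementation.

```python
# Toy (score, label) pairs; labels are 1 (positive) or 0 (negative).
examples = [(0.9, 1), (0.8, 0), (0.8, 1), (0.4, 1), (0.2, 0)]

pos = sum(lbl for _, lbl in examples)
neg = len(examples) - pos

rows = []
for threshold in sorted({s for s, _ in examples}, reverse=True):
    # Predict positive when score >= threshold (assumed convention)
    tp = sum(1 for s, l in examples if s >= threshold and l == 1)
    fp = sum(1 for s, l in examples if s >= threshold and l == 0)
    fn, tn = pos - tp, neg - fp
    rows.append({
        "score": threshold,
        "truePositives": tp, "falsePositives": fp,
        "trueNegatives": tn, "falseNegatives": fn,
        "truePositiveRate": tp / pos, "falsePositiveRate": fp / neg,
        "precision": tp / (tp + fp), "recall": tp / pos,
        "accuracy": (tp + tn) / (pos + neg),
    })

for row in rows:
    print(row)
```

Rows with the same score (here the two examples scored 0.8) collapse into a single threshold row, as in the output dataset.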

In `categorical` mode, the `status` field will contain a sparse confusion matrix along with performance statistics (e.g. precision, recall) for the classifier, where the label with the maximum score is chosen as the prediction for each example.

The value of `support` is the number of occurrences of that label. The `weightedStatistics` are the averages of the per-label statistics, weighted by each label's support value; `support` itself is summed rather than averaged.

Here is a sample output:

```
{
    "status": {
        "labelStatistics": {
            "0": {
                "f1Score": 0.8000000143051146,
                "recall": 1.0,
                "support": 2,
                "precision": 0.6666666865348816,
                "accuracy": 1.0
            },
            "1": {
                "f1Score": 0.0,
                "recall": 0.0,
                "support": 1,
                "precision": 0.0,
                "accuracy": 0.0
            },
            "2": {
                "f1Score": 1.0,
                "recall": 1.0,
                "support": 2,
                "precision": 1.0,
                "accuracy": 1.0
            }
        },
        "weightedStatistics": {
            "f1Score": 0.7200000057220459,
            "recall": 0.8,
            "support": 5,
            "precision": 0.6666666746139527,
            "accuracy": 0.8
        },
        "confusionMatrix": [
            {"predicted": "0", "actual": "1", "count": 1},
            {"predicted": "0", "actual": "0", "count": 2},
            {"predicted": "2", "actual": "2", "count": 2}
        ]
    },
    "state": "finished"
}
```
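The `labelStatistics` and `weightedStatistics` above can be reproduced from the sparse `confusionMatrix` alone. The following plain-Python sketch does so (the per-label `accuracy` field is left out, since its exact definition is not spelled out here):

```python
from collections import defaultdict

# Sparse confusion matrix from the sample output above
confusion = [
    {"predicted": "0", "actual": "1", "count": 1},
    {"predicted": "0", "actual": "0", "count": 2},
    {"predicted": "2", "actual": "2", "count": 2},
]

support = defaultdict(int)    # occurrences of each actual label
predicted = defaultdict(int)  # times each label was predicted
correct = defaultdict(int)    # diagonal of the matrix
for cell in confusion:
    support[cell["actual"]] += cell["count"]
    predicted[cell["predicted"]] += cell["count"]
    if cell["predicted"] == cell["actual"]:
        correct[cell["actual"]] += cell["count"]

labels = sorted(support)
stats = {}
for lbl in labels:
    precision = correct[lbl] / predicted[lbl] if predicted[lbl] else 0.0
    recall = correct[lbl] / support[lbl]
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    stats[lbl] = {"precision": precision, "recall": recall,
                  "f1Score": f1, "support": support[lbl]}

total = sum(support.values())
# Weighted statistics: support-weighted average; support itself is summed
weighted = {
    k: sum(stats[l][k] * stats[l]["support"] for l in labels) / total
    for k in ("precision", "recall", "f1Score")
}
weighted["support"] = total
print(stats)
print(weighted)
```

The results agree with the sample (up to the single-precision rounding visible in its `precision` values).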

The `output` dataset created by this procedure in `categorical` mode will contain one row per input row, with the same row name as in the input, and the following columns:

- `label`: the row's actual label
- `weight`: the row's assigned weight
- `score.x`: the score the classifier assigned to this row for label `x`
- `maxLabel`: the label with the maximum score

In `regression` mode, the `status` field will contain the following performance statistics:

- mean squared error (MSE)
- R squared score
- quantiles of errors: more robust to outliers than MSE. Given \( y_i \) the true
value and \( \hat{y}_i \) the predicted value, we return the 25th, 50th, 75th, and 90th
percentiles of \( | y_i - \hat{y}_i | / y_i \;\forall i \). The 50th percentile is
the median and represents the *median absolute percentage error* (MAPE).

Here is a sample output:

```
{
    "status": {
        "quantileErrors": {
            "0.25": 0.0,
            "0.5": 0.1428571428571428,
            "0.75": 0.1666666666666667,
            "0.9": 0.1666666666666667
        },
        "mse": 0.375,
        "r2": 0.9699681653424412
    },
    "state": "finished"
}
```
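These statistics are straightforward to reproduce on toy data. The plain-Python sketch below uses illustrative values unrelated to the sample above, and stops at the raw relative errors since the quantile interpolation rule MLDB applies is not specified here:

```python
# Toy true and predicted values (illustrative only)
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.0, 3.5, 6.5, 8.0]

n = len(y_true)

# Mean squared error
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# R squared: 1 - residual sum of squares / total sum of squares
mean = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

# Relative absolute errors |y - y_hat| / y, whose quantiles are reported
rel_errors = sorted(abs(t - p) / t for t, p in zip(y_true, y_pred))
print(mse, r2, rel_errors)
```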

The `output` dataset created by this procedure in `regression` mode will contain one row per input row, with the same row name as in the input, and the following columns:

- `label`: the row's actual value
- `score`: the predicted value
- `weight`: the row's assigned weight

In `multilabel` mode, the `status` field will contain the recall-among-top-N performance statistic for the classifier, where the labels with the highest scores are chosen as the predictions for each example. That is, each example registers a positive for each of its labels that is found among the set of highest-scoring labels returned by the classifier. The size of this set is determined by the `recallOverN` parameter. It follows that a `1.0` recall rate cannot be obtained if any example contains more unique labels than the value of the `recallOverN` parameter.

The `weightedStatistics` represent the average of the per-label statistics.

Here is a sample output, with `recallOverN` set to `[3, 5]` to calculate the recall over the top 3 and top 5 labels, respectively:

```
{
    "status": {
        "weightedStatistics": {
            "recallOverTopN": [0.6666666666666666, 1.0]
        },
        "labelStatistics": {
            "label0": {
                "recallOverTopN": [0.3333333333333333, 1.0]
            },
            "label1": {
                "recallOverTopN": [0.6666666666666666, 1.0]
            },
            "label2": {
                "recallOverTopN": [1.0, 1.0]
            }
        }
    },
    "state": "finished"
}
```

This indicates that in this example most true-positive labels are found within the top 3, and all within the top 5. For `label2`, they are always found within the top 3.
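The recall-over-top-N aggregation described above can be sketched as follows (plain Python, toy data; MLDB's per-label breakdown and weighting are not reproduced):

```python
# Each example: (set of true labels, score the classifier gave each label)
examples = [
    ({"a"},      {"a": 0.9, "b": 0.5, "c": 0.1, "d": 0.0}),
    ({"b", "c"}, {"a": 0.8, "b": 0.6, "c": 0.3, "d": 0.1}),
]
recall_over_n = [2, 3]  # analogous to the recallOverN parameter

results = []
for n in recall_over_n:
    hits = total = 0
    for true_labels, scores in examples:
        top_n = sorted(scores, key=scores.get, reverse=True)[:n]
        # Each true label found among the top-n scores counts as a positive
        hits += sum(1 for lbl in true_labels if lbl in top_n)
        total += len(true_labels)
    results.append(hits / total)

print(results)
```

Here the second example's label `c` is only reached at N=3, so recall rises from 2/3 over the top 2 to 1.0 over the top 3.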

The `output` dataset created by this procedure in `multilabel` mode will contain one row per input row, with the same row name as in the input, and the following columns:

- `label`: the row's actual label
- `weight`: the row's assigned weight
- `score.x`: the score the classifier assigned to this row for label `x`
- `maxLabel`: the label with the maximum score

Examples:

- Boolean mode: the Predicting Titanic Survival demo notebook
- Categorical mode: the Procedures and Functions Tutorial notebook

See also:

- The `classifier.train` procedure type trains a classifier.
- The `classifier.test` procedure type allows the accuracy of a predictor to be tested against held-out data.
- The `probabilizer.train` procedure type trains a probabilizer.
- The `classifier` function type applies a classifier to a feature vector, producing a classification score.
- The `classifier.explain` function type explains how a classifier produced its output.
- The `probabilizer` function type works with `classifier.apply` to convert scores to probabilities.