Classifier Testing Procedure

The classifier testing procedure allows the accuracy of a binary classifier, multi-class classifier or regressor to be tested against held-out data. The output of this procedure is a dataset which contains the scores and statistics resulting from the application of a classifier to some input data.

Configuration

A new procedure of type classifier.test named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "classifier.test",
    "params": {
        "mode": <ClassifierMode>,
        "testingData": <InputQuery>,
        "outputDataset": <OutputDatasetSpec (Optional)>,
        "uniqueScoresOnly": <bool>,
        "recallOverN": <ARRAY [ int ]>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

mode
ClassifierMode
"boolean"

Model mode: boolean, regression or categorical. Controls how the label is interpreted and what the classifier outputs. This must match the mode used during training.

testingData
InputQuery

SQL query which specifies the scores, labels and optional weights for evaluation. The query is usually of the form: select classifier_function({features: {f1, f2}})[score] as score, x as label from ds.

The select expression must contain these two columns:

  • score: one scalar expression which evaluates to the score a classifier has assigned to the given row, and
  • label: one scalar expression to identify the row's label, and whose type must match that of the classifier mode. Rows with null labels will be ignored.
    • boolean mode: a boolean (0 or 1)
    • regression mode: a real number
    • categorical mode: any combination of numbers and strings

The select expression can contain an optional weight column. The weight allows the relative importance of examples to be set. It must be a real number. If the weight is not specified, each row will have a weight of 1. Rows with a null weight will cause an error.

The query must not contain GROUP BY or HAVING clauses.

outputDataset
OutputDatasetSpec (Optional)
{"type":"tabular"}

Output dataset for scored examples. The score for each test example will be written to this dataset. When mode is boolean, examples with the same score are grouped together. Specifying a dataset is optional.

uniqueScoresOnly
bool
false

If outputDataset is set and mode is set to boolean, setting this parameter to true will output a single row per unique score. This is useful if the test set is very large and aggregate statistics for each unique score are sufficient, for instance to generate a ROC curve. This has no effect for other values of mode.

recallOverN
ARRAY [ int ]

Calculate a recall score over the N top-scoring labels, for each N in the array. Does not apply to boolean or regression modes.

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
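For concreteness, here is a complete configuration in boolean mode. This is a minimal sketch: the ids my_test and test_results, the dataset test_ds, the function my_classifier and the columns f1, f2 and target are illustrative assumptions, not names defined by this procedure.

mldb.put("/v1/procedures/my_test", {
    "type": "classifier.test",
    "params": {
        "mode": "boolean",
        "testingData": """
            SELECT my_classifier({features: {f1, f2}})[score] AS score,
                   target AS label
            FROM test_ds
        """,
        "outputDataset": {"id": "test_results", "type": "tabular"},
        "runOnCreation": True
    }
})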

Output

After this procedure has been run, a summary of the accuracy can be obtained via

mldb.get("/v1/procedures/<id>/runs/<runid>")

The status field will contain statistics relevant to the model's mode.

Boolean mode

The status field will contain the Area Under the Curve under the key auc, along with the performance statistics (e.g. precision, recall) for the classifier using the thresholds that give the best MCC and the best F1 score, under the keys bestMcc and bestF1Score, respectively.

Here is a sample output:

{
  "status": {
    "bestMcc": {
      "pr": {
        "recall": 0.6712328767123288, 
        "precision": 0.8448275862068966, 
        "f1Score": 0.7480916030534351,
        "accuracy": 0.8196721311
      }, 
      "mcc": 0.6203113512927362, 
      "gain": 2.117855455833727, 
      "threshold": 0.6341791749000549, 
      "counts": {
        "falseNegatives": 24.0, 
        "truePositives": 49.0, 
        "trueNegatives": 101.0, 
        "falsePositives": 9.0
      }, 
      "population": {
        "included": 58.0, 
        "excluded": 125.0
      }
    }, 
    "auc": 0.8176836861768365, 
    "bestF1Score": {
      "pr": {
        "recall": 0.6712328767123288, 
        "precision": 0.8448275862068966, 
        "f1Score": 0.7480916030534351,
        "accuracy": 0.8196721311
      }, 
      "mcc": 0.6203113512927362, 
      "gain": 2.117855455833727, 
      "threshold": 0.6341791749000549, 
      "counts": {
        "falseNegatives": 24.0, 
        "truePositives": 49.0, 
        "trueNegatives": 101.0, 
        "falsePositives": 9.0
      }, 
      "population": {
        "included": 58.0, 
        "excluded": 125.0
      }
    }
  }, 
  "state": "finished"
}
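Each entry can be recomputed from its counts block. As a sanity check, here is a short Python sketch of the standard formulas applied to the sample above; gain is computed here as precision divided by the base positive rate, which reproduces the sample's value:

from math import sqrt

# Counts taken from the "counts" block in the sample output above
tp, tn, fp, fn = 49.0, 101.0, 9.0, 24.0
total = tp + tn + fp + fn

precision = tp / (tp + fp)                                # 0.84482...
recall = tp / (tp + fn)                                   # 0.67123...
f1_score = 2 * precision * recall / (precision + recall)  # 0.74809...
accuracy = (tp + tn) / total                              # 0.81967...
mcc = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))        # 0.62031...
gain = precision / ((tp + fn) / total)                    # 2.11785...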

The output dataset created by this procedure in boolean mode will contain one row per unique score, grouping together test-set rows that share the same score. The dataset will have the following columns:
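Once the run has finished, the scored rows can be inspected with a query. A minimal sketch, assuming pymldb's mldb.query and the output dataset test_results from the configuration example above:

# Top five unique scores and their aggregate statistics
mldb.query("SELECT * FROM test_results ORDER BY score DESC LIMIT 5")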

Categorical mode

The status field will contain a sparse confusion matrix along with performance statistics (e.g. precision, recall) for the classifier, where the label with the maximum score will be chosen as the prediction for each example.

The value of support is the number of occurrences of that label in the test set. The weightedStatistics are the per-label statistics averaged with each label's support as the weight, except for support itself, which is summed.

Here is a sample output:

{
    "status": {
        "labelStatistics": {
            "0": {
                "f1Score": 0.8000000143051146,
                "recall": 1.0,
                "support": 2,
                "precision": 0.6666666865348816,
                "accuracy": 1.0
            },
            "1": {
                "f1Score": 0.0,
                "recall": 0.0,
                "support": 1,
                "precision": 0.0,
                "accuracy": 0.0
            },
            "2": {
                "f1Score": 1.0,
                "recall": 1.0,
                "support": 2,
                "precision": 1.0,
                "accuracy": 1.0
            }
        },
        "weightedStatistics": {
            "f1Score": 0.7200000057220459,
            "recall": 0.8,
            "support": 5,
            "precision": 0.6666666746139527,
            "accuracy": 0.8
        },
        "confusionMatrix": [
            {"predicted": "0", "actual": "1", "count": 1},
            {"predicted": "0", "actual": "0", "count": 2},
            {"predicted": "2", "actual": "2", "count": 2}
        ]
    },
    "state": "finished"
}
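The weighting can be verified against the sample above. A short Python sketch recomputing the weighted f1Score and support from the per-label statistics:

# Per-label (f1Score, support) pairs from the sample output above
label_stats = {"0": (0.8, 2), "1": (0.0, 1), "2": (1.0, 2)}

total_support = sum(s for _, s in label_stats.values())  # 5
weighted_f1 = sum(f * s for f, s in label_stats.values()) / total_support
print(weighted_f1)  # 0.72, matching weightedStatistics.f1Score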

The output dataset created by this procedure in categorical mode will contain one row per input row, with the same row name as in the input, and the following columns:

Regression mode

The status field will contain the following performance statistics, as shown in the sample below: mse (the mean squared error), r2 (the R² score) and quantileErrors (the error at a set of quantiles, keyed by quantile).

Here is a sample output:

{
    "status": {
        "quantileErrors": {
            "0.25": 0.0,
            "0.5": 0.1428571428571428,
            "0.75": 0.1666666666666667,
            "0.9": 0.1666666666666667
        },
        "mse": 0.375,
        "r2": 0.9699681653424412
    },
    "state": "finished"
}
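These statistics can be pulled out of the run response programmatically. A minimal sketch, assuming the procedure id my_test from the earlier example and a run id of 1; note that RMSE is derived from mse and is not part of the response itself:

from math import sqrt

run = mldb.get("/v1/procedures/my_test/runs/1").json()
stats = run["status"]

rmse = sqrt(stats["mse"])                    # 0.6124 for the sample above
r2 = stats["r2"]                             # closer to 1.0 is better
median_error = stats["quantileErrors"]["0.5"]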

The output dataset created by this procedure in regression mode will contain one row per input row, with the same row name as in the input, and the following columns:

Multi-label mode

The status field will contain the recall-over-top-N performance statistic for the classifier, where the labels with the highest scores are chosen as the predictions for each example. That is, each example registers a positive for each of its labels that is found among the set of highest-scoring labels returned by the classifier. The size of this set is determined by the recallOverN parameter. It follows that a recall of 1.0 cannot be obtained if any example contains more unique labels than the corresponding value in recallOverN.

The weightedStatistics are the average of the per-label statistics.

Here is a sample output, with recallOverN set to [3, 5], to calculate the recall over the top 3 and 5 best labels, respectively:

{
    "status": {
                "weightedStatistics": {
                    "recallOverTopN": [0.6666666666666666, 1.0]
                },
                "labelStatistics": {
                    "label0": {
                        "recallOverTopN": [0.3333333333333333, 1.0]
                    },
                    "label1": {
                        "recallOverTopN": [0.6666666666666666, 1.0]
                    },
                    "label2": {
                        "recallOverTopN": [1.0, 1.0]
                    }
                }
            },
    "state": "finished"
}

This indicates that in this example most true labels are found within the top 3 highest-scoring predictions, and all of them within the top 5. For label2, they are always found within the top 3.
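To make the statistic concrete, here is a hedged Python sketch of the recall-over-top-N computation as described above; the labels and scores are made up for illustration:

def recall_over_top_n(true_labels, scores, n):
    """Fraction of an example's true labels found among the
    n highest-scoring labels returned by the classifier."""
    top_n = sorted(scores, key=scores.get, reverse=True)[:n]
    hits = sum(1 for label in true_labels if label in top_n)
    return hits / len(true_labels)

# Hypothetical example: two true labels, scores over five labels
scores = {"label0": 0.9, "label1": 0.7, "label2": 0.4,
          "label3": 0.2, "label4": 0.1}
print(recall_over_top_n({"label0", "label2"}, scores, 3))  # 1.0
print(recall_over_top_n({"label0", "label2"}, scores, 2))  # 0.5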

The output dataset created by this procedure in multi-label mode will contain one row per input row, with the same row name as in the input, and the following columns:

Examples

See also