Classifier Testing Procedure

The classifier testing procedure allows the accuracy of a binary classifier, multi-class classifier or regressor to be tested against held-out data. The output of this procedure is a dataset which contains the scores and statistics resulting from the application of a classifier to some input data.

Configuration

A new procedure of type classifier.test named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "classifier.test",
    "params": {
        "mode": <ClassifierMode>,
        "testingData": <InputQuery>,
        "outputDataset": <OutputDatasetSpec (Optional)>,
        "uniqueScoresOnly": <bool>,
        "recallOverN": <ARRAY [ int ]>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

mode
ClassifierMode
"boolean"

Model mode: boolean, regression or categorical. Controls how the label is interpreted and what the classifier outputs. This must match the mode used during training.

testingData
InputQuery

SQL query which specifies the scores, labels and optional weights for evaluation. The query is usually of the form: select classifier_function({features: {f1, f2}})[score] as score, x as label from ds.

The select expression must contain these two columns:

  • score: one scalar expression which evaluates to the score a classifier has assigned to the given row, and
  • label: one scalar expression to identify the row's label, and whose type must match that of the classifier mode. Rows with null labels will be ignored.
    • boolean mode: a boolean (0 or 1)
    • regression mode: a real number
    • categorical mode: any combination of numbers and strings

The select expression can contain an optional weight column. The weight allows the relative importance of examples to be set. It must be a real number. If the weight is not specified, each row will have a weight of 1. Rows with a null weight will cause an error.

The query must not contain GROUP BY or HAVING clauses.

outputDataset
OutputDatasetSpec (Optional)
{"type":"tabular"}

Output dataset for scored examples. The score for each test example will be written to this dataset. When mode is boolean, examples with the same score are grouped together. Specifying a dataset is optional.

uniqueScoresOnly
bool
false

If outputDataset is set and mode is set to boolean, setting this parameter to true will output a single row per unique score. This is useful if the test set is very large and aggregate statistics for each unique score are sufficient, for instance to generate a ROC curve. This has no effect for other values of mode.

recallOverN
ARRAY [ int ]

Calculate a recall score over the N top-scoring labels, for each N in the array. Does not apply to boolean or regression modes.

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
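For concreteness, here is a complete configuration in boolean mode. This is a minimal sketch: the ids my_test and test_results, the dataset test_ds, the function my_classifier and the columns f1, f2 and target are illustrative assumptions, not names defined by this procedure.

mldb.put("/v1/procedures/my_test", {
    "type": "classifier.test",
    "params": {
        "mode": "boolean",
        "testingData": """
            SELECT my_classifier({features: {f1, f2}})[score] AS score,
                   target AS label
            FROM test_ds
        """,
        "outputDataset": {"id": "test_results", "type": "tabular"},
        "runOnCreation": True
    }
})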

Output

After this procedure has been run, a summary of the accuracy can be obtained via

mldb.get("/v1/procedures/<id>/runs/<runid>")

The status field will contain statistics relevant to the model's mode.

Boolean mode

The status field will contain the Area Under the Curve under the key auc, along with the performance statistics (e.g. precision, recall) for the classifier using the thresholds that give the best MCC and the best F1 score, under the keys bestMcc and bestF1Score, respectively.

Here is a sample output:

{
  "status": {
    "bestMcc": {
      "pr": {
        "recall": 0.6712328767123288, 
        "precision": 0.8448275862068966, 
        "f1Score": 0.7480916030534351,
        "accuracy": 0.8196721311
      }, 
      "mcc": 0.6203113512927362, 
      "gain": 2.117855455833727, 
      "threshold": 0.6341791749000549, 
      "counts": {
        "falseNegatives": 24.0, 
        "truePositives": 49.0, 
        "trueNegatives": 101.0, 
        "falsePositives": 9.0
      }, 
      "population": {
        "included": 58.0, 
        "excluded": 125.0
      }
    }, 
    "auc": 0.8176836861768365, 
    "bestF1Score": {
      "pr": {
        "recall": 0.6712328767123288, 
        "precision": 0.8448275862068966, 
        "f1Score": 0.7480916030534351,
        "accuracy": 0.8196721311
      }, 
      "mcc": 0.6203113512927362, 
      "gain": 2.117855455833727, 
      "threshold": 0.6341791749000549, 
      "counts": {
        "falseNegatives": 24.0, 
        "truePositives": 49.0, 
        "trueNegatives": 101.0, 
        "falsePositives": 9.0
      }, 
      "population": {
        "included": 58.0, 
        "excluded": 125.0
      }
    }
  }, 
  "state": "finished"
}
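Each entry can be recomputed from its counts block. As a sanity check, here is a short Python sketch of the standard formulas applied to the sample above; gain is computed here as precision divided by the base positive rate, which reproduces the sample's value:

from math import sqrt

# Counts taken from the "counts" block in the sample output above
tp, tn, fp, fn = 49.0, 101.0, 9.0, 24.0
total = tp + tn + fp + fn

precision = tp / (tp + fp)                                # 0.84482...
recall = tp / (tp + fn)                                   # 0.67123...
f1_score = 2 * precision * recall / (precision + recall)  # 0.74809...
accuracy = (tp + tn) / total                              # 0.81967...
mcc = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))        # 0.62031...
gain = precision / ((tp + fn) / total)                    # 2.11785...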

The output dataset created by this procedure in boolean mode will contain one row per unique score, grouping together test-set rows that share the same score. The dataset will have the following columns:
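Once the run has finished, the scored rows can be inspected with a query. A minimal sketch, assuming pymldb's mldb.query and the output dataset test_results from the configuration example above:

# Top five unique scores and their aggregate statistics
mldb.query("SELECT * FROM test_results ORDER BY score DESC LIMIT 5")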

Categorical mode

The status field will contain a sparse confusion matrix along with performance statistics (e.g. precision, recall) for the classifier, where the label with the maximum score will be chosen as the prediction for each example.

The value of support is the number of occurrences of that label in the test set. The weightedStatistics are the per-label statistics averaged with each label's support as the weight, except for support itself, which is summed.

Here is a sample output:

{
    "status": {
        "labelStatistics": {
            "0": {
                "f1Score": 0.8000000143051146,
                "recall": 1.0,
                "support": 2,
                "precision": 0.6666666865348816,
                "accuracy": 1.0
            },
            "1": {
                "f1Score": 0.0,
                "recall": 0.0,
                "support": 1,
                "precision": 0.0,
                "accuracy": 0.0
            },
            "2": {
                "f1Score": 1.0,
                "recall": 1.0,
                "support": 2,
                "precision": 1.0,
                "accuracy": 1.0
            }
        },
        "weightedStatistics": {
            "f1Score": 0.7200000057220459,
            "recall": 0.8,
            "support": 5,
            "precision": 0.6666666746139527,
            "accuracy": 0.8
        },
        "confusionMatrix": [
            {"predicted": "0", "actual": "1", "count": 1},
            {"predicted": "0", "actual": "0", "count": 2},
            {"predicted": "2", "actual": "2", "count": 2}
        ]
    },
    "state": "finished"
}
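The weighting can be verified against the sample above. A short Python sketch recomputing the weighted f1Score and support from the per-label statistics:

# Per-label (f1Score, support) pairs from the sample output above
label_stats = {"0": (0.8, 2), "1": (0.0, 1), "2": (1.0, 2)}

total_support = sum(s for _, s in label_stats.values())  # 5
weighted_f1 = sum(f * s for f, s in label_stats.values()) / total_support
print(weighted_f1)  # 0.72, matching weightedStatistics.f1Score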

The output dataset created by this procedure in categorical mode will contain one row per input row, with the same row name as in the input, and the following columns:

Regression mode

The status field will contain the following performance statistics, as shown in the sample below: mse (the mean squared error), r2 (the R² score) and quantileErrors (the error at a set of quantiles, keyed by quantile).

Here is a sample output:

{
    "status": {
        "quantileErrors": {
            "0.25": 0.0,
            "0.5": 0.1428571428571428,
            "0.75": 0.1666666666666667,
            "0.9": 0.1666666666666667
        },
        "mse": 0.375,
        "r2": 0.9699681653424412
    },
    "state": "finished"
}
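These statistics can be pulled out of the run response programmatically. A minimal sketch, assuming the procedure id my_test from the earlier example and a run id of 1; note that RMSE is derived from mse and is not part of the response itself:

from math import sqrt

run = mldb.get("/v1/procedures/my_test/runs/1").json()
stats = run["status"]

rmse = sqrt(stats["mse"])                    # 0.6124 for the sample above
r2 = stats["r2"]                             # closer to 1.0 is better
median_error = stats["quantileErrors"]["0.5"]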

The output dataset created by this procedure in regression mode will contain one row per input row, with the same row name as in the input, and the following columns:

Multi-label mode

The status field will contain the recall-over-top-N performance statistic for the classifier, where the labels with the highest scores are chosen as the predictions for each example. That is, each example registers a positive for each of its labels that is found among the set of highest-scoring labels returned by the classifier. The size of this set is determined by the recallOverN parameter. It follows that a recall of 1.0 cannot be obtained if any example contains more unique labels than the corresponding value in recallOverN.

The weightedStatistics are the average of the per-label statistics.

Here is a sample output, with recallOverN set to [3, 5], to calculate the recall over the top 3 and 5 best labels, respectively:

{
    "status": {
                "weightedStatistics": {
                    "recallOverTopN": [0.6666666666666666, 1.0]
                },
                "labelStatistics": {
                    "label0": {
                        "recallOverTopN": [0.3333333333333333, 1.0]
                    },
                    "label1": {
                        "recallOverTopN": [0.6666666666666666, 1.0]
                    },
                    "label2": {
                        "recallOverTopN": [1.0, 1.0]
                    }
                }
            },
    "state": "finished"
}

This indicates that in this example most true labels are found within the top 3 highest-scoring predictions, and all of them within the top 5. For label2, they are always found within the top 3.
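To make the statistic concrete, here is a hedged Python sketch of the recall-over-top-N computation as described above; the labels and scores are made up for illustration:

def recall_over_top_n(true_labels, scores, n):
    """Fraction of an example's true labels found among the
    n highest-scoring labels returned by the classifier."""
    top_n = sorted(scores, key=scores.get, reverse=True)[:n]
    hits = sum(1 for label in true_labels if label in top_n)
    return hits / len(true_labels)

# Hypothetical example: two true labels, scores over five labels
scores = {"label0": 0.9, "label1": 0.7, "label2": 0.4,
          "label3": 0.2, "label4": 0.1}
print(recall_over_top_n({"label0", "label2"}, scores, 3))  # 1.0
print(recall_over_top_n({"label0", "label2"}, scores, 2))  # 0.5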

The output dataset created by this procedure in multi-label mode will contain one row per input row, with the same row name as in the input, and the following columns:

Examples

See also