Summary Statistics Procedure

This procedure generates summary statistics for every column selected by inputData. Each input column becomes one row in the outputDataset.

Statistics for numeric columns

minimum and maximum values
mean
1st quartile, median, and 3rd quartile
number of unique values
number of null values
most frequent items

Statistics for categorical columns

Mixed or non-numeric columns are treated as categorical. The statistics for these columns are:

number of unique values
number of null values
most frequent items
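
To make the two sets of statistics concrete, here is a minimal pure-Python sketch that computes them for an in-memory table. It is only an illustration of what each statistic means, not MLDB's implementation: the function names, the output keys, the `top_n` parameter, the sample data, and the choice of the "inclusive" quartile method are all assumptions, and MLDB may compute quartiles differently.

```python
from collections import Counter
from statistics import mean, quantiles


def summarize_column(values, top_n=3):
    """Summarize one column given as a list of values (None stands for null)."""
    present = [v for v in values if v is not None]
    counts = Counter(present)

    # Statistics reported for every column, numeric or categorical.
    summary = {
        "num_unique": len(counts),
        "num_null": len(values) - len(present),
        "most_frequent": counts.most_common(top_n),
    }

    # A column counts as numeric only if every non-null value is a number;
    # mixed columns fall through and keep the categorical statistics only.
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = mean(present)
        if len(present) >= 2:
            # Assumption: "inclusive" quartiles; other methods give other values.
            q1, median, q3 = quantiles(present, n=4, method="inclusive")
            summary.update({"q1": q1, "median": median, "q3": q3})
    return summary


def summarize(table):
    """Map each column name to its summary, i.e. one output row per column."""
    return {name: summarize_column(column) for name, column in table.items()}


# Hypothetical sample data: one numeric column and one mixed column.
table = {
    "age": [22, 38, 26, None, 35],
    "cabin": ["C85", None, "C85", "E46", 7],
}

for column, stats in summarize(table).items():
    print(column, stats)
```

The `cabin` column mixes strings and a number, so it only receives the three categorical statistics, while `age` receives the full numeric set.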

Configuration

A new procedure of type summary.statistics named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "summary.statistics",
    "params": {
        "inputData": <InputQuery>,
        "outputDataset": <OutputDatasetSpec>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default	Description
inputData InputQuery	An SQL statement that selects the input data. The query must not contain GROUP BY or HAVING clauses. Unlike most select expressions, it can only select whole columns, not expressions involving columns: X will work, but X + 1 will not. If you need derived values, first create a dataset that contains the derived columns, then run the query on that dataset instead.
outputDataset OutputDatasetSpec `{"type":"sparse.mutable"}`	Output dataset configuration. This may refer either to an existing dataset, or a fully specified but non-existing dataset which will be created by the procedure.
runOnCreation bool `true`	If true, the procedure will be run immediately. The response will contain an extra field called `firstRun` pointing to the URL of the run.
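
The sketch below fills in the parameters above with concrete values and illustrates the two-step approach for derived values: a first procedure materializes the derived column into a new dataset, and summary.statistics then selects whole columns from it. The dataset names (`passengers`, `passengers_derived`, `passengers_summary`), the column names, the procedure ids, and the helper functions are hypothetical. Using a `transform` procedure for the first step is one possible choice, not something this page prescribes, and `mldb` is assumed to be the same connection object used in the snippet above.

```python
def derived_columns_step(source, target):
    """Config for a first step that stores a derived column in a new dataset.

    Assumption: a `transform` procedure is used to materialize the columns.
    """
    return {
        "type": "transform",
        "params": {
            # The expression lives here, not in the summary.statistics query.
            "inputData": "select age, fare + 1 as fare_plus_one from " + source,
            "outputDataset": {"id": target, "type": "sparse.mutable"},
            "runOnCreation": True,
        },
    }


def summary_step(source, target):
    """Config for summary.statistics that selects whole columns only."""
    return {
        "type": "summary.statistics",
        "params": {
            "inputData": "select age, fare_plus_one from " + source,
            "outputDataset": {"id": target, "type": "sparse.mutable"},
            "runOnCreation": True,
        },
    }


def run(mldb):
    """Create both procedures in order; each runs immediately on creation."""
    mldb.put(
        "/v1/procedures/derive_columns",
        derived_columns_step("passengers", "passengers_derived"),
    )
    mldb.put(
        "/v1/procedures/summarize",
        summary_step("passengers_derived", "passengers_summary"),
    )
```

Calling `run(mldb)` against a live instance would leave one row per selected column in the hypothetical `passengers_summary` dataset.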

Examples

The Predicting Titanic Survival demo notebook