Stats Table Procedure

This procedure does a rolling sum of keys and one or more outcomes over multiple columns. It is the result of creating statistical tables, one per column, to track co-occurrence of each key with each outcome, over all rows in a table.

This procedure is related to the the statsTable.bagOfWords.train procedure type but is different in the sense that the statsTable.bagOfWords.train procedure is meant to operate on sparse bags of words. This procedure expects a dense input dataset with a fixed number of columns. Each column will have its own stats table. The keys used in the stats table will be the value of each cell.

When using this procedure, a new output dataset will be created. It will have the same number of rows as the input dataset. The counts in each row of the output dataset will represent the counts in the stats tables at that point in the input dataset while going sequentially through it.

The resulting statistical tables can be persisted using the statsTableFileUrl parameter and used later on to lookup counts using the statsTable.getCounts function type.

Configuration

A new procedure of type statsTable.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "statsTable.train",
    "params": {
        "trainingData": <InputQuery>,
        "outputDataset": <OutputDatasetSpec>,
        "outcomes": <ARRAY [ TUPLE [ string, string ] ]>,
        "statsTableFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, DefaultDescription

trainingData
InputQuery

SQL query to select the data on which the rolling operations will be performed.

outputDataset
OutputDatasetSpec
{"type":"sparse.mutable"}

Output dataset

outcomes
ARRAY [ TUPLE [ string, string ] ]

List of expressions to generate the outcomes. Each can be any expression involving the columns in the dataset. The type of the outcomes must be a boolean (0 or 1)

statsTableFileUrl
Url

URL where the model file (with extension '.st') should be saved. This file can be loaded by the statsTable.getCounts function type. This parameter is optional unless the functionName parameter is used.

functionName
string

If specified, an instance of the statsTable.getCounts function type of this name will be created using the trained stats tables. Note that to use this parameter, the statsTableFileUrl must also be provided.

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.

Example

Let's assume a dataset made up of data from a real-time bidding online campaign. Each row represents a bid request and each column represents a field in the bid request. We're interested in tracking statistics for each possible values of each field to determine which ones are strongly correlated with the outcomes. The outcomes we want to track are whether there was a click on the ad and/or a purchase was then made on the advertiser's website.

rowName host region click_on_ad purchase_value
br_1 patate.com qc 1 0
br_2 carotte.net on 1 25
br_3 patate.com on 0 12
br_4 carotte.net on 0 0

Assume this partial configuration:

{
    "trainingData": "* EXCLUDING(click_on_ad, purchase_value)",
    "outcomes": [
        ["click", "click_on_ad = 1"],
        ["purchase", "click_on_ad = 1 AND purchase_value > 0"]
    ]
}

The output dataset will contain:

rowName trial-host click-host purchase-host trial-region click-region purchase-region
br_1 0 0 0 0 0 0
br_2 0 0 0 0 0 0
br_3 1 0 1 1 1 0
br_4 1 0 1 2 1 1

Note that the effect of a row on the counts is taken into account after the counts for that specific row are generated. This is done to prevent introducing bias as the values for the current row would take into account the row's outcomes.

See also