Stats Table Procedure

This procedure does a rolling sum of keys and one or more outcomes over multiple columns. It is the result of creating statistical tables, one per column, to track co-occurrence of each key with each outcome, over all rows in a table.

This procedure is related to the the statsTable.bagOfWords.train procedure type but is different in the sense that the statsTable.bagOfWords.train procedure is meant to operate on sparse bags of words. This procedure expects a dense input dataset with a fixed number of columns. Each column will have its own stats table. The keys used in the stats table will be the value of each cell.

When using this procedure, a new output dataset will be created. It will have the same number of rows as the input dataset. The counts in each row of the output dataset will represent the counts in the stats tables at that point in the input dataset while going sequentially through it.

The resulting statistical tables can be persisted using the statsTableFileUrl parameter and used later on to lookup counts using the statsTable.getCounts function type.

Configuration

A new procedure of type statsTable.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "statsTable.train",
    "params": {
        "trainingData": <InputQuery>,
        "outputDataset": <OutputDatasetSpec>,
        "outcomes": <ARRAY [ TUPLE [ string, string ] ]>,
        "statsTableFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default	Description
trainingData InputQuery	SQL query to select the data on which the rolling operations will be performed.
outputDataset OutputDatasetSpec `{"type":"sparse.mutable"}`	Output dataset
outcomes ARRAY [ TUPLE [ string, string ] ]	List of expressions to generate the outcomes. Each can be any expression involving the columns in the dataset. The type of the outcomes must be a boolean (0 or 1)
statsTableFileUrl Url	URL where the model file (with extension '.st') should be saved. This file can be loaded by the `statsTable.getCounts` function type. This parameter is optional unless the `functionName` parameter is used.
functionName string	If specified, an instance of the `statsTable.getCounts` function type of this name will be created using the trained stats tables. Note that to use this parameter, the `statsTableFileUrl` must also be provided.
runOnCreation bool `true`	If true, the procedure will be run immediately. The response will contain an extra field called `firstRun` pointing to the URL of the run.

Example

Let's assume a dataset made up of data from a real-time bidding online campaign. Each row represents a bid request and each column represents a field in the bid request. We're interested in tracking statistics for each possible values of each field to determine which ones are strongly correlated with the outcomes. The outcomes we want to track are whether there was a click on the ad and/or a purchase was then made on the advertiser's website.

rowName	host	region	click_on_ad	purchase_value
br_1	patate.com	qc	1	0
br_2	carotte.net	on	1	25
br_3	patate.com	on	0	12
br_4	carotte.net	on	0	0

Assume this partial configuration:

{
    "trainingData": "* EXCLUDING(click_on_ad, purchase_value)",
    "outcomes": [
        ["click", "click_on_ad = 1"],
        ["purchase", "click_on_ad = 1 AND purchase_value > 0"]
    ]
}

The output dataset will contain:

rowName	trial-host	purchase-host	trial-region	click-region	purchase-region
br_1	0	0	0	0	0
br_2	0	0	0	0	0
br_3	1	1	1	1	0
br_4	1	1	2	1	1

Note that the effect of a row on the counts is taken into account after the counts for that specific row are generated. This is done to prevent introducing bias as the values for the current row would take into account the row's outcomes.

Stats Table Procedure

Configuration

Example

See also