This procedure does a rolling sum of keys and one or more outcomes over multiple columns. It is the result of creating statistical tables, one per column, to track co-occurrence of each key with each outcome, over all rows in a table.
This procedure is related to the the statsTable.bagOfWords.train
procedure type but
is different in the sense that the statsTable.bagOfWords.train
procedure is meant to
operate on sparse bags of words. This procedure expects a dense input dataset with a fixed
number of columns. Each column will have its own stats table. The keys used in the stats table
will be the value of each cell.
When using this procedure, a new output dataset will be created. It will have the same number of rows as the input dataset. The counts in each row of the output dataset will represent the counts in the stats tables at that point in the input dataset while going sequentially through it.
The resulting statistical tables can be persisted using the statsTableFileUrl
parameter
and used later on to lookup counts using the statsTable.getCounts
function type.
A new procedure of type statsTable.train
named <id>
can be created as follows:
mldb.put("/v1/procedures/"+<id>, {
"type": "statsTable.train",
"params": {
"trainingData": <InputQuery>,
"outputDataset": <OutputDatasetSpec>,
"outcomes": <ARRAY [ TUPLE [ string, string ] ]>,
"statsTableFileUrl": <Url>,
"functionName": <string>,
"runOnCreation": <bool>
}
})
with the following key-value definitions for params
:
Field, Type, Default | Description |
---|---|
trainingData | SQL query to select the data on which the rolling operations will be performed. |
outputDataset | Output dataset |
outcomes | List of expressions to generate the outcomes. Each can be any expression involving the columns in the dataset. The type of the outcomes must be a boolean (0 or 1) |
statsTableFileUrl | URL where the model file (with extension '.st') should be saved. This file can be loaded by the |
functionName | If specified, an instance of the |
runOnCreation | If true, the procedure will be run immediately. The response will contain an extra field called |
Let's assume a dataset made up of data from a real-time bidding online campaign. Each row represents a bid request and each column represents a field in the bid request. We're interested in tracking statistics for each possible values of each field to determine which ones are strongly correlated with the outcomes. The outcomes we want to track are whether there was a click on the ad and/or a purchase was then made on the advertiser's website.
rowName | host | region | click_on_ad | purchase_value |
---|---|---|---|---|
br_1 | patate.com | qc | 1 | 0 |
br_2 | carotte.net | on | 1 | 25 |
br_3 | patate.com | on | 0 | 12 |
br_4 | carotte.net | on | 0 | 0 |
Assume this partial configuration:
{
"trainingData": "* EXCLUDING(click_on_ad, purchase_value)",
"outcomes": [
["click", "click_on_ad = 1"],
["purchase", "click_on_ad = 1 AND purchase_value > 0"]
]
}
The output dataset will contain:
rowName | trial-host | click-host | purchase-host | trial-region | click-region | purchase-region |
---|---|---|---|---|---|---|
br_1 | 0 | 0 | 0 | 0 | 0 | 0 |
br_2 | 0 | 0 | 0 | 0 | 0 | 0 |
br_3 | 1 | 0 | 1 | 1 | 1 | 0 |
br_4 | 1 | 0 | 1 | 2 | 1 | 1 |
Note that the effect of a row on the counts is taken into account after the counts for that specific row are generated. This is done to prevent introducing bias as the values for the current row would take into account the row's outcomes.
statsTable.getCounts
function type does a lookup in stats tables for a row of keys.experimental.statsTable.derivedColumnsGenerator
procedure type can be used to generate derived columns from stats table counts.statsTable.bagOfWords.train
procedure type trains a stats table on bags of words.