Bag Of Words Stats Table Procedure

This procedure type is meant to work with bags of words, as returned by the tokenize function. It creates a statistical table to track the co-occurrence of each word with the specified outcome, over all rows in a table.

It is related to the statsTable.train procedure type, with one key difference: statsTable.train operates on a dense dataset with a fixed number of columns and builds a separate stats table for each column, whereas this procedure treats columns as words in a document and trains a single stats table for all words.
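To make the "columns as words" idea concrete, the sketch below builds a bag of words for a single document in plain Python. It only approximates the kind of row a tokenizer produces (one column per word, with its count as the value); the helper name bag_of_words is hypothetical and is not part of MLDB.

```python
from collections import Counter

def bag_of_words(text, split_char=" "):
    """Hypothetical helper: map each token in `text` to its count."""
    # Drop empty tokens produced by consecutive separators.
    tokens = [tok for tok in text.split(split_char) if tok]
    return dict(Counter(tokens))

# Each key plays the role of a column name, each value the cell content.
row = bag_of_words("apples are red and apples are sweet")
print(row)
```

Because every document yields a different set of keys, the resulting dataset is sparse, which is why a single table keyed by word is used rather than one table per column.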

The resulting statistical table can be persisted using the statsTableFileUrl parameter and used later to look up counts with the statsTable.bagOfWords.posneg function type.

Configuration

A new procedure of type statsTable.bagOfWords.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "statsTable.bagOfWords.train",
    "params": {
        "trainingData": <InputQuery>,
        "outcomes": <ARRAY [ TUPLE [ string, string ] ]>,
        "statsTableFileUrl": <Url>,
        "outputDataset": <OutputDatasetSpec (Optional)>,
        "functionName": <string>,
        "functionOutcomeToUse": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default
Description

trainingData
InputQuery

SQL query that selects the training data. Each column of the result is treated as a word in that row's bag of words.

outcomes
ARRAY [ TUPLE [ string, string ] ]

List of [name, expression] pairs used to generate the outcomes. Each expression can involve any columns in the dataset and must evaluate to a boolean (0 or 1).

statsTableFileUrl
Url

URL where the model file (with extension '.st') should be saved. This file can be loaded by the statsTable.bagOfWords.posneg function type. This parameter is optional unless the functionName parameter is used.

outputDataset
OutputDatasetSpec (Optional)
{"type":"tabular"}

Output dataset with the total count for each word along with its co-occurrence count with each outcome.

functionName
string

If specified, an instance of the statsTable.bagOfWords.posneg function type of this name will be created using the trained stats tables and that function type's default parameters. Note that to use this parameter, the statsTableFileUrl must also be provided.

functionOutcomeToUse
string

When functionName is provided, an instance of the statsTable.bagOfWords.posneg function type with the outcome of this name will be created. This parameter represents the outcomeToUse field of the statsTable.bagOfWords.posneg function type.

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
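The dependency between functionName and statsTableFileUrl described above can be checked on the client before the request is sent. The following is a minimal sketch, assuming the configuration is assembled as a plain Python dict; the helper build_train_config and the function name my_st_posneg are hypothetical and not part of MLDB.

```python
def build_train_config(training_data, outcomes, stats_table_file_url=None,
                       function_name=None, function_outcome_to_use=None):
    """Hypothetical helper: assemble the procedure config as a dict."""
    # functionName can only be used when the stats table is also saved.
    if function_name is not None and stats_table_file_url is None:
        raise ValueError("statsTableFileUrl is required when functionName is set")
    params = {"trainingData": training_data, "outcomes": outcomes}
    optional = {
        "statsTableFileUrl": stats_table_file_url,
        "functionName": function_name,
        "functionOutcomeToUse": function_outcome_to_use,
    }
    # Only send the optional parameters that were actually provided.
    params.update({k: v for k, v in optional.items() if v is not None})
    return {"type": "statsTable.bagOfWords.train", "params": params}

config = build_train_config(
    "SELECT tokenize(text, {splitChars: ' '}) AS * FROM text_dataset",
    [["label", "label IS NOT NULL"]],
    stats_table_file_url="file://my_st.st",
    function_name="my_st_posneg",      # hypothetical function name
    function_outcome_to_use="label",   # must match an outcome name
)
# With a live MLDB connection: mldb.put("/v1/procedures/my_st", config)
```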

Example

Suppose we have the following dataset called text_dataset:

rowName  text                    label
a        i like apples           1
b        i like juice
c        what about bananas?
d        apples are red          1
e        bananas are yellow
f        oranges are ... orange

If we run the following procedure:

mldb.put("/v1/procedures/my_st", {
    "type": "statsTable.bagOfWords.train",
    "params": {
        "trainingData": "SELECT tokenize(text, {splitChars: ' '}) as * FROM text_dataset",
        "outcomes": [["label", "label IS NOT NULL"]],
        "statsTableFileUrl": "file://my_st.st"
    }
})

Below are some examples of the values in the resulting stats table:

word P(label)
apples 1
are 0.333
i 0.5
oranges 0
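The values above can be reproduced by hand. The sketch below is a plain-Python approximation, not MLDB's implementation: it assumes P(label) is the number of rows containing the word where the outcome is true, divided by the number of rows containing the word, and it counts each word once per row (no word repeats within a row in this dataset, so that assumption does not affect the result).

```python
from collections import defaultdict

# (text, label) pairs from text_dataset; None marks a missing label.
rows = [
    ("i like apples", 1),
    ("i like juice", None),
    ("what about bananas?", None),
    ("apples are red", 1),
    ("bananas are yellow", None),
    ("oranges are ... orange", None),
]

trials = defaultdict(int)     # rows containing the word
positives = defaultdict(int)  # rows containing the word where the outcome is true

for text, label in rows:
    outcome = label is not None        # mirrors the expression "label IS NOT NULL"
    for word in set(text.split(" ")):  # assumption: count each word once per row
        trials[word] += 1
        positives[word] += outcome     # True adds 1, False adds 0

def p_label(word):
    """Fraction of rows containing `word` where the outcome is true."""
    return positives[word] / trials[word]

for word in ["apples", "are", "i", "oranges"]:
    print(word, round(p_label(word), 3))
```

For instance, "are" appears in rows d, e and f, and only row d has a label, which gives 1/3.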

See also