Bag Of Words Stats Table Procedure

This procedure type is meant to work with bags of words, as returned by the tokenize function. It creates a statistical table to track the co-occurrence of each word with the specified outcome, over all rows in a table.

It is related to the statsTable.train procedure type, but differs in one respect: statsTable.train is meant to operate on a dense dataset with a fixed number of columns, where each column has its own stats table. This procedure instead treats columns as words in a document and trains a single stats table for all words.

The resulting statistical table can be persisted using the statsTableFileUrl parameter and used later to look up counts with the statsTable.bagOfWords.posneg function type.

Configuration

A new procedure of type statsTable.bagOfWords.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "statsTable.bagOfWords.train",
    "params": {
        "trainingData": <InputQuery>,
        "outcomes": <ARRAY [ TUPLE [ string, string ] ]>,
        "statsTableFileUrl": <Url>,
        "outputDataset": <OutputDatasetSpec (Optional)>,
        "functionName": <string>,
        "functionOutcomeToUse": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default	Description
trainingData InputQuery	SQL query to select the data from which the stats table will be computed. Each selected column is treated as a word.
outcomes ARRAY [ TUPLE [ string, string ] ]	List of [name, expression] pairs used to generate the outcomes. Each expression can involve any of the columns in the dataset and must evaluate to a boolean (0 or 1).
statsTableFileUrl Url	URL where the model file (with extension '.st') should be saved. This file can be loaded by the `statsTable.bagOfWords.posneg` function type. This parameter is optional unless the `functionName` parameter is used.
outputDataset OutputDatasetSpec (Optional) `{"type":"tabular"}`	Output dataset with the total count for each word along with its co-occurrence count with each outcome.
functionName string	If specified, an instance of the `statsTable.bagOfWords.posneg` function type of this name will be created using the trained stats tables and that function type's default parameters. Note that to use this parameter, the `statsTableFileUrl` must also be provided.
functionOutcomeToUse string	When `functionName` is provided, an instance of the `statsTable.bagOfWords.posneg` function type with the outcome of this name will be created. This parameter represents the `outcomeToUse` field of the `statsTable.bagOfWords.posneg` function type.
runOnCreation bool `true`	If true, the procedure will be run immediately. The response will contain an extra field called `firstRun` pointing to the URL of the run.
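
The dependency between `functionName` and `statsTableFileUrl` described above can be checked before the request is sent. The following is a minimal sketch only: the helper `build_bow_stats_table_config` and the function name `my_st_posneg` are hypothetical and not part of MLDB. It builds a plain Python dict that could be passed as the body of the `mldb.put` call shown earlier.

```python
import json


def build_bow_stats_table_config(training_data, outcomes,
                                 stats_table_file_url=None,
                                 function_name=None,
                                 function_outcome_to_use=None,
                                 run_on_creation=True):
    """Hypothetical helper: assemble the JSON body for a
    statsTable.bagOfWords.train procedure."""
    # Per the parameter table, functionName needs statsTableFileUrl.
    if function_name is not None and stats_table_file_url is None:
        raise ValueError("functionName requires statsTableFileUrl")

    params = {
        "trainingData": training_data,
        # Each outcome is a [name, expression] pair.
        "outcomes": [list(pair) for pair in outcomes],
        "runOnCreation": run_on_creation,
    }
    # Optional fields are only included when a value was given.
    if stats_table_file_url is not None:
        params["statsTableFileUrl"] = stats_table_file_url
    if function_name is not None:
        params["functionName"] = function_name
    if function_outcome_to_use is not None:
        params["functionOutcomeToUse"] = function_outcome_to_use

    return {"type": "statsTable.bagOfWords.train", "params": params}


config = build_bow_stats_table_config(
    training_data="SELECT tokenize(text, {splitChars: ' '}) as * FROM text_dataset",
    outcomes=[("label", "label IS NOT NULL")],
    stats_table_file_url="file://my_st.st",
    function_name="my_st_posneg",      # hypothetical function name
    function_outcome_to_use="label",
)
print(json.dumps(config, indent=4))
```

The resulting dict has the same shape as the configuration template above, so it could be sent with `mldb.put("/v1/procedures/my_st", config)` in an environment where `mldb` is available.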

Example

Suppose we have the following dataset called text_dataset:

rowName	text	label
a	i like apples	1
b	i like juice
c	what about bananas?
d	apples are red	1
e	bananas are yellow
f	oranges are ... orange

Suppose we run the following procedure:

mldb.put("/v1/procedures/my_st", {
    "type": "statsTable.bagOfWords.train",
    "params": {
        "trainingData": "SELECT tokenize(text, {splitChars: ' '}) as * FROM text_dataset",
        "outcomes": [["label", "label IS NOT NULL"]],
        "statsTableFileUrl": "file://my_st.st"
    }
})

Below are some examples of the values in the resulting stats table:

word	P(label)
apples	1
are	0.333
i	0.5
oranges	0
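
The values above can be reproduced by hand. The sketch below is plain Python and does not use MLDB; it only illustrates the counting, under the assumptions that each word is counted once per row and that P(label) is the number of rows containing the word where the outcome is true, divided by the number of rows containing the word. The function name `train_bag_of_words_stats` is hypothetical.

```python
from collections import defaultdict

# The text_dataset from the example; None stands for a missing label.
rows = {
    "a": ("i like apples", 1),
    "b": ("i like juice", None),
    "c": ("what about bananas?", None),
    "d": ("apples are red", 1),
    "e": ("bananas are yellow", None),
    "f": ("oranges are ... orange", None),
}


def train_bag_of_words_stats(rows):
    """Count, for every word, the rows it appears in (trials) and
    the rows it appears in where the outcome is true (positives)."""
    trials = defaultdict(int)
    positives = defaultdict(int)
    for text, label in rows.values():
        outcome = label is not None          # mirrors "label IS NOT NULL"
        # Split on spaces only, like tokenize(text, {splitChars: ' '}).
        for word in set(text.split(" ")):    # assumption: once per row
            trials[word] += 1
            if outcome:
                positives[word] += 1
    return trials, positives


trials, positives = train_bag_of_words_stats(rows)
p_label = {word: positives[word] / trials[word] for word in trials}

for word in ["apples", "are", "i", "oranges"]:
    print(word, round(p_label[word], 3))
```

For example, "are" appears in rows d, e and f, and only row d has a label, which gives 1/3 ≈ 0.333.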
