This procedure type is meant to work with bags of words, as returned by the tokenize function. It creates a statistical table to track the co-occurrence of each
word with the specified outcome, over all rows in a table.
It is related to the statsTable.train procedure type, but differs in that statsTable.train is meant to operate on a dense dataset composed of a
fixed number of columns, where each column has its own stats table. This procedure instead
treats columns as words in a document and trains a single stats table for all words.
The resulting statistical table can be persisted using the statsTableFileUrl parameter
and used later to look up counts with the statsTable.bagOfWords.posneg function type.
A new procedure of type statsTable.bagOfWords.train named <id> can be created as follows:
mldb.put("/v1/procedures/"+<id>, {
"type": "statsTable.bagOfWords.train",
"params": {
"trainingData": <InputQuery>,
"outcomes": <ARRAY [ TUPLE [ string, string ] ]>,
"statsTableFileUrl": <Url>,
"outputDataset": <OutputDatasetSpec (Optional)>,
"functionName": <string>,
"functionOutcomeToUse": <string>,
"runOnCreation": <bool>
}
})
with the following key-value definitions for params:
Field, Type, Default | Description |
---|---|
trainingData | SQL query to select the data on which the stats table will be trained. |
outcomes | List of expressions to generate the outcomes. Each can be any expression involving the columns in the dataset. The type of the outcomes must be a boolean (0 or 1). |
statsTableFileUrl | URL where the model file (with extension '.st') should be saved. This file can be loaded by the |
outputDataset | Output dataset with the total counts for each word along with the co-occurrence count with each outcome. |
functionName | If specified, an instance of the |
functionOutcomeToUse | When |
runOnCreation | If true, the procedure will be run immediately. The response will contain an extra field called |
Suppose we have the following dataset called text_dataset:
rowName | text | label |
---|---|---|
a | i like apples | 1 |
b | i like juice | |
c | what about bananas? | |
d | apples are red | 1 |
e | bananas are yellow | |
f | oranges are ... orange | |
If we run the following procedure:
mldb.put("/v1/procedures/my_st", {
"type": "statsTable.bagOfWords.train",
"params": {
"trainingData": "SELECT tokenize(text, {splitChars: ' '}) as * FROM text_dataset",
"outcomes": [["label", "label IS NOT NULL"]],
"statsTableFileUrl": "file://my_st.st"
}
})
Below are some examples of the values in the resulting stats table:
word | P(label) |
---|---|
apples | 1 |
are | 0.333 |
i | 0.5 |
oranges | 0 |
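The values above can be reproduced by hand. The following is a minimal sketch in plain Python, not MLDB's implementation: it assumes each word is counted once per row, that a missing label is represented by `None`, and that P(label) is the number of rows containing the word where the outcome is true divided by the number of rows containing the word. The helper name `train_bag_of_words_table` is hypothetical.

```python
from collections import Counter

# Rows of the example text_dataset as (rowName, text, label).
# None stands in for a missing (NULL) label.
rows = [
    ("a", "i like apples", 1),
    ("b", "i like juice", None),
    ("c", "what about bananas?", None),
    ("d", "apples are red", 1),
    ("e", "bananas are yellow", None),
    ("f", "oranges are ... orange", None),
]


def train_bag_of_words_table(rows):
    """Return {word: (trials, hits)} for the outcome 'label IS NOT NULL'."""
    trials = Counter()  # number of rows in which the word appears
    hits = Counter()    # number of those rows where the outcome is true
    for _name, text, label in rows:
        outcome = label is not None
        # Split on spaces only, mirroring splitChars ' ' in the example query.
        for word in set(text.split(" ")):
            trials[word] += 1
            if outcome:
                hits[word] += 1
    return {word: (trials[word], hits[word]) for word in trials}


table = train_bag_of_words_table(rows)
probabilities = {w: round(h / t, 3) for w, (t, h) in table.items()}

for word in ("apples", "are", "i", "oranges"):
    print(word, table[word], probabilities[word])
```

A single table is shared by all words, so "are" accumulates counts from rows d, e and f regardless of where it appears in the sentence.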
The statsTable.bagOfWords.posneg function type is used to do lookups in a stats table trained by this procedure and return the matching top negative or positive words.
The statsTable.train procedure type is used to train stats tables on dense datasets.