This procedure type is meant to work with bags of words, as returned by the
tokenize function. It creates a statistical table to track the co-occurrence of each
word with the specified outcome, over all rows in a table.
It is related to the statsTable.train procedure type, but differs in that
statsTable.train is meant to operate on a dense dataset with a fixed number of
columns, where each column gets its own stats table. This procedure instead
treats columns as words in a document and trains a single stats table for all words.
The resulting statistical table can be persisted using the statsTableFileUrl parameter
and used later to look up counts using the statsTable.bagOfWords.posneg function type.
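To make the single-table idea concrete, here is a minimal pure-Python sketch of the kind of bookkeeping described above. It is not MLDB's implementation: the function name `train_bag_of_words_stats`, the in-memory layout, the input shape, and the choice to count each distinct word once per row are all assumptions made for illustration.

```python
from collections import defaultdict


def train_bag_of_words_stats(rows, outcome_names):
    """Build one shared table: word -> total row count and per-outcome counts.

    rows: iterable of (bag, outcomes) pairs, where bag is a set of words and
    outcomes maps an outcome name to a boolean. (Hypothetical input shape.)
    """
    # A single table for every word, rather than one table per column.
    table = defaultdict(
        lambda: {"total": 0, "outcomes": dict.fromkeys(outcome_names, 0)}
    )
    for bag, outcomes in rows:
        for word in bag:  # assumption: each distinct word counts once per row
            entry = table[word]
            entry["total"] += 1
            for name in outcome_names:
                if outcomes.get(name):
                    entry["outcomes"][name] += 1
    return dict(table)


example_rows = [
    ({"i", "like", "apples"}, {"label": True}),
    ({"i", "like", "juice"}, {"label": False}),
]
stats = train_bag_of_words_stats(example_rows, ["label"])
print(stats["i"])  # {'total': 2, 'outcomes': {'label': 1}}
```

Keeping the total and the per-outcome counts side by side is what lets a later lookup turn counts into a ratio for any word.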
A new procedure of type statsTable.bagOfWords.train named <id> can be created as follows:
mldb.put("/v1/procedures/"+<id>, {
    "type": "statsTable.bagOfWords.train",
    "params": {
        "trainingData": <InputQuery>,
        "outcomes": <ARRAY [ TUPLE [ string, string ] ]>,
        "statsTableFileUrl": <Url>,
        "outputDataset": <OutputDatasetSpec (Optional)>,
        "functionName": <string>,
        "functionOutcomeToUse": <string>,
        "runOnCreation": <bool>
    }
})
with the following key-value definitions for params:
| Field, Type, Default | Description |
|---|---|
| trainingData | SQL query to select the data on which the stats table will be trained. |
| outcomes | List of expressions to generate the outcomes. Each can be any expression involving the columns in the dataset. The type of the outcomes must be a boolean (0 or 1). |
| statsTableFileUrl | URL where the model file (with extension '.st') should be saved. This file can be loaded by the |
| outputDataset | Output dataset with the total counts for each word along with the cooccurrence count with each outcome. |
| functionName | If specified, an instance of the |
| functionOutcomeToUse | When |
| runOnCreation | If true, the procedure will be run immediately. The response will contain an extra field called |
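As a rough illustration of how these parameters fit together, the sketch below assembles the JSON body for the procedure in plain Python. The helper `make_bow_train_config` is hypothetical and not part of MLDB. Reading each outcome tuple as a `[name, expression]` pair is an assumption based on the parameter type shown above, and the query, outcome, and URL values are placeholders. Leaving out `runOnCreation` when it is not given is simply a design choice of this sketch.

```python
import json

PROCEDURE_TYPE = "statsTable.bagOfWords.train"


def make_bow_train_config(training_data, outcomes, stats_table_file_url,
                          run_on_creation=None):
    """Hypothetical helper: build the procedure config as a plain dict."""
    for outcome in outcomes:
        # Assumption: each outcome is a pair of strings, [name, expression].
        if len(outcome) != 2 or not all(isinstance(p, str) for p in outcome):
            raise ValueError(f"expected a pair of strings, got {outcome!r}")
    params = {
        "trainingData": training_data,
        "outcomes": [list(outcome) for outcome in outcomes],
        "statsTableFileUrl": stats_table_file_url,
    }
    if run_on_creation is not None:
        # Only sent when explicitly requested by the caller.
        params["runOnCreation"] = bool(run_on_creation)
    return {"type": PROCEDURE_TYPE, "params": params}


config = make_bow_train_config(
    "SELECT tokenize(text, {splitChars: ' '}) as * FROM text_dataset",
    [("label", "label IS NOT NULL")],
    "file://my_st.st",
)
print(json.dumps(config, indent=2))
```

Validating the outcome pairs before sending the request catches a malformed tuple locally rather than as a server-side error.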
Suppose we have the following dataset called text_dataset:
| rowName | text | label |
|---|---|---|
| a | i like apples | 1 |
| b | i like juice | |
| c | what about bananas? | |
| d | apples are red | 1 |
| e | bananas are yellow | |
| f | oranges are ... orange | |
If we run the following procedure:
mldb.put("/v1/procedures/my_st", {
    "type": "statsTable.bagOfWords.train",
    "params": {
        "trainingData": "SELECT tokenize(text, {splitChars: ' '}) as * FROM text_dataset",
        "outcomes": [["label", "label IS NOT NULL"]],
        "statsTableFileUrl": "file://my_st.st"
    }
})
Below are some examples of the values in the resulting stats table:
| word | P(label) |
|---|---|
| apples | 1 |
| are | 0.333 |
| i | 0.5 |
| oranges | 0 |
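The values in this table can be checked by hand. The following pure-Python sketch recomputes them from the example dataset, under a few assumptions: tokenization is approximated by splitting on single spaces, a word is counted once per row, and P(label) is taken to be the fraction of rows containing the word for which `label IS NOT NULL` holds. The function name `word_probabilities` is made up for this illustration and does not correspond to an MLDB call.

```python
# Rows of text_dataset as (text, label); None stands for a missing label.
rows = [
    ("i like apples", 1),
    ("i like juice", None),
    ("what about bananas?", None),
    ("apples are red", 1),
    ("bananas are yellow", None),
    ("oranges are ... orange", None),
]


def word_probabilities(rows):
    """Return word -> fraction of rows containing it whose label is set."""
    totals, positives = {}, {}
    for text, label in rows:
        outcome = label is not None  # mirrors "label IS NOT NULL"
        for word in set(text.split(" ")):  # rough stand-in for tokenize
            totals[word] = totals.get(word, 0) + 1
            positives[word] = positives.get(word, 0) + int(outcome)
    return {word: positives[word] / totals[word] for word in totals}


probs = word_probabilities(rows)
for word in ("apples", "are", "i", "oranges"):
    print(word, round(probs[word], 3))
```

For instance, "are" appears in rows d, e and f, and only row d has a label, which gives 1/3 ≈ 0.333. Splitting on spaces alone also keeps "bananas?" and "bananas" as distinct words.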
The statsTable.bagOfWords.posneg function type is used to do lookups in a stats table trained by this procedure and returns the matching top negative or positive words.
The statsTable.train procedure type is used to train stats tables on dense datasets.