The TF-IDF procedure trains a TF-IDF model from a corpus. TF-IDF is used to find how relevant certain words are to a document, by combining the term frequency (TF), i.e. how frequently the term occurs in the document, with the inverse document frequency (IDF), i.e. a weight that decreases the more documents of a reference corpus the term appears in.
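As a rough illustration (the exact formula used is not described here, and several variants exist), a common TF-IDF weighting for a term t in a document d, over a corpus of N documents of which df(t) contain t, is:

```
tfidf(t, d) = tf(t, d) * log(N / df(t))
```

Terms that occur often in a document but rarely in the rest of the corpus therefore receive the highest weights.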
To apply TF-IDF weighting to a corpus, this procedure needs to be run first to produce the document frequency dataset. The term weighting can then be applied by loading it with the `tfidf` function type.
A new procedure of type `tfidf.train` named `<id>` can be created as follows:
mldb.put("/v1/procedures/"+<id>, {
"type": "tfidf.train",
"params": {
"trainingData": <InputQuery>,
"outputDataset": <OutputDatasetSpec (Optional)>,
"modelFileUrl": <Url>,
"functionName": <string>,
"runOnCreation": <bool>
}
})
with the following key-value definitions for `params`:
| Field | Type | Description |
|---|---|---|
| `trainingData` | InputQuery | An SQL query providing the input to the tfidf procedure. Rows represent documents, and column names are terms. If a cell contains anything other than 0, the document is counted as containing that term. |
| `outputDataset` | OutputDatasetSpec (Optional) | This dataset will contain one row for each term in the input. The row name will be the term and the column value will be the number of documents the term appears in. |
| `modelFileUrl` | Url | URL where the model file (with extension '.idf') should be saved. This file can be loaded by the `tfidf` function type. |
| `functionName` | string | If specified, an instance of the `tfidf` function with this name will be created from the trained model. |
| `runOnCreation` | bool | If true, the procedure will be run immediately. The response will contain an extra field called `firstRun` pointing to the URL of the run. |
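For illustration, the template above could be filled in as follows. All concrete values (the procedure id, the input dataset `tfidf_input`, the output dataset `df_counts`, the model path and the function name `my_tfidf`) are hypothetical, not names required by the procedure:

```python
# Sketch: train document-frequency statistics on a prepared dataset whose rows
# are documents and whose columns are terms (all ids below are hypothetical).
mldb.put("/v1/procedures/my_tfidf_train", {
    "type": "tfidf.train",
    "params": {
        "trainingData": "SELECT * FROM tfidf_input",
        "outputDataset": {"id": "df_counts", "type": "sparse.mutable"},
        "modelFileUrl": "file://models/my_tfidf.idf",
        "functionName": "my_tfidf",
        "runOnCreation": True
    }
})
```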
In the input dataset of the procedure, each row is a document and each column is a term, with the value being something other than 0 if the term appears in the document. This can be prepared using the `tokenize` function or any other method.
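As one possible way to prepare that input, a `transform` procedure can tokenize a column of raw text. The dataset names (`raw_docs`, `tfidf_input`), the `text` column and the tokenizer options shown are assumptions for this sketch rather than requirements:

```python
# Sketch: build a document/term dataset from a hypothetical "raw_docs" dataset
# with one row per document and a "text" column.
mldb.put("/v1/procedures/prepare_tfidf_input", {
    "type": "transform",
    "params": {
        # tokenize() returns one column per token; "AS *" spreads those
        # columns onto the output row, giving one column per term.
        # The splitChars option shown here is an assumed setting.
        "inputData": """
            SELECT tokenize(lower(text), {' ' AS splitChars}) AS *
            FROM raw_docs
        """,
        "outputDataset": {"id": "tfidf_input", "type": "sparse.mutable"},
        "runOnCreation": True
    }
})
```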
In the output dataset, a single row is added; its columns are the terms present in the corpus, and each value is the number of documents in which the term appears.
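Assuming the training run wrote its counts to a dataset named `df_counts` (and that `mldb` is a pymldb `Connection`, which provides a `query` helper), the result can be inspected directly:

```python
# Sketch: look at the per-term document counts produced by the procedure.
mldb.query("SELECT * FROM df_counts LIMIT 5")
```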
The `tfidf` function type is used to find how relevant certain words are to a document.
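If no function was created at training time via `functionName`, one can be created afterwards from the saved model file. The function name `my_tfidf` is hypothetical, and the assumption here is that the `tfidf` function type takes the same `modelFileUrl` parameter described above:

```python
# Sketch: load the saved .idf model into a tfidf function for later use in queries.
mldb.put("/v1/functions/my_tfidf", {
    "type": "tfidf",
    "params": {
        # Assumed to be the same URL the training procedure wrote to.
        "modelFileUrl": "file://models/my_tfidf.idf"
    }
})
```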