TF-IDF Procedure

The TF-IDF procedure prepares the data used by a TF-IDF function. This function measures how relevant certain words are to a document by combining the term frequency (TF), i.e. how frequently the term appears in the document, with the inverse document frequency (IDF), which decreases the more frequently the term appears in a reference corpus.
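As a rough illustration of how the two quantities combine, the sketch below computes a TF-IDF score in plain Python. It assumes raw term counts for TF and log(N / df) for IDF. This is only one common choice, and the tfidf function type may use different variants. The names `tf_idf`, `doc` and `corpus` are hypothetical and not part of MLDB.

```python
import math

def tf_idf(term, doc, corpus):
    """Score `term` in `doc` (a list of tokens) against `corpus` (a list of docs)."""
    tf = doc.count(term)                       # raw term frequency in the document
    df = sum(1 for d in corpus if term in d)   # number of documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)           # rarer terms get a larger weight
    return tf * idf

# Tiny made-up corpus, for illustration only.
corpus = [
    ["montreal", "is", "a", "city"],
    ["quebec", "is", "a", "province"],
    ["montreal", "is", "in", "quebec"],
]
scores = {t: tf_idf(t, corpus[0], corpus) for t in corpus[0]}
```

In this toy corpus, "is" appears in every document and so scores zero, while "city" appears in only one document and scores highest.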

To apply TF-IDF weighting to a corpus, first run this procedure to produce the document frequency data. The term weighting can then be applied by loading the result with the tfidf function type.

Configuration

A new procedure of type tfidf.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "tfidf.train",
    "params": {
        "trainingData": <InputQuery>,
        "outputDataset": <OutputDatasetSpec (Optional)>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

trainingData
InputQuery

An SQL query providing the input to the tfidf procedure. Rows represent documents, and column names are terms. If a cell contains anything other than null, the document will be taken to contain the term. Note that this procedure will not normalize the terms in any way: for example, terms with different capitalization ('Montreal', 'montreal') or with accented characters ('Montréal') will all be considered to be different terms.

outputDataset
OutputDatasetSpec (Optional)
{"type":"sparse.mutable"}

This dataset will contain one row for each term in the input. The row name will be the term, and the column named count will contain the number of documents containing the term.

modelFileUrl
Url

URL where the model file (with extension '.idf') should be saved. This file can be loaded by the tfidf function type. This parameter is optional unless the functionName parameter is used.

functionName
string

If specified, an instance of the tfidf function type of this name will be created using the trained model. Note that to use this parameter, the modelFileUrl must also be provided.

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
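For illustration, a filled-in configuration might look like the sketch below. The dataset name `bag_of_words`, the output dataset id, the file path and the function name are all hypothetical. The helper `tfidf_train_config` is not part of MLDB; it only enforces the rule stated above that functionName requires modelFileUrl.

```python
import json

def tfidf_train_config(training_data, model_file_url=None, function_name=None,
                       output_dataset=None, run_on_creation=True):
    """Build the JSON body for a tfidf.train procedure (hypothetical helper)."""
    if function_name is not None and model_file_url is None:
        raise ValueError("functionName requires modelFileUrl")
    params = {"trainingData": training_data, "runOnCreation": run_on_creation}
    # Optional fields are only included when a value was given.
    if output_dataset is not None:
        params["outputDataset"] = output_dataset
    if model_file_url is not None:
        params["modelFileUrl"] = model_file_url
    if function_name is not None:
        params["functionName"] = function_name
    return {"type": "tfidf.train", "params": params}

config = tfidf_train_config(
    training_data="select * from bag_of_words",                   # hypothetical dataset
    output_dataset={"id": "doc_freq", "type": "sparse.mutable"},  # hypothetical id
    model_file_url="file://models/example.idf",                   # hypothetical path
    function_name="my_tfidf",                                     # hypothetical name
)
print(json.dumps(config, indent=4))
# With an MLDB connection object `mldb`, as in the template above, this would be sent as:
# mldb.put("/v1/procedures/my_tfidf_train", config)
```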

Input and Output Values

In the input dataset of the procedure, each row is a document and each column is a term, with the value being something other than 0 if the term appears in the document. This can be prepared using the tokenize function or any other method.
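Because the procedure does not normalize terms, it can help to normalize them while preparing the input. The sketch below is one possible approach in plain Python, assuming that lowercasing and accent stripping are acceptable for the corpus and that whitespace splitting is good enough as tokenization. `normalize_term` and `to_row` are hypothetical helpers, not MLDB functions.

```python
import unicodedata

def normalize_term(term):
    """Lowercase a term and strip combining accents (e.g. 'Montréal' -> 'montreal')."""
    decomposed = unicodedata.normalize("NFKD", term)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def to_row(text):
    """Turn a document into a {term: count} mapping, one column per term."""
    row = {}
    for token in text.split():   # naive whitespace tokenization (an assumption)
        term = normalize_term(token)
        row[term] = row.get(term, 0) + 1
    return row

row = to_row("Montreal montreal Montréal Quebec")
```

Without the normalization step, the three spellings of "Montreal" would end up as three separate columns.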

In the output dataset, a single row is added, with the columns being each term present in the corpus, and the value being the number of documents the term appears in.
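The document-frequency counting described here can be sketched in plain Python as follows. This only illustrates the counting and is not MLDB's implementation. It assumes each document is a mapping from term to value and treats None as absent, mirroring the null rule in the trainingData description. `document_frequencies` is a hypothetical name.

```python
from collections import Counter

def document_frequencies(rows):
    """Count, for each term, how many documents (rows) contain it."""
    df = Counter()
    for row in rows:
        # Each term counts once per document, however often it occurs there.
        df.update(term for term, value in row.items() if value is not None)
    return df

# Made-up documents, for illustration only.
rows = [
    {"montreal": 2, "city": 1},
    {"quebec": 1, "province": 1},
    {"montreal": 1, "quebec": 3, "city": None},
]
df = document_frequencies(rows)
```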

See also