The TF-IDF procedure trains a TF-IDF model from a corpus. TF-IDF is used to find how relevant certain words are to a document, by combining the term frequency (TF), i.e. how frequently the term occurs in the document, with the inverse document frequency (IDF), i.e. a weight that decreases the more documents of a reference corpus the term appears in.
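As a rough illustration (the exact formula used is not described here, and several variants exist), a common TF-IDF weighting for a term t in a document d, over a corpus of N documents of which df(t) contain t, is:

```
tfidf(t, d) = tf(t, d) * log(N / df(t))
```

Terms that occur often in a document but rarely in the rest of the corpus therefore receive the highest weights.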
To apply TF-IDF weighting to a corpus, this procedure needs to be run first to produce the document frequency dataset. The term weighting can then be applied by loading it with the `tfidf` function type.
A new procedure of type `tfidf.train` named `<id>` can be created as follows:
mldb.put("/v1/procedures/"+<id>, {
"type": "tfidf.train",
"params": {
"trainingData": <InputQuery>,
"outputDataset": <OutputDatasetSpec (Optional)>,
"modelFileUrl": <Url>,
"functionName": <string>,
"runOnCreation": <bool>
}
})
with the following key-value definitions for `params`:
| Field | Type | Description |
|---|---|---|
| `trainingData` | InputQuery | An SQL query providing the input to the tfidf procedure. Rows represent documents, and column names are terms. If a cell contains anything other than 0, the document is counted as containing that term. |
| `outputDataset` | OutputDatasetSpec (Optional) | This dataset will contain one row for each term in the input. The row name will be the term and the column value will be the number of documents the term appears in. |
| `modelFileUrl` | Url | URL where the model file (with extension '.idf') should be saved. This file can be loaded by the `tfidf` function type. |
| `functionName` | string | If specified, an instance of the `tfidf` function with this name will be created from the trained model. |
| `runOnCreation` | bool | If true, the procedure will be run immediately. The response will contain an extra field called `firstRun` pointing to the URL of the run. |
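For illustration, the template above could be filled in as follows. All concrete values (the procedure id, the input dataset `tfidf_input`, the output dataset `df_counts`, the model path and the function name `my_tfidf`) are hypothetical, not names required by the procedure:

```python
# Sketch: train document-frequency statistics on a prepared dataset whose rows
# are documents and whose columns are terms (all ids below are hypothetical).
mldb.put("/v1/procedures/my_tfidf_train", {
    "type": "tfidf.train",
    "params": {
        "trainingData": "SELECT * FROM tfidf_input",
        "outputDataset": {"id": "df_counts", "type": "sparse.mutable"},
        "modelFileUrl": "file://models/my_tfidf.idf",
        "functionName": "my_tfidf",
        "runOnCreation": True
    }
})
```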
In the input dataset of the procedure, each row is a document and each column is a term, with the value being something other than 0 if the term appears in the document. This can be prepared using the `tokenize` function or any other method.
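As one possible way to prepare that input, a `transform` procedure can tokenize a column of raw text. The dataset names (`raw_docs`, `tfidf_input`), the `text` column and the tokenizer options shown are assumptions for this sketch rather than requirements:

```python
# Sketch: build a document/term dataset from a hypothetical "raw_docs" dataset
# with one row per document and a "text" column.
mldb.put("/v1/procedures/prepare_tfidf_input", {
    "type": "transform",
    "params": {
        # tokenize() returns one column per token; "AS *" spreads those
        # columns onto the output row, giving one column per term.
        # The splitChars option shown here is an assumed setting.
        "inputData": """
            SELECT tokenize(lower(text), {' ' AS splitChars}) AS *
            FROM raw_docs
        """,
        "outputDataset": {"id": "tfidf_input", "type": "sparse.mutable"},
        "runOnCreation": True
    }
})
```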
In the output dataset, a single row is added; its columns are the terms present in the corpus, and each value is the number of documents in which the term appears.
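Assuming the training run wrote its counts to a dataset named `df_counts` (and that `mldb` is a pymldb `Connection`, which provides a `query` helper), the result can be inspected directly:

```python
# Sketch: look at the per-term document counts produced by the procedure.
mldb.query("SELECT * FROM df_counts LIMIT 5")
```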
The `tfidf` function type is used to find how relevant certain words are to a document.
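If no function was created at training time via `functionName`, one can be created afterwards from the saved model file. The function name `my_tfidf` is hypothetical, and the assumption here is that the `tfidf` function type takes the same `modelFileUrl` parameter described above:

```python
# Sketch: load the saved .idf model into a tfidf function for later use in queries.
mldb.put("/v1/functions/my_tfidf", {
    "type": "tfidf",
    "params": {
        # Assumed to be the same URL the training procedure wrote to.
        "modelFileUrl": "file://models/my_tfidf.idf"
    }
})
```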