The TF-IDF function measures how relevant each word is to a document by combining the term frequency (TF), i.e. how often the term occurs in the document, with the inverse document frequency (IDF), i.e. a weight derived from how many documents in a reference corpus contain the term.
A new function of type tfidf named <id> can be created as follows:
mldb.put("/v1/functions/"+<id>, {
    "type": "tfidf",
    "params": {
        "modelFileUrl": <Url>,
        "tfType": <TFType>,
        "idfType": <IDFType>
    }
})
with the following key-value definitions for params:
Field, Type, Default | Description |
---|---|
modelFileUrl | URL of the model file (with extension '.idf') to load. This file is created by the tfidf.train procedure. |
tfType | Type of TF scoring |
idfType | Type of IDF scoring |
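As an illustration of how the placeholders might be filled in, the sketch below builds the request body as a plain Python dictionary and checks the scheme names against the ones listed in this document. The helper name `tfidf_function_config`, the function id `my_tfidf` and the model URL are hypothetical; the final call is left as a comment because it needs a live MLDB connection.

```python
import json

# Scheme names as listed in the TF and IDF tables of this document.
TF_TYPES = {"raw", "log", "augmented"}
IDF_TYPES = {"unary", "inverse", "inverseSmooth", "inverseMax",
             "probabilisticInverse"}


def tfidf_function_config(model_file_url, tf_type, idf_type):
    """Hypothetical helper: build the body for a tfidf function."""
    if tf_type not in TF_TYPES:
        raise ValueError("unknown tfType: %r" % (tf_type,))
    if idf_type not in IDF_TYPES:
        raise ValueError("unknown idfType: %r" % (idf_type,))
    return {
        "type": "tfidf",
        "params": {
            "modelFileUrl": model_file_url,
            "tfType": tf_type,
            "idfType": idf_type,
        },
    }


# The URL is a made-up example; it should point at a trained '.idf' file.
config = tfidf_function_config("file://models/example.idf",
                               "log", "inverseSmooth")
print(json.dumps(config, indent=4))

# With an MLDB connection available as `mldb`, the body would be sent as:
# mldb.put("/v1/functions/my_tfidf", config)
```

Validating the scheme names locally is only a convenience for catching typos before the request is sent.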
Given a term \( t \), a corpus \( D \), a document \( d \in D \) and the term frequency denoted by \( f_{t,d} \), the TF weighting schemes are:
weighting scheme | description | TF weight |
---|---|---|
raw | the term frequency in the document | \( f_{t,d} \) |
log | the logarithm of one plus the term frequency in the document | \( \log(1 + f_{t,d}) \) |
augmented | 0.5 plus half the ratio of the term frequency to the maximum frequency of any term in the document | \( 0.5 + 0.5 \cdot \frac { f_{t,d} }{\max_{\{t' \in d\}} {f_{t',d}}} \) |
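The three TF formulas can be sketched in a few lines of Python. This is an illustration of the table only, not MLDB's implementation: the function name `tf_weight` and the sample counts are hypothetical, and the natural logarithm is assumed since the table does not state a base.

```python
import math


def tf_weight(counts, term, scheme="raw"):
    """Return the TF weight of `term` in a document.

    `counts` maps each term of the document to its raw count f_{t,d}.
    """
    f = counts.get(term, 0)
    if scheme == "raw":
        return float(f)
    if scheme == "log":
        # Natural logarithm assumed.
        return math.log(1 + f)
    if scheme == "augmented":
        # Requires a non-empty document, otherwise max() has no value.
        return 0.5 + 0.5 * f / max(counts.values())
    raise ValueError("unknown TF scheme: %r" % (scheme,))


# Made-up document: term -> number of occurrences.
doc = {"the": 4, "cat": 2, "sat": 1}
for scheme in ("raw", "log", "augmented"):
    print(scheme, tf_weight(doc, "cat", scheme))
```

For the term "cat" the augmented weight is 0.5 + 0.5 · 2/4 = 0.75, which shows how this scheme keeps every weight between 0.5 and 1 regardless of document length.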
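Given \( N \), the number of documents in the corpus, and \( n_t \), the number of documents in which the term \( t \) appears, the IDF weighting schemes are: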
weighting scheme | description | IDF weight |
---|---|---|
unary | unary IDF score, i.e. don't use IDF | \( 1 \) |
inverse | the logarithm of the number of documents in the corpus divided by one plus the number of documents the term appears in (this leads to negative scores for terms appearing in all documents in the corpus) | \( \log \left( \frac {N}{1 + n_t } \right) \) |
inverseSmooth | similar to inverse but with 1 added to the argument of the logarithm (this ensures that the score is never negative) | \( \log \left( 1 + \frac {N}{1 + n_t } \right) \) |
inverseMax | similar to inverseSmooth but replacing \( N \) with the largest document count \( n_{t'} \) among the terms of the document | \( \log \left(1 + \frac {\max_{\{t' \in d\}} n_{t'}} {1 + n_t}\right) \) |
probabilisticInverse | similar to inverse but subtracting the number of documents the term appears in from the total number of documents in the training corpus (this can lead to positive and negative scores) | \( \log \left( \frac {N - n_t} {1 + n_t} \right) \) |
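The IDF formulas in the table can be sketched the same way. Here \( N \) is taken to be the number of documents in the corpus and \( n_t \) the number of documents containing the term, as the descriptions in the table suggest. The function name `idf_weight` and its `max_n` argument are hypothetical, the natural logarithm is assumed, and this is not MLDB's implementation.

```python
import math


def idf_weight(n_t, N, scheme="inverseSmooth", max_n=None):
    """Return the IDF weight for a term found in n_t of N documents.

    `max_n` is the largest document count among the terms of the
    document being scored; only the inverseMax scheme uses it.
    """
    if scheme == "unary":
        return 1.0
    if scheme == "inverse":
        return math.log(N / (1 + n_t))
    if scheme == "inverseSmooth":
        return math.log(1 + N / (1 + n_t))
    if scheme == "inverseMax":
        if max_n is None:
            raise ValueError("inverseMax needs max_n")
        return math.log(1 + max_n / (1 + n_t))
    if scheme == "probabilisticInverse":
        # Undefined when n_t == N: math.log(0) raises ValueError.
        return math.log((N - n_t) / (1 + n_t))
    raise ValueError("unknown IDF scheme: %r" % (scheme,))


# A term present in every document of a made-up 100-document corpus.
print(idf_weight(100, 100, "inverse"))        # negative
print(idf_weight(100, 100, "inverseSmooth"))  # positive
```

The two printed values illustrate the remarks in the table: with inverse, a term found in every document gets \( \log(100/101) < 0 \), while inverseSmooth keeps the score positive.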
The function takes a single input named input that contains the list of words to be evaluated, with the column name being the term and the value being the number of times the word is present in the document. This can be prepared using the tokenize function or any other method.
The function returns a single output named output that contains the combined TF-IDF score for each term in the input.
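To make the input and output shapes concrete, the sketch below scores a small bag of words. It assumes the combined score is the product of the TF and IDF weights, which is the usual definition of TF-IDF but is not spelled out here. It uses the log TF scheme and the inverseSmooth IDF scheme with natural logarithms, and treats a term missing from the model as appearing in zero documents. The function name `tfidf_scores` and all counts are hypothetical.

```python
import math


def tfidf_scores(counts, doc_freq, N):
    """Score each term as log TF times inverseSmooth IDF.

    counts   -- term -> occurrences in the document (the `input` row)
    doc_freq -- term -> number of corpus documents containing the term
    N        -- number of documents in the corpus
    """
    scores = {}
    for term, f in counts.items():
        tf = math.log(1 + f)
        # Assumption: unseen terms get a document count of 0.
        idf = math.log(1 + N / (1 + doc_freq.get(term, 0)))
        scores[term] = tf * idf
    return scores


# Made-up input row and corpus statistics.
counts = {"the": 3, "tokenize": 3}
doc_freq = {"the": 1000, "tokenize": 9}
scores = tfidf_scores(counts, doc_freq, N=1000)
for term in sorted(scores, key=scores.get, reverse=True):
    print(term, scores[term])
```

Both terms occur three times in the document, yet "tokenize" ranks first because it appears in far fewer corpus documents than "the".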
The tfidf.train procedure type produces the model file (with extension '.idf') used by this function.