The following function types are used to apply a stemming algorithm to column names or strings.
There are two function types available, depending on the input to the function: stemmer for words and stemmerdoc for documents. From an NLP perspective, a string made up of many words in a cell is a document. Another way to represent a document is to transform it into a bag of words, which is easily done with the tokenize function. Each word then becomes a column.
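To make the two representations concrete, here is a minimal Python sketch that does not call MLDB. The helper `bag_of_words` is a hypothetical stand-in for what the tokenize function produces when splitting on spaces; it only illustrates how each word becomes a column holding a count.

```python
from collections import Counter

def bag_of_words(document, split_char=" "):
    """Hypothetical stand-in for tokenize: count each word in a document."""
    # Drop empty tokens produced by repeated separators.
    tokens = [tok for tok in document.split(split_char) if tok]
    return dict(Counter(tokens))

# Document representation: one string held in a single cell.
document = "I have liked having carrots"

# Bag-of-words representation: one column per word, holding its count.
bag = bag_of_words(document)
print(bag)
```

The first form is what a stemmerdoc function expects, and the second is what a stemmer function expects.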
Both representations can be useful for solving NLP problems. Use the stemming function type that matches the data representation: stemmerdoc for documents, and stemmer for a bag of words (for example, the output of the tokenize built-in function).
These functions are a wrapper around the Snowball open-source stemming library. For more information about Snowball: http://snowball.tartarus.org/index.php
A function of type stemmer creates a stemmer that can be used on column names.
A new function of type stemmer named <id> can be created as follows:
    mldb.put("/v1/functions/"+<id>, {
        "type": "stemmer",
        "params": {
            "language": <string>
        }
    })
with the following key-value definitions for params:
Field, Type, Default | Description |
---|---|
language, string | Stemming algorithm to use |
The following values are valid for the language configuration field: danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, porter, portuguese, romanian, russian, spanish, swedish, turkish.
More information about the stemming algorithms is available on the Snowball page.
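As a minimal sketch of how the configuration could be checked before it is sent, the following Python helper builds the request body and rejects language names that are not in the list above. The helper name `stemmer_config` is hypothetical and is not part of MLDB, and the exact, case-sensitive match on language names is an assumption.

```python
# Language names copied from the list of valid values above.
VALID_LANGUAGES = {
    "danish", "dutch", "english", "finnish", "french", "german",
    "hungarian", "italian", "norwegian", "porter", "portuguese",
    "romanian", "russian", "spanish", "swedish", "turkish",
}

def stemmer_config(language, function_type="stemmer"):
    """Hypothetical helper: build the JSON body for a stemming function."""
    if function_type not in ("stemmer", "stemmerdoc"):
        raise ValueError("unknown function type: %r" % (function_type,))
    # Assumption: names must match the documented list exactly.
    if language not in VALID_LANGUAGES:
        raise ValueError("unsupported language: %r" % (language,))
    return {"type": function_type, "params": {"language": language}}

config = stemmer_config("french")
print(config)
```

The resulting dictionary has the same shape as the body passed to mldb.put in the example above.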
Functions of this type have a single input value named words, which is a row, and a single output value also named words.
After having created the following stemmer:
    mldb.put("/v1/functions/my_stemmer", {
        "type": "stemmer",
        "params": {
            "language": "english"
        }
    })
and this dataset:
rowName | potato | potatoes | carrot | carrots |
---|---|---|---|---|
row_0 | 1 | 2 | 3 | None |
row_1 | 'crips' | 'chips' | 0 | 'hi mom' |
The following query will merge the stemmed columns, summing their content:
    SELECT my_stemmer({words:{*}})[words] as * FROM our_dataset
and return:
rowName | potato | carrot |
---|---|---|
row_0 | 3 | 3 |
row_1 | 2 | 1 |
Note that strings are coerced to the integer value 1.
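The merging behaviour shown in the tables above can be approximated with the following Python sketch. It does not use MLDB or Snowball: `TOY_STEMS` is a hypothetical lookup that only covers the column names in this example, and skipping missing values is an assumption inferred from the row_0 result (3 plus None gives 3).

```python
# Toy lookup standing in for the English Snowball stemmer; it only
# covers the column names used in the example dataset.
TOY_STEMS = {"potatoes": "potato", "carrots": "carrot"}

def toy_stem(word):
    return TOY_STEMS.get(word, word)

def merge_stemmed(row, stem=toy_stem):
    """Sum the values of columns whose names share a stem."""
    merged = {}
    for column, value in row.items():
        if value is None:
            continue  # assumption: missing values contribute nothing
        if isinstance(value, str):
            value = 1  # strings are coerced to the integer 1
        key = stem(column)
        merged[key] = merged.get(key, 0) + value
    return merged

row_0 = {"potato": 1, "potatoes": 2, "carrot": 3, "carrots": None}
row_1 = {"potato": "crips", "potatoes": "chips", "carrot": 0, "carrots": "hi mom"}

print(merge_stemmed(row_0))  # {'potato': 3, 'carrot': 3}
print(merge_stemmed(row_1))  # {'potato': 2, 'carrot': 1}
```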
The stemmer also works well in conjunction with the tokenize function:
    SELECT my_stemmer({words: {tokenize('I have liked having carrots', {splitChars:' '}) as *}}) as *
This returns:
words.I | words.carrot | words.have | words.like |
---|---|---|---|
1 | 1 | 2 | 1 |
A function of type stemmerdoc creates a stemmer that can be used on whole strings. It works similarly to the stemmer function type and takes the same configuration, but its input and output formats differ: it stems each word in the string, using spaces as the separator.
Functions of this type have a single input value named document, which is a string, and a single output value named stemmed document.
After having created the following stemmer:
    mldb.put("/v1/functions/my_stemmer", {
        "type": "stemmerdoc",
        "params": {
            "language": "english"
        }
    })
we can apply the stemming algorithm on the provided string:
    SELECT my_stemmer({document: 'I like having carrots'})
This returns:
"I like have carrot"
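The word-by-word behaviour on a document can be sketched in Python as follows. This is an illustration only: `TOY_STEMS` is a hypothetical lookup standing in for the English Snowball stemmer and covers just the words in this example, and the function name `stem_document` is not part of MLDB.

```python
# Toy lookup standing in for the English Snowball stemmer; words not
# listed here are returned unchanged.
TOY_STEMS = {"having": "have", "carrots": "carrot", "liked": "like"}

def stem_document(document, stems=TOY_STEMS):
    """Stem each space-separated word and rejoin the words with spaces."""
    return " ".join(stems.get(word, word) for word in document.split(" "))

print(stem_document("I like having carrots"))  # I like have carrot
```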