Stemming Functions

The following function types are used to apply a stemming algorithm to column names or strings.

There are two function types available, depending on the input to the function: stemmer for words and stemmerdoc for documents. From an NLP perspective, a string made up of many words in a cell is a document. Another way to represent a document is to transform it into a bag of words, which can be easily accomplished using the tokenize function. Each word then becomes a column.

Both representations can be useful to solve NLP problems. Use the stemming function type that matches the data representation: stemmer for a bag of words (one word per column) and stemmerdoc for a document held in a single string.

These functions are wrappers around the Snowball open-source stemming library. For more information about Snowball: http://snowball.tartarus.org/index.php


Stemmer Function

A function of this type creates a stemmer that can be used on column names.

Configuration

A new function of type stemmer named <id> can be created as follows:

mldb.put("/v1/functions/"+<id>, {
    "type": "stemmer",
    "params": {
        "language": <string>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description
language, string, "english", Stemming algorithm to use

The following values are valid for the language configuration field: danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, porter, portuguese, romanian, russian, spanish, swedish, turkish.
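
As a minimal sketch, the configuration body can be built and checked against the language list above before it is sent to MLDB. The helper name stemmer_config and the constant VALID_LANGUAGES are hypothetical names introduced for illustration; they are not part of MLDB.

```python
# Languages listed as valid for the "language" field in this section.
VALID_LANGUAGES = {
    "danish", "dutch", "english", "finnish", "french", "german",
    "hungarian", "italian", "norwegian", "porter", "portuguese",
    "romanian", "russian", "spanish", "swedish", "turkish",
}


def stemmer_config(language="english", function_type="stemmer"):
    """Build the JSON body for a stemmer function (hypothetical helper)."""
    if function_type not in ("stemmer", "stemmerdoc"):
        raise ValueError("unknown function type: %r" % function_type)
    if language not in VALID_LANGUAGES:
        raise ValueError("unsupported language: %r" % language)
    return {"type": function_type, "params": {"language": language}}


config = stemmer_config("french")
print(config)
# The result could then be passed to MLDB, for example:
# mldb.put("/v1/functions/my_french_stemmer", config)
```

Validating on the client side only catches typos early; MLDB remains the authority on which values it accepts.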

More information about the stemming algorithms is available on the Snowball page.

Input and Output Values

Functions of this type have a single input value named words which is a row, and a single output value also named words.

Example

After having created the following stemmer:

mldb.put("/v1/functions/my_stemmer", {
    "type": "stemmer",
    "params": {
        "language": "english"
    }
})

and this dataset:

rowName potato potatoes carrot carrots
row_0 1 2 3 None
row_1 'crips' 'chips' 0 'hi mom'

The following query will merge the stemmed columns, summing their content:

SELECT my_stemmer({words:{*}})[words] as * FROM our_dataset

and return:

rowName potato carrot
row_0 3 3
row_1 2 1

Note that strings are coerced to the integer value 1.
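
The merging behaviour in this example can be sketched in plain Python. This is an illustration only, not MLDB's implementation: toy_stem is a lookup table standing in for the Snowball stemmer, and treating a missing value (None) as contributing nothing to the sum is an assumption inferred from the result table above.

```python
# Toy stand-in for the Snowball stemmer: a lookup table covering only
# the column names used in the example (an assumption for illustration).
TOY_STEMS = {"potatoes": "potato", "carrots": "carrot"}


def toy_stem(word):
    """Return the stem from the lookup table, or the word unchanged."""
    return TOY_STEMS.get(word, word)


def coerce(value):
    """Map a cell to a number: None -> 0, strings -> 1, numbers unchanged."""
    if value is None:
        return 0
    if isinstance(value, str):
        return 1
    return value


def stem_row(row):
    """Merge columns whose names share a stem, summing their values."""
    merged = {}
    for column, value in row.items():
        stem = toy_stem(column)
        merged[stem] = merged.get(stem, 0) + coerce(value)
    return merged


dataset = {
    "row_0": {"potato": 1, "potatoes": 2, "carrot": 3, "carrots": None},
    "row_1": {"potato": "crips", "potatoes": "chips",
              "carrot": 0, "carrots": "hi mom"},
}

for name, row in dataset.items():
    print(name, stem_row(row))
```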

The stemmer also works well in conjunction with the tokenize function:

SELECT my_stemmer({words: {tokenize('I have liked having carrots', {splitChars:' '}) as *}}) as *

This returns:

words.I words.carrot words.have words.like
1 1 2 1


Stemmer on Documents Function

A function of this type creates a stemmer that can be used on whole strings. It works in a way similar to the Stemmer Function and has the same configuration, but its input and output formats are different: it takes a single string, stems each word in it using spaces as the separator, and returns the result as a single string.

Input and Output Values

Functions of this type have a single input value named document that is a string, and a single output value named stemmed document.

Example

After having created the following stemmer:

mldb.put("/v1/functions/my_stemmer", {
    "type": "stemmerdoc",
    "params": {
        "language": "english"
    }
})

we can apply the stemming algorithm on the provided string:

SELECT my_stemmer({document: 'I like having carrots'})

This returns:

"I like have carrot"

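The document-level behaviour can be sketched in plain Python as follows. This is an illustration only: toy_stem is a small lookup table standing in for the Snowball stemmer, and stem_document is a hypothetical name, not an MLDB call.

```python
# Toy stand-in for the Snowball stemmer: a lookup table covering only
# the words used in the examples (an assumption for illustration).
TOY_STEMS = {"having": "have", "carrots": "carrot", "liked": "like"}


def toy_stem(word):
    """Return the stem from the lookup table, or the word unchanged."""
    return TOY_STEMS.get(word, word)


def stem_document(document):
    """Split on spaces, stem each word, and join the stems back together."""
    return " ".join(toy_stem(word) for word in document.split(" "))


print(stem_document("I like having carrots"))
```

Unlike the bag-of-words form, this keeps word order and repeated words, so the output is still a document rather than a set of counts.
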
See also