Token Split Function

The token split function is a utility function that inserts a separator character before and after specified tokens in a string. This can be useful when preparing bag-of-words data for training. The separator is inserted on a given side of a token only if a split character is not already present there.

Configuration

A new function of type tokensplit named <id> can be created as follows:

mldb.put("/v1/functions/"+<id>, {
    "type": "tokensplit",
    "params": {
        "tokens": <InputQuery>,
        "splitChars": <string>,
        "splitCharToInsert": <string>
    }
})

with the following key-value definitions for params:

Field: tokens
Type: InputQuery
Default: (none)
Description: An SQL expression specifying the list of tokens to separate.

Field: splitChars
Type: string
Default: "<space>,"
Description: A string containing the possible split characters. Each character in the string is treated as a split character.

Field: splitCharToInsert
Type: string
Default: "<space>"
Description: The split character to insert when none of the characters in splitChars is already present.

Input and Output Values

The function takes a single input named text that contains the string to parse, and returns a single output named output that contains the input string with the split characters inserted.

Example

As an example, consider a function of type tokensplit defined this way:

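mldb.put("/v1/functions/split_smiley", {
    "type": "tokensplit",
    "params": {
        # Tokens to separate from the surrounding text
        "tokens": "select ':P', '(>_<)', ':-)'",
        # Only a space counts as an existing separator
        "splitChars": " ",
        # Character inserted when no separator is present
        "splitCharToInsert": " "
    }
})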

Given this call:

mldb.get("/v1/query", 
    q="select split_smiley({text: ':PGreat day!!! (>_<)(>_<) :P :P :P:-)'}) as x"
)

the function split_smiley adds spaces before and after the emoticons that match the list above, but leaves unchanged those that are already separated by a space. The result is:

":P Great day!!! (>_<) (>_<) :P :P :P :-)"

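The insertion rule described above can be approximated in plain Python. This is a minimal sketch, not MLDB's implementation: token_split is a hypothetical helper, and its choices (longest token matched first, no separator added at the very start or end of the string) are assumptions picked so that it reproduces the example output above.

```python
def token_split(text, tokens, split_chars=" ,", split_char_to_insert=" "):
    """Hypothetical pure-Python approximation of a tokensplit function."""
    # Assumption: try longer tokens first so a short token cannot shadow a longer one.
    ordered = sorted((t for t in tokens if t), key=len, reverse=True)
    out = ""
    i = 0
    while i < len(text):
        # Find a token that starts at position i, if any.
        match = next((t for t in ordered if text.startswith(t, i)), None)
        if match is None:
            # Ordinary character: copy it through unchanged.
            out += text[i]
            i += 1
            continue
        # Separator before the token, unless we are at the start of the
        # string or the output already ends with a split character.
        if out and out[-1] not in split_chars:
            out += split_char_to_insert
        out += match
        i += len(match)
        # Separator after the token, unless we are at the end of the
        # string or the next character is already a split character.
        if i < len(text) and text[i] not in split_chars:
            out += split_char_to_insert
    return out


smileys = [":P", "(>_<)", ":-)"]
result = token_split(":PGreat day!!! (>_<)(>_<) :P :P :P:-)", smileys, " ", " ")
print(result)
```

Checking the output's last character, rather than the previous input character, is what prevents a double space between two adjacent tokens such as (>_<)(>_<).
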
See also