Token Split Function

The token split function is a utility function that inserts a separator character before and after specified tokens in a string. This can be useful when preparing bag-of-words data for training. The separator is added on a given side of a token only if no split character is already present there.


A new function of type tokensplit named <id> can be created as follows:

mldb.put("/v1/functions/"+<id>, {
    "type": "tokensplit",
    "params": {
        "tokens": <InputQuery>,
        "splitChars": <string>,
        "splitCharToInsert": <string>
    }
})

with the following key-value definitions for params:

Field, Type: Description

tokens, InputQuery: An SQL expression specifying the list of tokens to separate.

splitChars, string: A string containing the list of possible split characters. Each character in the string is interpreted as a split character.

splitCharToInsert, string: A string containing the split character to insert if none of the characters in splitChars is already present.

Input and Output Values

The function takes a single input named text that contains the string to parse and returns a single output named output that contains the input string with the split characters inserted.


As an example, consider a function of type tokensplit defined this way:

mldb.put("/v1/functions/split_smiley", {
    "type": "tokensplit",
    "params": {
        "tokens": "select ':P', '(>_<)', ':-)'",
        "splitChars": " ",
        "splitCharToInsert": " "
    }
})

Given this call

    q="select split_smiley({text: ':PGreat day!!! (>_<)(>_<) :P :P :P:-)'}) as x"

the function split_smiley adds spaces before and after the emoticons that match the token list above, but leaves unchanged those that are already separated by a space. The result is:

":P Great day!!! (>_<) (>_<) :P :P :P :-)"

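To make the padding behaviour concrete, here is a minimal pure-Python sketch of the logic described on this page. It is not MLDB's implementation and does not call MLDB. The name token_split and its parameter names are hypothetical. The longest-match-first rule and the choice to treat the inserted character as a valid separator are assumptions made for illustration.

```python
def token_split(text, tokens, split_chars=" ", split_char_to_insert=" "):
    """Pad each token occurrence with a separator unless one is already there.

    Illustrative sketch only; the names and tie-breaking rules are assumptions.
    """
    # Characters that count as "already separated". Including the inserted
    # character itself is an assumption of this sketch.
    separators = set(split_chars) | set(split_char_to_insert)
    # Try longer tokens first so overlapping tokens resolve to the longest one.
    ordered = sorted(tokens, key=len, reverse=True)

    out = []  # list of single characters
    i = 0
    while i < len(text):
        match = next((t for t in ordered if t and text.startswith(t, i)), None)
        if match is None:
            # Ordinary character: copy it through unchanged.
            out.append(text[i])
            i += 1
            continue
        # Insert a separator before the token if the previous character is not one.
        if out and out[-1] not in separators:
            out.extend(split_char_to_insert)
        out.extend(match)
        i += len(match)
        # Insert a separator after the token if the next character is not one.
        if i < len(text) and text[i] not in separators:
            out.extend(split_char_to_insert)
    return "".join(out)


smileys = [":P", "(>_<)", ":-)"]
result = token_split(":PGreat day!!! (>_<)(>_<) :P :P :P:-)", smileys)
print(result)  # :P Great day!!! (>_<) (>_<) :P :P :P :-)

# With "," also listed as a split character, a token that touches a comma
# is treated as already separated and is left alone.
print(token_split("hi,:P,ok", [":P"], split_chars=" ,"))  # hi,:P,ok
print(token_split("hi,:P,ok", [":P"], split_chars=" "))   # hi, :P ,ok
```

Under these assumptions the first call reproduces the split_smiley output shown above, and the last two calls show how the contents of splitChars decide whether a separator is inserted.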
