The token split function is a utility function that inserts a separator character before and after specified tokens in a string. This can be useful when preparing bag-of-words data for training. The separator is inserted only where one of the split characters is not already present.
A new function of type tokensplit named <id> can be created as follows:
mldb.put("/v1/functions/"+<id>, {
"type": "tokensplit",
"params": {
"tokens": <InputQuery>,
"splitChars": <string>,
"splitCharToInsert": <string>
}
})
with the following key-value definitions for params:
Field, Type, Default | Description |
---|---|
tokens | An SQL expression specifying the list of tokens to separate. |
splitChars | A string containing the list of possible split characters. Each character in the string is interpreted as a split character. |
splitCharToInsert | A string containing the split character to insert if none of the characters in splitChars is already present. |
The function takes a single input named text that contains the string to parse and returns a single output named output that contains the input string with the split characters inserted.
As an example, consider a function of type tokensplit defined this way:
mldb.put("/v1/functions/split_smiley", {
"type": "tokensplit",
"params": {
"tokens": "select ':P', '(>_<)', ':-)'",
"splitChars": " ",
"splitCharToInsert": " "
}
})
Given this call
mldb.get("/v1/query",
q="select split_smiley({text: ':PGreat day!!! (>_<)(>_<) :P :P :P:-)'}) as x"
)
the function split_smiley adds spaces before and after the emojis matching the list above, but leaves unchanged the ones that are already separated by a space:
":P Great day!!! (>_<) (>_<) :P :P :P :-)"
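
To make the insertion rule concrete, here is a minimal pure-Python sketch that approximates the behaviour described above. It is not MLDB's implementation: the helper name token_split, its parameters, the longest-match-first ordering, and the choice not to insert a separator at the very start or end of the string are all assumptions made for illustration.

```python
def token_split(text, tokens, split_chars=" ", split_char_to_insert=" "):
    """Approximate a tokensplit-style function (illustrative sketch only)."""
    # Assumption: when several tokens match at one position, prefer the longest.
    ordered = sorted((t for t in tokens if t), key=len, reverse=True)
    result = ""
    i = 0
    while i < len(text):
        # Find the first (longest) token that starts at position i, if any.
        match = next((t for t in ordered if text.startswith(t, i)), None)
        if match is None:
            # Ordinary character: copy it through unchanged.
            result += text[i]
            i += 1
            continue
        # Insert a separator before the token unless one is already there.
        # Assumption: nothing is inserted at the very start of the string.
        if result and result[-1] not in split_chars:
            result += split_char_to_insert
        result += match
        i += len(match)
        # Insert a separator after the token unless one already follows.
        # Assumption: nothing is inserted at the very end of the string.
        if i < len(text) and text[i] not in split_chars:
            result += split_char_to_insert
    return result


smileys = [":P", "(>_<)", ":-)"]
print(token_split(":PGreat day!!! (>_<)(>_<) :P :P :P:-)", smileys))
```

Under these assumptions, the sketch applied to the split_smiley input produces the same string as shown above. Edge cases such as overlapping tokens or multi-character separators may be handled differently by MLDB itself.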