The k-means function type takes the output dataset of the k-means procedure and applies it to new data in order to assign it to clusters.
A new function of type kmeans named <id> can be created as follows:
mldb.put("/v1/functions/"+<id>, {
"type": "kmeans",
"params": {
"modelFileUrl": <Url>
}
})with the following key-value definitions for params:
| Field, Type, Default | Description |
|---|---|
modelFileUrl | URL of the model file (with extension '.kms') to load. This file is created by the |
Functions of this type load their internal state from a dataset, which is identified
with the centroids parameter, which lists one cluster per row, with the columns
providing the coordinates of the centroid of the cluster and the row name being
the name of the cluster (this is the output format of the kmeans.train procedure type).
The select parameter
tells the system how to extract an embedding from the row, and the where
parameter allows only a subset of the clusters to be loaded by the K-Means
function. See the example below for more details.
In the application of the function, the same features as in the centroids
dataset are extracted from the inputs of the function to create a coordinate
vector for the input. The distance from that input to each of the centroids
is then calculated using the metric specified in the configuration, and the
function outputs the value cluster containing the name of cluster that is
the closest to the given centroid.
Functions of this type have a single input called embedding which is a row. The columns that
are expected in this row are the same as the columns in the centroids dataset with
which this function is configured.
These functions have a single output value called cluster, which is the name of the row
in the centroids dataset whose columns describe the point which is closest to the
input according to the metric specified.
As a concrete example, a K-Means function with three clusters, good, evil and undecided
and two input dimensions, happiness and malice, could be represented
as follows:
| Cluster | Happiness | Malice |
|---|---|---|
| Good | 1 | 0 |
| Evil | 0 | 1 |
| Undecided | 0.5 | 0.5 |
Which represents the following situation:
In that case, select would be set to * (the default) and where to true
(the default) in order to enable all clusters. Alternatively, if we wanted to
ignore the undecided class, we could load the K-Means function with where as
rowName() != 'undecided' to get only the good and evil clusters. And to
ignore the Happiness dimension, the K-Means function could be configured with
select as * EXCLUDING (Happiness).
kmeans.train procedure type trains a k-means function.