K-Means Training Procedure

This procedure trains a k-means clustering model. It stores the resulting model (i.e. the cluster centroids) in one output dataset and the cluster labels for the rows of its input dataset in a separate output dataset.

Configuration

A new procedure of type kmeans.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "kmeans.train",
    "params": {
        "trainingData": <InputQuery>,
        "outputDataset": <OutputDatasetSpec (Optional)>,
        "centroidsDataset": <OutputDatasetSpec (Optional)>,
        "numInputDimensions": <int>,
        "numClusters": <int>,
        "maxIterations": <int>,
        "metric": <MetricSpace>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

trainingData
InputQuery

Specification of the data for input to the k-means procedure. This should be organized as an embedding, with each selected row containing the same set of columns with numeric values to be used as coordinates. The select statement does not support GROUP BY and HAVING clauses.

outputDataset
OutputDatasetSpec (Optional)
{"type":"embedding"}

Dataset for cluster assignment. This dataset will contain the same row names as the input dataset, but the coordinates will be replaced by a single column giving the cluster number that the row was assigned to.

centroidsDataset
OutputDatasetSpec (Optional)
{"type":"embedding"}

Dataset in which the centroids will be recorded. This dataset will have the same coordinates (columns) as those selected from the input dataset, but will have one row per cluster, providing the centroid of the cluster.

numInputDimensions
int
-1

Number of dimensions from the input to use (-1 = all). This limits the number of columns used. Columns will be ordered alphabetically and the lowest ones kept.

numClusters
int
10

Number of clusters to create. This will provide the total number of centroids created. There must be at least as many rows selected as clusters.

maxIterations
int
100

Maximum number of iterations to perform. If convergence is not reached within this number of iterations, the current clustering will be returned.

metric
MetricSpace
"cosine"

Metric space in which the k-means distances will be calculated. Normally this will be cosine for an orthonormal basis and Euclidean for any other basis.

modelFileUrl
Url

URL where the model file (with extension '.kms') should be saved. This file can be loaded by the kmeans function type. This parameter is optional unless the functionName parameter is used.

functionName
string

If specified, an instance of the kmeans function type of this name will be created using the trained model. Note that to use this parameter, the modelFileUrl must also be provided.

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
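
As an illustration, a filled-in configuration might look like the sketch below. Only the parameter keys come from the table above. The procedure id, dataset ids, query, model file path and function name are all hypothetical, and passing the query as a SQL string and the dataset specs as id/type objects are assumptions. The `mldb` object is assumed to be the one supplied by the MLDB Python environment, so the call itself is left commented out.

```python
# Hypothetical names throughout; only the parameter keys come from the table.
procedure_id = "my_kmeans"  # hypothetical procedure id

config = {
    "type": "kmeans.train",
    "params": {
        # Assumption: the InputQuery is given as a SQL string.
        "trainingData": "select x, y from points",
        # Assumption: an OutputDatasetSpec is an object with an id and a type.
        "outputDataset": {"id": "points_clusters", "type": "embedding"},
        "centroidsDataset": {"id": "points_centroids", "type": "embedding"},
        "numClusters": 2,
        "maxIterations": 100,
        "metric": "euclidean",
        "modelFileUrl": "file://models/points.kms",  # hypothetical path
        "functionName": "points_kmeans",             # hypothetical name
        "runOnCreation": True,
    },
}

params = config["params"]

# The table states that functionName can only be used with modelFileUrl.
if "functionName" in params and "modelFileUrl" not in params:
    raise ValueError("functionName requires modelFileUrl to be set")

# Inside MLDB, the procedure would then be created with:
# mldb.put("/v1/procedures/" + procedure_id, config)
```

Checking the functionName/modelFileUrl dependency before sending the request is optional; it only surfaces the documented constraint earlier.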

Enumeration MetricSpace

Value, Description
none

No metric is chosen. This will cause an error.

euclidean

Use Euclidean distance as the metric. This is a good choice for geometric embeddings such as those produced by the t-SNE algorithm.

cosine

Use cosine distance as the metric. This is a good choice for normalized and high-dimensional embeddings such as those produced by the SVD.
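
To make the difference between the two metrics concrete, here is a minimal sketch of both distances in plain Python. It assumes the common definition of cosine distance as one minus the cosine similarity; the exact formula used internally by the procedure is not stated here, so treat this as illustrative only. The function names are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return 1.0 - dot / (norm_a * norm_b)

# (2, 8) points in the same direction as (1, 4) but is twice as long:
# the Euclidean distance is sqrt(17), while the cosine distance is
# zero up to floating-point rounding.
print(euclidean_distance((1, 4), (2, 8)))
print(cosine_distance((1, 4), (2, 8)))
```

Cosine distance ignores vector length and compares direction only, which is why it suits normalized embeddings, whereas Euclidean distance is sensitive to both.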

Training

The k-means procedure is used to take a set of points, each of which is characterized by its coordinates in an embedding space, and group them such that each one belongs to one cluster. The clusters are described by a single point that is the cluster centroid.

The input dataset has one row per point, with the coordinates in the columns. Every row must have the same set of numeric coordinates.

As an example, the following input would be suitable for the k-means algorithm:

rowName x y
row1 1 4
row2 1 3
row3 3 1
row4 4 1

Using the k-means procedure with the Euclidean metric to create two clusters (i.e. with metric set to "euclidean" and numClusters set to 2), the output would be the following centroids:

_rowName x y
"0" 1 3.5
"1" 3.5 1

To classify a new point using the output of the procedure, calculate the distance from the point to each of the cluster centroids and assign the point to the cluster whose centroid is nearest.
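
The training and classification steps described in this section can be sketched in plain Python. This is a textbook Lloyd's iteration with Euclidean distance, not the procedure's actual implementation. Seeding the centroids with the first `num_clusters` points is an assumption made to keep the run deterministic, so cluster numbering from the real procedure may differ. The function names `kmeans` and `classify` are hypothetical.

```python
import math

def classify(point, centroids):
    """Return the index of the nearest centroid (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(point, centroids[c]))

def kmeans(points, num_clusters, max_iterations=100):
    """Plain Lloyd's algorithm; returns (centroids, assignments)."""
    # Assumption: the first num_clusters points seed the centroids.
    centroids = [tuple(float(v) for v in p) for p in points[:num_clusters]]
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [classify(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # no point changed cluster: converged
        assignments = new_assignments
        for c in range(num_clusters):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, assignments

# row1..row4 from the example input in this section
points = [(1, 4), (1, 3), (3, 1), (4, 1)]
centroids, assignments = kmeans(points, num_clusters=2)
print(centroids)    # [(1.0, 3.5), (3.5, 1.0)]
print(assignments)  # [0, 0, 1, 1]

# New points are assigned to the cluster with the nearest centroid.
print(classify((0, 5), centroids))  # 0
print(classify((5, 0), centroids))  # 1
```

With this seeding, the sketch reaches the same two centroids as the example output after two updates, and the third pass confirms that no assignment changes.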
