# K-Means Training Procedure

This procedure trains a K-means clustering model and stores the resulting model (i.e. the cluster centroids) in one output dataset, and the cluster labels for its input dataset in a separate output dataset.

## Configuration

A new procedure of type `kmeans.train` named `<id>` can be created as follows:

```python
mldb.put("/v1/procedures/" + <id>, {
    "type": "kmeans.train",
    "params": {
        "trainingData": <InputQuery>,
        "outputDataset": <OutputDatasetSpec (Optional)>,
        "centroidsDataset": <OutputDatasetSpec (Optional)>,
        "numInputDimensions": <int>,
        "numClusters": <int>,
        "maxIterations": <int>,
        "metric": <MetricSpace>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})
```

with the following key-value definitions for `params`:

| Field | Type | Default | Description |
|---|---|---|---|
| `trainingData` | InputQuery | | Specification of the data for input to the k-means procedure. This should be organized as an embedding, with each selected row containing the same set of columns with numeric values to be used as coordinates. The select statement does not support `GROUP BY` and `HAVING` clauses. |
| `outputDataset` | OutputDatasetSpec (Optional) | `{"type": "embedding"}` | Dataset for cluster assignment. This dataset will contain the same row names as the input dataset, but the coordinates will be replaced by a single column giving the cluster number that the row was assigned to. |
| `centroidsDataset` | OutputDatasetSpec (Optional) | `{"type": "embedding"}` | Dataset in which the centroids will be recorded. This dataset will have the same coordinates (columns) as those selected from the input dataset, but will have one row per cluster, giving the centroid of that cluster. |
| `numInputDimensions` | int | `-1` | Number of dimensions from the input to use (`-1` = all). This limits the number of columns used. Columns will be ordered alphabetically and the lowest ones kept. |
| `numClusters` | int | `10` | Number of clusters to create. This determines the total number of centroids created. There must be at least as many rows selected as clusters. |
| `maxIterations` | int | `100` | Maximum number of iterations to perform. If convergence is not reached within this number of iterations, the current clustering will be returned. |
| `metric` | MetricSpace | `"cosine"` | Metric space in which the k-means distances will be calculated. Normally this will be `cosine` for an orthonormal basis, and `euclidean` otherwise. |
| `modelFileUrl` | Url | | URL where the model file (with extension `.kms`) should be saved. This file can be loaded by the `kmeans` function type. This parameter is optional unless the `functionName` parameter is used. |
| `functionName` | string | | If specified, an instance of the `kmeans` function type of this name will be created using the trained model. Note that to use this parameter, `modelFileUrl` must also be provided. |
| `runOnCreation` | bool | `true` | If `true`, the procedure will be run immediately. The response will contain an extra field called `firstRun` pointing to the URL of the run. |
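As an illustration, and assuming a dataset named `points` with numeric columns `x` and `y` already exists (the dataset names, model path, and function name below are hypothetical), a concrete configuration might look like:

```python
mldb.put("/v1/procedures/kmeans_example", {
    "type": "kmeans.train",
    "params": {
        "trainingData": "SELECT x, y FROM points",
        "outputDataset": "points_clusters",      # cluster assignment per input row
        "centroidsDataset": "points_centroids",  # one row per centroid
        "numClusters": 2,
        "metric": "euclidean",
        "modelFileUrl": "file://kmeans_example.kms",
        "functionName": "kmeans_example_fn",
        "runOnCreation": True
    }
})
```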

#### Enumeration MetricSpace

| Value | Description |
|---|---|
| `none` | No metric is chosen. This will cause an error. |
| `euclidean` | Use Euclidean distance for the metric. This is a good choice for geometric embeddings such as those produced by the t-SNE algorithm. |
| `cosine` | Use cosine distance for the metric. This is a good choice for normalized, high-dimensional embeddings such as those produced by the SVD. |
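The difference between the two metrics can be sketched in plain Python (a minimal illustration, not MLDB's implementation): Euclidean distance depends on magnitude, while cosine distance depends only on direction.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two coordinate vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity: depends only on direction, not magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Scaling a vector changes its Euclidean distance to another vector,
# but leaves its cosine distance unchanged.
a, b = [1.0, 0.0], [0.0, 1.0]
print(euclidean_distance(a, b))        # sqrt(2) ≈ 1.414
print(cosine_distance(a, b))           # 1.0
print(cosine_distance(a, [0.0, 5.0]))  # still 1.0
```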

## Training

The k-means procedure takes a set of points, each characterized by its coordinates in an embedding space, and groups them such that each point belongs to exactly one cluster. Each cluster is described by a single point, its centroid.

The input dataset has one row per point, with the coordinates in the columns. Each row must contain the same set of numeric coordinates.

As an example, the following input would be suitable for the k-means algorithm:

| rowName | x | y |
|---|---|---|
| row1 | 1 | 4 |
| row2 | 1 | 3 |
| row3 | 3 | 1 |
| row4 | 4 | 1 |

Using the k-means procedure with the Euclidean metric to create two clusters (i.e. with `metric` set to `"euclidean"` and `numClusters` set to `2`), the output would be the centroids:

| _rowName | x | y |
|---|---|---|
| "0" | 1 | 3.5 |
| "1" | 3.5 | 1 |

Using the output of the procedure to classify new points is done by calculating the distance from the point to each of the cluster centroids, and then assigning the point to the cluster with the shortest distance.
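In MLDB itself this classification is the role of the `kmeans` function type; the sketch below just illustrates the nearest-centroid rule in plain Python, assuming the centroids from the example above.

```python
import math

# Centroids produced by the example above.
centroids = {"0": (1.0, 3.5), "1": (3.5, 1.0)}

def classify(point):
    # Assign the point to the cluster whose centroid is nearest (Euclidean).
    return min(centroids, key=lambda c: math.dist(point, centroids[c]))

print(classify((2, 3)))  # "0": closer to (1.0, 3.5)
print(classify((4, 2)))  # "1": closer to (3.5, 1.0)
```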