This procedure trains a K-means clustering model and stores the resulting model (i.e. the cluster centroids) in an output dataset. It also stores the cluster labels for its input dataset in a separate output dataset.
A new procedure of type kmeans.train named <id> can be created as follows:
mldb.put("/v1/procedures/"+<id>, {
    "type": "kmeans.train",
    "params": {
        "trainingData": <InputQuery>,
        "outputDataset": <OutputDatasetSpec (Optional)>,
        "centroidsDataset": <OutputDatasetSpec (Optional)>,
        "numInputDimensions": <int>,
        "numClusters": <int>,
        "maxIterations": <int>,
        "metric": <MetricSpace>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})
with the following key-value definitions for params:
Field, Type, Default | Description |
---|---|
trainingData | Specification of the data for input to the k-means procedure. This should be organized as an embedding, with each selected row containing the same set of columns with numeric values to be used as coordinates. The select statement does not support GROUP BY and HAVING clauses. |
outputDataset | Dataset for cluster assignment. This dataset will contain the same row names as the input dataset, but the coordinates will be replaced by a single column giving the cluster number that the row was assigned to. |
centroidsDataset | Dataset in which the centroids will be recorded. This dataset will have the same coordinates (columns) as those selected from the input dataset, but will have one row per cluster, providing the centroid of the cluster. |
numInputDimensions | Number of dimensions from the input to use (-1 = all). This limits the number of columns used. Columns will be ordered alphabetically and the lowest ones kept. |
numClusters | Number of clusters to create. This will provide the total number of centroids created. There must be at least as many rows selected as clusters. |
maxIterations | Maximum number of iterations to perform. If no convergence is reached within this number of iterations, the current clustering will be returned. |
metric | Metric space in which the k-means distances will be calculated. Normally this will be cosine for an orthonormal basis, and euclidean for another basis. |
modelFileUrl | URL where the model file (with extension '.kms') should be saved. This file can be loaded by the |
functionName | If specified, an instance of the |
runOnCreation | If true, the procedure will be run immediately. The response will contain an extra field called |
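
As an illustration of how these fields fit together, the following is a minimal sketch of a filled-in configuration. The procedure id `example_kmeans` and the dataset names `points`, `points_clusters` and `points_centroids` are hypothetical, the value of `maxIterations` is arbitrary, and passing the query and the output datasets as plain strings is an assumption about the accepted shorthand. `modelFileUrl` and `functionName` are left out. The `mldb.put` call is commented out because it only works inside an MLDB Python environment.

```python
import json

# Hypothetical procedure id, chosen only for this example.
procedure_id = "example_kmeans"

procedure_config = {
    "type": "kmeans.train",
    "params": {
        # Hypothetical input dataset "points" with numeric columns x and y.
        "trainingData": "select x, y from points",
        # Hypothetical output dataset names (string shorthand is assumed).
        "outputDataset": "points_clusters",
        "centroidsDataset": "points_centroids",
        "numInputDimensions": -1,   # -1 = use all selected columns
        "numClusters": 2,
        "maxIterations": 100,       # arbitrary illustrative value
        "metric": "euclidean",
        "runOnCreation": True,
    },
}

# Show the JSON body that would be sent to MLDB.
print(json.dumps(procedure_config, indent=2))

# Inside an MLDB Python environment, the procedure would then be created with:
# mldb.put("/v1/procedures/" + procedure_id, procedure_config)
```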
MetricSpace
Value | Description |
---|---|
none | No metric is chosen. This will cause an error. |
euclidean | Use Euclidean distance as the metric. This is a good choice for geometric embeddings such as those produced by the t-SNE algorithm. |
cosine | Use cosine distance as the metric. This is a good choice for normalized and high-dimensional embeddings such as those produced by the SVD. |
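
To make the difference between the two metrics concrete, here is a small sketch using the textbook definitions: straight-line distance for Euclidean, and one minus the cosine similarity for cosine. The exact formulas MLDB uses internally are not given here, so treat these functions (and their names) as assumptions for illustration only.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_distance(p, q):
    """One minus the cosine of the angle between p and q (assumed definition)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_p * norm_q)


# Two points in the same direction but at different distances from the origin.
a, b = (1.0, 0.0), (3.0, 0.0)
print(euclidean_distance(a, b))  # 2.0
print(cosine_distance(a, b))     # 0.0
```

Cosine distance ignores the length of the vectors and only compares their direction, which is one reason it tends to suit normalized embeddings.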
The k-means procedure is used to take a set of points, each of which is characterized by its coordinates in an embedding space, and group them such that each one belongs to one cluster. The clusters are described by a single point that is the cluster centroid.
The input dataset has one row per point, with the coordinates being in the columns. There must be the same set of numeric coordinates per row.
As an example, the following input would be suitable for the k-means algorithm:
rowName | x | y |
---|---|---|
row1 | 1 | 4 |
row2 | 1 | 3 |
row3 | 3 | 1 |
row4 | 4 | 1 |
Using the k-means procedure with the Euclidean metric to create two clusters (i.e. with metric set to euclidean and numClusters set to 2), the output would be the centroids:
_rowName | x | y |
---|---|---|
"0" | 1 | 3.5 |
"1" | 3.5 | 1 |
Using the output of the procedure to classify new points is done by calculating the distance from the point to each of the cluster centroids, and then assigning the point to the cluster with the shortest distance.
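
The worked example and the assignment rule above can be sketched in plain Python. This is not MLDB's implementation: it is a minimal version of Lloyd's algorithm with the Euclidean metric, and it assumes the first `num_clusters` rows are used as starting centroids (the initialization MLDB uses is not described here). The function names `train_kmeans` and `nearest_centroid` are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_centroid(point, centroids):
    """Index of the centroid with the shortest distance to `point`."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def train_kmeans(points, num_clusters, max_iterations=100):
    """Minimal Lloyd's algorithm; returns (centroids, labels)."""
    if len(points) < num_clusters:
        raise ValueError("need at least as many rows as clusters")
    # Assumption: seed the centroids with the first `num_clusters` rows.
    centroids = [tuple(float(v) for v in p) for p in points[:num_clusters]]
    labels = [nearest_centroid(p, centroids) for p in points]
    for _ in range(max_iterations):
        # Update step: move each centroid to the mean of its members.
        for c in range(num_clusters):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
        # Assignment step: stop once no point changes cluster.
        new_labels = [nearest_centroid(p, centroids) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


# The four rows from the example input table (x, y).
points = [(1, 4), (1, 3), (3, 1), (4, 1)]
centroids, labels = train_kmeans(points, num_clusters=2)
print(centroids)  # [(1.0, 3.5), (3.5, 1.0)]
print(labels)     # [0, 0, 1, 1]

# Classifying new points against the trained centroids.
print(nearest_centroid((0, 5), centroids))  # 0
print(nearest_centroid((5, 0), centroids))  # 1
```

With this seeding the sketch reproduces the centroids in the table above. A different initialization could number the clusters differently.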
The kmeans function type applies the centroids to new data points to determine which cluster they fit into.