The t-SNE procedure is used to visualize complex datasets as a map, and does a good job of representing the structure of high-dimensional data. This procedure trains a t-SNE model and stores the model file to disk, and/or applies the model to the input data to produce an embedded output dataset.
A new procedure of type `tsne.train` named `<id>` can be created as follows:
```
mldb.put("/v1/procedures/"+<id>, {
    "type": "tsne.train",
    "params": {
        "trainingData": <InputQuery>,
        "rowOutputDataset": <OutputDatasetSpec>,
        "numInputDimensions": <int>,
        "numOutputDimensions": <int>,
        "tolerance": <float>,
        "perplexity": <float>,
        "learningRate": <float>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})
```
with the following key-value definitions for `params`:
| Field | Type | Description |
|---|---|---|
| trainingData | InputQuery | Specification of the data for input to the t-SNE procedure. This should be organized as an embedding, with each selected row containing the same set of columns with numeric values to be used as coordinates. The select statement does not support GROUP BY and HAVING clauses. |
| rowOutputDataset | OutputDatasetSpec | Dataset for the t-SNE output, containing the embeddings of the training data. One row will be added for each row in the input dataset, with a list of coordinates. |
| numInputDimensions | int | Number of dimensions from the input to use. This limits the input to the first n columns in alphabetical order (-1 = all). |
| numOutputDimensions | int | Number of dimensions to produce in t-SNE space. Normally this will be 2 or 3, depending upon the number of dimensions in the visualization. |
| tolerance | float | Tolerance of the perplexity calculation. This is an internal parameter that only needs to be changed in rare circumstances. |
| perplexity | float | Perplexity to aim for; higher means more spread out. This controls how hard t-SNE tries to spread the points out. If the resulting output looks more like a ball or a sphere than individual clusters, you should reduce this number. If it looks like a dot or star, you should increase it. |
| learningRate | float | The learning rate specifies the gradient descent step size during optimization of the cost function. A learning rate that is too small may trap the optimization in a local minimum; one that is too high may jump over the optimal point. In general, the learning rate should be between 100 and 1000. |
| modelFileUrl | Url | URL where the model file (with extension '.tsn') should be saved. This file can later be loaded by a t-SNE embedding function to apply the model to data. |
| functionName | string | If specified, a function of this name will be created from the trained model. |
| runOnCreation | bool | If true, the procedure will be run immediately. The response will contain an extra field called `firstRun` pointing to the URL of that run. |
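As a concrete illustration, the following call trains a two-dimensional t-SNE map and runs it immediately. This is a minimal sketch: the input dataset `svd_embedding`, the output dataset `tsne_output`, the procedure name and the model file path are assumptions chosen for illustration, not names defined by MLDB, and `mldb` is the same connection object used in the template above.

```python
# A minimal sketch, assuming an existing dataset named 'svd_embedding'
# whose columns are numeric embedding coordinates.
mldb.put("/v1/procedures/tsne_viz", {
    "type": "tsne.train",
    "params": {
        "trainingData": "SELECT * FROM svd_embedding",  # hypothetical input dataset
        "rowOutputDataset": {"id": "tsne_output"},      # hypothetical output dataset
        "numOutputDimensions": 2,                       # 2-D map for plotting
        "perplexity": 30,                               # illustrative value; tune as described below
        "learningRate": 500,                            # within the suggested 100-1000 range
        "modelFileUrl": "file://tsne_viz.tsn",          # hypothetical path for the saved model
        "runOnCreation": True
    }
})
```

Once the run completes, `tsne_output` contains one row per input row with its low-dimensional coordinates, ready to be queried for plotting.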
The t-SNE procedure takes a high-dimensional embedding as input (often created by the SVD procedure) and creates a low-dimensional embedding of the same data points, typically with two or three output dimensions. The `trainingData` parameter points to a read-only dataset that is queried by the t-SNE training, and the `rowOutputDataset` parameter describes a dataset to which the output coordinates are written.
The input dataset can be filtered and modified with SELECT and WHERE clauses. The WHERE clause, in particular, may be necessary to limit the number of rows that are used and therefore bound the run-time of the algorithm. The algorithm used is Barnes-Hut SNE, which can produce maps of up to roughly 100,000 points in a reasonable run-time.
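For instance, a WHERE clause can subsample the input before training. The sketch below keeps roughly one row in ten using `rowHash()`; the dataset and column names and the sampling scheme are assumptions for illustration, and any condition that reduces the row count will do.

```python
# Hypothetical trainingData queries limiting what the procedure sees:
# keep roughly 1 row in 10 to bound the Barnes-Hut run-time
sampled_rows = "SELECT * FROM svd_embedding WHERE rowHash() % 10 = 0"

# or select only a subset of columns to use as the input coordinates
first_columns = "SELECT col0, col1, col2 FROM svd_embedding"
```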
The `perplexity` parameter requires further explanation. It controls how many neighbours each data point will try to have, and modifying its value affects the "clumpiness" of the output. For visualization purposes it is reasonable to hand-tune this parameter until a pleasing clustering is obtained.
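One way to hand-tune it is simply to re-run the training with a few candidate values and compare the resulting maps visually. The sketch below assumes the same hypothetical `svd_embedding` dataset as above; the procedure ids, output dataset names and file paths are likewise assumptions for illustration.

```python
# Re-train with several perplexity values and inspect each resulting map.
for perplexity in (10, 30, 50):
    mldb.put("/v1/procedures/tsne_p%d" % perplexity, {
        "type": "tsne.train",
        "params": {
            "trainingData": "SELECT * FROM svd_embedding",
            "rowOutputDataset": {"id": "tsne_output_p%d" % perplexity},
            "perplexity": perplexity,
            "modelFileUrl": "file://tsne_p%d.tsn" % perplexity,
            "runOnCreation": True
        }
    })
```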
The `svd.train` procedure type trains an SVD, whose output is often used as the input to t-SNE.