t-SNE Training Procedure

The t-SNE procedure is used to visualize complex datasets in a map. It does a good job of representing the structure of high-dimensional data. This procedure trains a t-SNE model, and stores the model file to disk and/or applies the model to the input data to produce an embedded output dataset.

Configuration

A new procedure of type tsne.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "tsne.train",
    "params": {
        "trainingData": <InputQuery>,
        "rowOutputDataset": <OutputDatasetSpec>,
        "numInputDimensions": <int>,
        "numOutputDimensions": <int>,
        "tolerance": <float>,
        "perplexity": <float>,
        "learningRate": <float>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

trainingData (InputQuery)
Specification of the data for input to the t-SNE procedure. This should be organized as an embedding, with each selected row containing the same set of columns with numeric values to be used as coordinates. The select statement does not support GROUP BY and HAVING clauses.

rowOutputDataset (OutputDatasetSpec, default {"type":"embedding"})
Dataset for the t-SNE output, with embeddings of the training data. One row will be added for each row in the input dataset, with a list of coordinates.

numInputDimensions (int, default -1)
Number of dimensions from the input to use. This limits the input to the first n columns in alphabetical order (-1 = all).

numOutputDimensions (int, default 2)
Number of dimensions to produce in t-SNE space. Normally this will be 2 or 3, depending upon the number of dimensions in the visualization.

tolerance (float, default 0.0)
Tolerance of the perplexity calculation. This is an internal parameter that only needs to be changed in rare circumstances.

perplexity (float, default 30)
Perplexity to aim for; higher means more spread out. This controls how hard t-SNE tries to spread the points out. If the resulting output looks more like a ball or a sphere than individual clusters, you should reduce this number. If it looks like a dot or star, you should increase it.

learningRate (float, default 500)
The learning rate specifies the gradient descent step size during optimization of the cost function. A learning rate that is too small may trap the optimization in a local minimum; one that is too high may jump past the optimum. In general, the learning rate should be between 100 and 1000.

modelFileUrl (Url)
URL where the model file (with extension '.tsn') should be saved. This file can be loaded by the tsne.embedRow function type. This parameter is optional unless the functionName parameter is used.

functionName (string)
If specified, an instance of the tsne.embedRow function type of this name will be created using the trained model. Note that to use this parameter, modelFileUrl must also be provided.

runOnCreation (bool, default true)
If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
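
Putting these together, a minimal sketch that relies mostly on the defaults above might look like the following; the procedure name tsne_minimal and the dataset names svd_embedding and tsne_output are hypothetical placeholders:

mldb.put("/v1/procedures/tsne_minimal", {
    "type": "tsne.train",
    "params": {
        # hypothetical input dataset holding a high-dimensional embedding
        "trainingData": "select * from svd_embedding",
        # hypothetical output dataset; "embedding" is the default type
        "rowOutputDataset": {"id": "tsne_output", "type": "embedding"}
    }
})

All other parameters keep their default values: two output dimensions, a perplexity of 30, and runOnCreation set to true, so the procedure runs as soon as it is created.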

The t-SNE procedure takes a high-dimensional embedding as input (often created by applying the SVD procedure) and creates a low-dimensional embedding of the same data points (typically with two or three output dimensions). The trainingData parameter points to a read-only dataset that is queried by the t-SNE training, and the rowOutputDataset parameter describes a dataset to which the coordinates are written.

The input dataset can be filtered and modified with select and where statements. The where statement, in particular, may be necessary to limit the number of rows that are used and therefore limit the run-time of the algorithm. The algorithm used is Barnes-Hut SNE, which can produce maps of up to 100,000 points or so in a reasonable run-time.
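
For instance, one hypothetical way to bound the run-time is to subsample rows in the where clause; the dataset name and the modulus test below are illustrative only:

mldb.put("/v1/procedures/tsne_sampled", {
    "type": "tsne.train",
    "params": {
        # keep roughly one row in ten, staying well under the
        # ~100,000-point practical limit of Barnes-Hut SNE
        "trainingData": "select * from svd_embedding "
                        "where rowHash() % 10 = 0",
        "rowOutputDataset": {"id": "tsne_sample_output",
                             "type": "embedding"}
    }
})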

The perplexity parameter requires further explanation. It controls how many neighbours each data point will try to have. Modifying this value affects the "clumpiness" of the data; for visualization purposes it is reasonable to hand-tune the parameter until a pleasing clustering is obtained.
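
Because this tuning is done by eye, one hedged approach is to train several maps over a range of perplexities and compare them visually; the candidate values, procedure names, and dataset names in this sketch are all hypothetical:

# Train one t-SNE map per candidate perplexity, writing each
# result to its own output dataset for side-by-side comparison.
for perplexity in [5, 30, 100]:
    mldb.put("/v1/procedures/tsne_p%d" % perplexity, {
        "type": "tsne.train",
        "params": {
            "trainingData": "select * from svd_embedding",
            "rowOutputDataset": {"id": "tsne_p%d_output" % perplexity,
                                 "type": "embedding"},
            "perplexity": perplexity,
            "runOnCreation": True
        }
    })

Lower values favour many small, tight clumps; higher values spread the points out, matching the ball-versus-dot guidance in the perplexity description above.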

Examples
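
As an end-to-end sketch, assuming a pymldb-style connection named mldb, and with all dataset, file, and function names below being hypothetical:

# 1. Train the map, save the model file, and expose it as a
#    tsne.embedRow function.
mldb.put("/v1/procedures/tsne_demo", {
    "type": "tsne.train",
    "params": {
        "trainingData": "select * from svd_embedding",
        "rowOutputDataset": {"id": "tsne_demo_output", "type": "embedding"},
        "numOutputDimensions": 2,
        "modelFileUrl": "file://models/tsne_demo.tsn",
        "functionName": "tsne_embedder",
        "runOnCreation": True
    }
})

# 2. Inspect a few of the resulting 2-D coordinates.
print(mldb.query("select * from tsne_demo_output limit 5"))

The tsne_embedder function created in step 1 can then be used to embed new rows via the tsne.embedRow function type, as described under modelFileUrl and functionName above.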

See also