t-SNE Training Procedure

The t-SNE procedure is used to visualize complex datasets in a map. It does a good job of representing the structure of high-dimensional data. This procedure trains a t-SNE model, and stores the model file to disk and/or applies the model to the input data to produce an embedded output dataset.

Configuration

A new procedure of type tsne.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "tsne.train",
    "params": {
        "trainingData": <InputQuery>,
        "rowOutputDataset": <OutputDatasetSpec>,
        "numInputDimensions": <int>,
        "numOutputDimensions": <int>,
        "tolerance": <float>,
        "perplexity": <float>,
        "learningRate": <float>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

trainingData (InputQuery)
Specification of the data for input to the t-SNE procedure. This should be organized as an embedding, with each selected row containing the same set of columns with numeric values to be used as coordinates. The select statement does not support GROUP BY and HAVING clauses.

rowOutputDataset (OutputDatasetSpec, default {"type":"embedding"})
Dataset for the t-SNE output, with embeddings of the training data. One row will be added for each row in the input dataset, with a list of coordinates.

numInputDimensions (int, default -1)
Number of dimensions from the input to use. This limits the input to the first n columns in alphabetical order (-1 = all).

numOutputDimensions (int, default 2)
Number of dimensions to produce in t-SNE space. Normally this will be 2 or 3, depending upon the number of dimensions in the visualization.

tolerance (float, default 0.0)
Tolerance of the perplexity calculation. This is an internal parameter that only needs to be changed in rare circumstances.

perplexity (float, default 30)
Perplexity to aim for; higher means more spread out. This controls how hard t-SNE tries to spread the points out. If the resulting output looks more like a ball or a sphere than individual clusters, you should reduce this number. If it looks like a dot or star, you should increase it.

learningRate (float, default 500)
The learning rate specifies the gradient descent step size during optimization of the cost function. A learning rate that is too small may trap the optimization in a local minimum; one that is too high may jump past the optimum. In general, the learning rate should be between 100 and 1000.

modelFileUrl (Url)
URL where the model file (with extension '.tsn') should be saved. This file can be loaded by the tsne.embedRow function type. This parameter is optional unless the functionName parameter is used.

functionName (string)
If specified, an instance of the tsne.embedRow function type of this name will be created using the trained model. Note that to use this parameter, modelFileUrl must also be provided.

runOnCreation (bool, default true)
If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
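
Putting these together, a minimal sketch that relies mostly on the defaults above might look like the following; the procedure name tsne_minimal and the dataset names svd_embedding and tsne_output are hypothetical placeholders:

mldb.put("/v1/procedures/tsne_minimal", {
    "type": "tsne.train",
    "params": {
        # hypothetical input dataset holding a high-dimensional embedding
        "trainingData": "select * from svd_embedding",
        # hypothetical output dataset; "embedding" is the default type
        "rowOutputDataset": {"id": "tsne_output", "type": "embedding"}
    }
})

All other parameters keep their default values: two output dimensions, a perplexity of 30, and runOnCreation set to true, so the procedure runs as soon as it is created.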

The t-SNE procedure takes a high-dimensional embedding as input (often created by applying the SVD procedure) and creates a low-dimensional embedding of the same data points (typically with two or three output dimensions). The trainingData parameter points to a read-only dataset that is queried by the t-SNE training, and the rowOutputDataset parameter describes a dataset to which the coordinates are written.

The input dataset can be filtered and modified with select and where statements. The where statement, in particular, may be necessary to limit the number of rows that are used and therefore limit the run-time of the algorithm. The algorithm used is Barnes-Hut SNE, which can produce maps of up to 100,000 points or so in a reasonable run-time.
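
For instance, one hypothetical way to bound the run-time is to subsample rows in the where clause; the dataset name and the modulus test below are illustrative only:

mldb.put("/v1/procedures/tsne_sampled", {
    "type": "tsne.train",
    "params": {
        # keep roughly one row in ten, staying well under the
        # ~100,000-point practical limit of Barnes-Hut SNE
        "trainingData": "select * from svd_embedding "
                        "where rowHash() % 10 = 0",
        "rowOutputDataset": {"id": "tsne_sample_output",
                             "type": "embedding"}
    }
})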

The perplexity parameter requires further explanation. It controls how many neighbours each data point will try to have. Modifying this value affects the "clumpiness" of the data; for visualization purposes it is reasonable to hand-tune the parameter until a pleasing clustering is obtained.
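
Because this tuning is done by eye, one hedged approach is to train several maps over a range of perplexities and compare them visually; the candidate values, procedure names, and dataset names in this sketch are all hypothetical:

# Train one t-SNE map per candidate perplexity, writing each
# result to its own output dataset for side-by-side comparison.
for perplexity in [5, 30, 100]:
    mldb.put("/v1/procedures/tsne_p%d" % perplexity, {
        "type": "tsne.train",
        "params": {
            "trainingData": "select * from svd_embedding",
            "rowOutputDataset": {"id": "tsne_p%d_output" % perplexity,
                                 "type": "embedding"},
            "perplexity": perplexity,
            "runOnCreation": True
        }
    })

Lower values favour many small, tight clumps; higher values spread the points out, matching the ball-versus-dot guidance in the perplexity description above.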

Examples
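
As an end-to-end sketch, assuming a pymldb-style connection named mldb, and with all dataset, file, and function names below being hypothetical:

# 1. Train the map, save the model file, and expose it as a
#    tsne.embedRow function.
mldb.put("/v1/procedures/tsne_demo", {
    "type": "tsne.train",
    "params": {
        "trainingData": "select * from svd_embedding",
        "rowOutputDataset": {"id": "tsne_demo_output", "type": "embedding"},
        "numOutputDimensions": 2,
        "modelFileUrl": "file://models/tsne_demo.tsn",
        "functionName": "tsne_embedder",
        "runOnCreation": True
    }
})

# 2. Inspect a few of the resulting 2-D coordinates.
print(mldb.query("select * from tsne_demo_output limit 5"))

The tsne_embedder function created in step 1 can then be used to embed new rows via the tsne.embedRow function type, as described under modelFileUrl and functionName above.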

See also