Embedding Dataset

The embedding dataset stores a fixed-length coordinate vector in each row. It is used to hold the output of embedding algorithms and allows that output to be queried efficiently for nearest neighbors.

The embedding dataset has strict requirements:

Currently, the embedding dataset can only exist in memory.

The dataset is typically used as the output of a procedure that generates an embedding, such as the tsne.train, svd.train, or kmeans.train procedure types.

Configuration

A new dataset of type embedding named <id> can be created as follows:

mldb.put("/v1/datasets/"+<id>, {
    "type": "embedding",
    "params": {
        "metric": <MetricSpace>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

metric
MetricSpace
"euclidean"

Metric space used to index the data for nearest-neighbor calculations. Options are 'cosine' (which is good for normalized embeddings like the SVD) and 'euclidean' (which is good for geometric embeddings like the t-SNE algorithm).
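
As an illustration, the request for a concrete dataset can be assembled as shown below. This is a minimal sketch: the helper embedding_dataset_request and the dataset name "svd_embedding" are hypothetical and not part of MLDB, and the final call assumes an mldb connection object like the one used in the snippet above.

```python
import json

# Metric names accepted by the "metric" parameter, as listed on this page.
VALID_METRICS = ("euclidean", "cosine")


def embedding_dataset_request(dataset_id, metric="euclidean"):
    """Return the (path, payload) pair for creating an embedding dataset.

    Hypothetical helper for illustration only; it is not part of MLDB.
    """
    if metric not in VALID_METRICS:
        # "none" appears in the MetricSpace enumeration but is documented
        # to cause an error, so it is rejected here before any request.
        raise ValueError("unsupported metric: %r" % (metric,))
    path = "/v1/datasets/" + dataset_id
    payload = {"type": "embedding", "params": {"metric": metric}}
    return path, payload


path, payload = embedding_dataset_request("svd_embedding", metric="cosine")
print(path)
print(json.dumps(payload, indent=4))

# With an MLDB connection object named `mldb` in scope (an assumption),
# the dataset would then be created with:
#     mldb.put(path, payload)
```

Building the payload separately from sending it makes it easy to check the configuration before it reaches the server.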

Metric space

The metric field has the following possibilities:

Enumeration MetricSpace

Value, Description

none

No metric is chosen. This will cause an error.

euclidean

Use Euclidean distance as the metric. This is a good choice for geometric embeddings like the t-SNE algorithm.

cosine

Use cosine distance as the metric. This is a good choice for normalized and high-dimensional embeddings like the SVD.
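
The difference between the two metrics can be seen on a few small vectors. The sketch below uses the common textbook definitions (Euclidean distance as the root of summed squared differences, cosine distance as one minus cosine similarity). It is an assumption that MLDB computes its distances in exactly this form; the function names are illustrative only.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [1.0, 0.0]
b = [3.0, 0.0]  # same direction as a, three times as long
c = [0.0, 1.0]  # orthogonal to a

print(euclidean_distance(a, b))  # 2.0: the lengths differ
print(cosine_distance(a, b))     # 0.0: the directions are identical
print(cosine_distance(a, c))     # 1.0: the directions are orthogonal
```

Cosine distance ignores vector length and compares direction only, which is why it suits normalized embeddings, while Euclidean distance is sensitive to position in the space.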

Querying Nearest Neighbors

The embedding dataset indexes its rows in a vantage-point tree, which allows efficient queries for points that are close together in the embedding space. This supports nearest-neighbor searches, which, when combined with a good embedding algorithm, can be used to implement recommendations.

See the embedding.neighbors function type for more details.
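
To show how a vantage-point tree narrows a nearest-neighbor search, here is a generic, minimal sketch in plain Python. It is not MLDB's implementation; the names VPNode, build and nearest are hypothetical, and the pruning step assumes the distance function obeys the triangle inequality (as Euclidean distance does).

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points closer than radius
        self.outside = outside  # subtree of points at radius or beyond


def build(points, dist, rng):
    """Recursively split the points around a randomly chosen vantage point."""
    if not points:
        return None
    points = list(points)
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vantage, 0.0, None, None)
    distances = [dist(vantage, p) for p in points]
    radius = sorted(distances)[len(distances) // 2]
    inside = [p for p, d in zip(points, distances) if d < radius]
    outside = [p for p, d in zip(points, distances) if d >= radius]
    return VPNode(vantage, radius,
                  build(inside, dist, rng), build(outside, dist, rng))


def nearest(node, query, dist, best=None):
    """Return (distance, point) for the stored point closest to query."""
    if node is None:
        return best
    d = dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    if d < node.radius:
        # The query lies inside the ball: search there first, then cross
        # the boundary only if a closer point could exist outside.
        best = nearest(node.inside, query, dist, best)
        if d + best[0] >= node.radius:
            best = nearest(node.outside, query, dist, best)
    else:
        best = nearest(node.outside, query, dist, best)
        if d - best[0] < node.radius:
            best = nearest(node.inside, query, dist, best)
    return best


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0), (9.0, 0.0)]
tree = build(points, euclidean, random.Random(0))
found_distance, found_point = nearest(tree, (5.2, 4.9), euclidean)
print(found_point, found_distance)
```

For the query (5.2, 4.9) the closest stored point is (5.0, 5.0). Subtrees that cannot contain a closer point are skipped, so a query usually visits only part of the tree instead of every row.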
