The embedding dataset stores a fixed-length coordinate vector in each row. It is used to hold the output of embedding algorithms, and allows that output to be queried efficiently with nearest-neighbour queries.

The embedding dataset has strict requirements:

- Each row can only be recorded once
- Each recorded row must have exactly the same set of columns
- Each column value must be a number, and not an infinity or a NaN
- No column can have a null value, or a string value
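
The requirements above can be checked on the client side before any rows are recorded. The following is a minimal sketch in plain Python, not MLDB code: the function name `validate_rows` and the `(row_name, {column: value})` input shape are assumptions made for illustration, as is the choice to reject booleans.

```python
import math
from numbers import Real


def validate_rows(rows):
    """Check (row_name, {column: value}) pairs against the listed requirements.

    Returns the common set of column names, or raises ValueError.
    """
    seen = set()
    columns = None
    for name, values in rows:
        # Each row can only be recorded once.
        if name in seen:
            raise ValueError(f"row {name!r} recorded more than once")
        seen.add(name)
        # Each row must have exactly the same set of columns.
        if columns is None:
            columns = set(values)
        elif set(values) != columns:
            raise ValueError(f"row {name!r} has a different set of columns")
        for col, value in values.items():
            # None and strings are not numbers. bool is a subclass of int in
            # Python, so this sketch rejects it explicitly (an assumption).
            if isinstance(value, bool) or not isinstance(value, Real):
                raise ValueError(f"{name!r}/{col!r} is not a number: {value!r}")
            # Infinities and NaN are not allowed.
            if not math.isfinite(value):
                raise ValueError(f"{name!r}/{col!r} is not finite: {value!r}")
    return columns
```

Running such a check first gives one clear error per offending row instead of a failure partway through recording.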

Currently, the embedding dataset can only exist in memory.

The dataset is typically used as the output of a procedure that generates the embedding, such as the `tsne.train` procedure type, the `svd.train` procedure type or the `kmeans.train` procedure type.

A new dataset of type `embedding` named `<id>` can be created as follows:

```
mldb.put("/v1/datasets/"+<id>, {
    "type": "embedding",
    "params": {
        "metric": <MetricSpace>
    }
})
```

with the following key-value definitions for `params`:

Field, Type, Default | Description |
---|---|
`metric` `MetricSpace` | Metric space used to index the data for nearest-neighbors calculations. Options are `cosine` (which is good for normalized embeddings like the SVD) and `euclidean` (which is good for geometric embeddings like the t-SNE algorithm). |
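
As a rough illustration, the request body for this dataset type could be assembled and checked in plain Python before it is sent. The helper name `embedding_dataset_config` is hypothetical and not part of MLDB; it accepts only the two options named in the description above.

```python
import json

# The two options named in the description of the "metric" field.
USABLE_METRICS = ("cosine", "euclidean")


def embedding_dataset_config(metric):
    """Build the configuration body for an embedding dataset (hypothetical helper)."""
    if metric not in USABLE_METRICS:
        raise ValueError(f"unsupported metric {metric!r}; expected one of {USABLE_METRICS}")
    return {"type": "embedding", "params": {"metric": metric}}


# Show the JSON that would be passed as the second argument of the PUT request.
print(json.dumps(embedding_dataset_config("cosine"), indent=4))
```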

The metric field has the following possibilities:

`MetricSpace`

Value | Description |
---|---|
`none` | No metric is chosen. This will cause an error. |
`euclidean` | Use Euclidean distance for metric. This is a good choice for geometric embeddings like the t-SNE algorithm. |
`cosine` | Use cosine distance for metric. This is a good choice for normalized and high-dimensional embeddings like the SVD. |
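
To make the difference between the two metrics concrete, here is a small sketch in plain Python. It assumes the common definition of cosine distance as one minus the cosine similarity; the exact formula MLDB uses internally is not stated here, and the function names are illustrative only.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity (assumed definition); ignores vector magnitude."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Two vectors pointing the same way but with very different lengths:
# far apart in Euclidean terms, identical in cosine terms.
short, long_ = (1.0, 0.0), (10.0, 0.0)
print(euclidean_distance(short, long_), cosine_distance(short, long_))
```

Because cosine distance depends only on direction, it suits embeddings where the magnitude of a vector carries little meaning, while Euclidean distance suits layouts where absolute position matters.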

The embedding dataset stores an index in a Vantage Point Tree which allows for efficient queries of points that are close in the embedding space. This can be used for nearest-neighbors searches, which when combined with a good embedding algorithm can be used to implement recommendations.

See the `embedding.neighbors` function type for more details.
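
For readers unfamiliar with the data structure, the following is a minimal, generic sketch of a Vantage Point Tree in plain Python. It is not MLDB's implementation: the names `VPNode`, `build` and `nearest` are made up for illustration, and the sketch uses Euclidean distance only, since the pruning step relies on the triangle inequality.

```python
import heapq
import math
import random


class VPNode:
    """One vantage point, a radius, and the points inside/outside that radius."""

    def __init__(self, point, radius, inside, outside):
        self.point = point
        self.radius = radius
        self.inside = inside
        self.outside = outside


def build(points, rng=None):
    """Recursively build a tree from a sequence of coordinate tuples."""
    rng = rng or random.Random(0)
    points = list(points)
    if not points:
        return None
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vantage, 0.0, None, None)
    dists = [math.dist(vantage, p) for p in points]
    radius = sorted(dists)[len(dists) // 2]  # median distance to the vantage point
    inside = [p for p, d in zip(points, dists) if d < radius]
    outside = [p for p, d in zip(points, dists) if d >= radius]
    return VPNode(vantage, radius, build(inside, rng), build(outside, rng))


def nearest(root, query, k=1):
    """Return the k closest points as (distance, point) pairs, nearest first."""
    heap = []  # max-heap of the best k so far, stored as (-distance, point)

    def tau():
        # Current search radius: distance to the k-th best point found so far.
        return -heap[0][0] if len(heap) == k else math.inf

    def visit(node):
        if node is None:
            return
        d = math.dist(query, node.point)
        if len(heap) < k:
            heapq.heappush(heap, (-d, node.point))
        elif d < tau():
            heapq.heapreplace(heap, (-d, node.point))
        # Search the more promising side first, then the other side only if
        # the triangle inequality cannot rule it out.
        if d < node.radius:
            visit(node.inside)
            if d + tau() >= node.radius:
                visit(node.outside)
        else:
            visit(node.outside)
            if d - tau() <= node.radius:
                visit(node.inside)

    visit(root)
    return sorted((-neg_d, p) for neg_d, p in heap)
```

Each node splits the remaining points by their distance to a vantage point, so a query can skip whole subtrees that cannot contain a closer point than those already found. This sketch recurses once per tree level, so heavily duplicated points can make it deep; a production index would handle that case separately.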

- The Recommending Movies demo notebook
- The Exploring Favourite Recipes demo notebook

- Vantage Point Tree is the data structure used to allow quick lookups
- the `embedding.neighbors` function type is used to find nearest neighbors in an embedding dataset
- the `kmeans.train` procedure type is another way of identifying similar points
- the `svd.train` procedure type is often used to train an embedding with a high number of dimensions
- the `tsne.train` procedure type can be used to train a 2 or 3 dimensional embedding