Singular Value Decomposition Training Procedure

This procedure allows for a truncated, abbreviated singular value decomposition to be trained over a dataset. It trains an SVD model and then stores the model file to disk, applies the model to the input data to produce embedded output datasets for columns and/or rows, or does both.

Algorithm description

A singular value decomposition is a way of more compactly representing a dataset, by taking advantage of the known relationships and redundancy between the different columns in the dataset. It is most often used to create an unsupervised embedding of a dataset. Mathematically, it looks like this:

\[ A \approx U \Delta V^T \]

Where \(A\) is the input matrix (dataset), \(U\) is an orthonormal matrix whose columns are known as the left-singular vectors, \(V\) is an orthonormal matrix whose columns are known as the right-singular vectors, and \(\Delta\) is a diagonal matrix whose diagonal entries are known as the singular values.

Intuitively, \(U\) tells us about how to represent a row as a coordinate in a vector space, and \(V\) tells us how to represent a column as a coordinate in the same vector space. \( \Delta \) tells us how important each of those coordinates is in the reconstruction.
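
To make the roles of \(U\), \(\Delta\) and \(V\) concrete, here is a minimal sketch using NumPy's numpy.linalg.svd on a small made-up matrix. It only illustrates the decomposition itself and is not how this procedure is implemented; the matrix values and variable names are assumptions chosen for illustration.

```python
import numpy as np

# Toy dataset: 4 rows, 3 columns (values are made up for illustration).
A = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [4.0, 0.0, 5.0],
    [0.0, 6.0, 0.0],
])

# full_matrices=False gives the compact form:
# U is 4x3, s holds the 3 singular values, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Delta = np.diag(s)

# Multiplying the three factors back together recovers A (up to rounding).
A_rebuilt = U @ Delta @ Vt

# Each row of U is the coordinate of one dataset row;
# each row of V (the transpose of Vt) is the coordinate of one dataset column.
row_coords = U
col_coords = Vt.T
```

The singular values in s come back sorted from largest to smallest, which is what makes it meaningful to keep only the first few coordinates.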

As input, the algorithm takes a single dataset \(A\). It is designed for the case where the input is tall and narrow, that is, where the number of rows is larger or much larger than the number of columns. It can work reasonably well up to hundreds of millions of rows and a few million columns.

Preprocessing

The SVD algorithm is designed to be robust against real-world datasets, and so preprocessing is done on the columns before the algorithm is applied. Each input column is converted into one or more virtual columns as follows:

Columns that have multiple values for the same row have undefined results: the algorithm may use an average of the values, take an arbitrary one of them, etc.

The following preprocessing is not performed, and should be done manually if needed:

  1. String values are not converted to bag-of-words or similar representations. That preprocessing needs to be performed manually.
  2. Categorical columns encoded as integers are treated as real values by the algorithm. For categorical columns, it's best to use strings, even if the string just encodes the number ('1' versus 1).
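
As a rough sketch of the second point, integer category codes can be turned into strings before the data is loaded. The helper name categorical_to_string and the row layout (a list of dicts) are assumptions made for this example, not part of MLDB.

```python
def categorical_to_string(rows, categorical_columns):
    """Return copies of the rows with the named columns converted to strings."""
    converted = []
    for row in rows:
        new_row = dict(row)  # copy, so the caller's data is left untouched
        for name in categorical_columns:
            if name in new_row and new_row[name] is not None:
                new_row[name] = str(new_row[name])
        converted.append(new_row)
    return converted


# "zone" is a category code, "age" is a genuine numeric value.
rows = [
    {"age": 34, "zone": 1},
    {"age": 51, "zone": 3},
]
prepared = categorical_to_string(rows, ["zone"])
```

After this step, "zone" holds '1' and '3' rather than 1 and 3, so it would no longer be read as a real-valued column.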

Truncation

The SVD computed is a truncated SVD. This means that it calculates a limited set of singular values and singular vectors that represent the input matrix as closely as possible. Typically between a few tens and a thousand or so singular values will be used.
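
The effect of truncation can be sketched with NumPy by keeping only the first k singular values of a full decomposition. This illustrates the idea rather than the algorithm used by the procedure; the random data and the choice k = 3 are assumptions, with k playing the role of numSingularValues.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))  # tall and narrow, made-up data

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3  # number of singular values kept
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius-norm error of the rank-k approximation equals the
# root-sum-of-squares of the discarded singular values.
error = np.linalg.norm(A - A_k)
expected_error = np.sqrt(np.sum(s[k:] ** 2))
```

Because the discarded singular values are the smallest ones, raising k can only shrink the reconstruction error.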

Abbreviation

The SVD is computed on an abbreviated subspace formed from a representative sample of columns from the initial column space. Currently, the numDenseBasisVectors least sparse columns are included in the subspace. This parameter can be used to control the runtime of the algorithm when there are lots of sparse columns.

The embeddings of all columns are calculated, even if they are not one of the dense basis vectors.
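
One way to picture the selection of the least sparse columns is the sketch below, which counts how many rows each column appears in and keeps the top few. How MLDB actually measures sparsity and breaks ties is not specified here, so the function densest_columns and its tie-breaking rule are assumptions.

```python
from collections import Counter


def densest_columns(rows, num_dense_basis_vectors):
    """Return the names of the columns that are present in the most rows."""
    counts = Counter()
    for row in rows:
        counts.update(row.keys())
    # Sort by descending count, then by name so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:num_dense_basis_vectors]]


# Sparse rows: a column that is absent from a dict is missing for that row.
rows = [
    {"a": 1, "b": 1},
    {"a": 1, "c": 1},
    {"a": 1, "b": 1, "d": 1},
]
basis = densest_columns(rows, 2)
```

Here "a" appears in three rows and "b" in two, so they would form the basis, while "c" and "d" would still receive embeddings without contributing to it.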

Format of the output

The SVD algorithm produces three outputs:

Configuration

A new procedure of type svd.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "svd.train",
    "params": {
        "trainingData": <InputQuery>,
        "columnOutputDataset": <OutputDatasetSpec (Optional)>,
        "rowOutputDataset": <OutputDatasetSpec (Optional)>,
        "numSingularValues": <int>,
        "numDenseBasisVectors": <int>,
        "outputColumn": <PathElement>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

trainingData
InputQuery

Specification of the data for input to the SVD Procedure. This should be organized as an embedding, with each selected row containing the same set of columns with numeric values to be used as coordinates. The select statement does not support groupby and having clauses.

columnOutputDataset
OutputDatasetSpec (Optional)
{"type":"embedding"}

Output dataset for embedding (column singular vectors go here)

rowOutputDataset
OutputDatasetSpec (Optional)
{"type":"embedding"}

Output dataset for embedding (row singular vectors go here)

numSingularValues
int
100

Maximum number of singular values to work with. If there are not enough degrees of freedom in the dataset (it is rank-deficient), then fewer than this number may be used.

numDenseBasisVectors
int
2000

Maximum number of dense basis vectors to use for the SVD. This parameter gives the number of dimensions into which the projection is made. Higher values may allow the SVD to model slightly more diverse behaviour. The runtime goes up with the square of this parameter: using 10 times as many basis vectors takes about 100 times as long to run.

outputColumn
PathElement
"embedding"

Base name of the column that will be written by the SVD. It will be an embedding with numSingularValues elements.

modelFileUrl
Url

URL where the model file (with extension '.svd') should be saved. This file can be loaded by the svd.embedRow function type. This parameter is optional unless the functionName parameter is used.

functionName
string

If specified, an instance of the svd.embedRow function type of this name will be created using the trained model. Note that to use this parameter, the modelFileUrl must also be provided.

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
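
As an illustration, a filled-in configuration might look like the sketch below. The procedure name, dataset names, file URL and function name are all hypothetical, and writing trainingData as a SQL string and giving the output datasets an "id" field are assumptions about the InputQuery and OutputDatasetSpec types. The numeric values are simply the defaults listed above.

```python
import json

procedure_id = "my_svd"  # hypothetical procedure name

config = {
    "type": "svd.train",
    "params": {
        # Hypothetical dataset; the SQL string form is an assumption.
        "trainingData": "select * from my_dataset",
        # Hypothetical output dataset ids.
        "columnOutputDataset": {"id": "my_svd_columns", "type": "embedding"},
        "rowOutputDataset": {"id": "my_svd_rows", "type": "embedding"},
        "numSingularValues": 100,
        "numDenseBasisVectors": 2000,
        "outputColumn": "embedding",
        # functionName requires modelFileUrl to be set as well.
        "modelFileUrl": "file://my_svd.svd",
        "functionName": "my_svd_embed",
        "runOnCreation": True,
    },
}

path = "/v1/procedures/" + procedure_id
body = json.dumps(config, indent=4)

# With an MLDB connection object named `mldb`, as in the template above,
# the procedure would then be created with: mldb.put(path, config)
```

Building the configuration as a plain dict first makes it easy to inspect or log the JSON body before sending it.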

Restrictions

Troubleshooting

If the SVD produces all-zero vectors for rows or columns, it may be one of the following:

(For an SVD to produce meaningful results, it needs to be able to determine how a given column varies with other columns. This means the column needs to be present in two or more rows, each of which contains columns that are themselves present in two or more rows. The same holds in the other direction, for rows.)
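
A quick first check along these lines is to look for columns that occur in fewer than two rows and rows that hold fewer than two columns. The sketch below does only this first-level check, not the full chained condition, and the function name and data layout are assumptions for illustration.

```python
from collections import Counter


def isolated_columns_and_rows(rows):
    """rows maps a row name to a dict of {column name: value}."""
    column_counts = Counter()
    for columns in rows.values():
        column_counts.update(columns.keys())
    # Columns seen in fewer than two rows cannot co-vary with anything.
    rare_columns = sorted(c for c, n in column_counts.items() if n < 2)
    # Rows with fewer than two columns have the same problem in reverse.
    thin_rows = sorted(r for r, columns in rows.items() if len(columns) < 2)
    return rare_columns, thin_rows


rows = {
    "r1": {"a": 1, "b": 1},
    "r2": {"a": 1, "b": 1, "c": 1},
    "r3": {"d": 1},
}
rare_columns, thin_rows = isolated_columns_and_rows(rows)
```

In this toy data, columns "c" and "d" and row "r3" are flagged, so they would be the first candidates to inspect if their vectors came back as all zeros.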

Examples

See also