Singular Value Decomposition Training Procedure

This procedure trains a truncated, abbreviated singular value decomposition (SVD) over a dataset. It stores the model file to disk and/or applies the model to the input data to produce an embedded output dataset for columns and/or rows.

Algorithm description

A singular value decomposition is a way of more compactly representing a dataset, by taking advantage of the known relationships and redundancy between the different columns in the dataset. It is most often used to create an unsupervised embedding of a dataset. Mathematically, it looks like this:

\[ A \approx U \Delta V^T \]

Where \(A\) is the input matrix (dataset), \(U\) is an orthonormal matrix known as the left-singular vectors, \(V\) is an orthonormal matrix known as the right-singular vectors, and \(\Delta\) is a diagonal matrix whose diagonal entries are known as the singular values.

Intuitively, \(U\) tells us about how to represent a row as a coordinate in a vector space, and \(V\) tells us how to represent a column as a coordinate in the same vector space. \( \Delta \) tells us how important each of those coordinates is in the reconstruction.
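Written out entry by entry, the same approximation reads as follows. This is only a restatement of the formula above; the symbol \(r\) for the number of singular values kept is introduced here for illustration.

\[ A_{ij} \approx \sum_{k=1}^{r} U_{ik} \, \Delta_{kk} \, V_{jk} \]

In other words, the value in row \(i\) and column \(j\) is approximated by combining the coordinates of row \(i\) (row \(i\) of \(U\)) with the coordinates of column \(j\) (row \(j\) of \(V\)), each coordinate weighted by its singular value \(\Delta_{kk}\).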

As input, the algorithm takes a single dataset \(A\). It is designed for the case where the input is tall and narrow, in other words where the number of rows is larger or much larger than the number of columns. It can work reasonably well up to hundreds of millions of rows and a few million columns.

Preprocessing

The SVD algorithm is designed to be robust against real-world datasets, and so preprocessing is done on the columns before the algorithm is applied. Each input column is converted into one or more virtual columns as follows:

Real valued, dense columns are used as-is. Note that infinite values are not currently accepted. Not a Number ("NaN") values are assumed to be missing.
String valued columns are converted into multiple columns, with one column per string value. These sparse columns have the value 1 where the string has the given value, and are missing elsewhere.
Mixed-value columns are currently not accepted.

Columns that have multiple values for the same row have undefined results: they may use an average, or take an arbitrary one of the values, etc.

The following preprocessing is not performed, and should be done manually if needed:

String values are not converted to bag-of-words or similar representations. That preprocessing needs to be performed manually.
Categorical columns encoded as integers are treated by the algorithm as real values. For categorical columns, it's best to use strings, even if the string just encodes the number ('1' versus 1).
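The column handling described in this section can be sketched in a few lines of Python. This is an illustration only, not the procedure's actual code: the function name `expand_row` and the sample data are hypothetical, and the virtual column names reuse the `.numericValue` and `.equalString.` suffixes that this document gives for output row names.

```python
import math

def expand_row(row):
    """Map one input row (column name -> value) to sparse virtual columns."""
    virtual = {}
    for name, value in row.items():
        if isinstance(value, str):
            # One virtual column per distinct string value, set to 1 where it occurs.
            virtual[f"{name}.equalString.{value}"] = 1.0
        elif isinstance(value, (int, float)):
            if math.isnan(value):
                continue  # NaN is assumed to be missing
            if math.isinf(value):
                raise ValueError(f"infinite value in column {name!r} is not accepted")
            # Real-valued columns are used as-is.
            virtual[f"{name}.numericValue"] = float(value)
        else:
            raise TypeError(f"unsupported value type in column {name!r}")
    return virtual

# Hypothetical rows: plan_int holds a category encoded as an integer,
# plan_str holds the same category encoded as a string.
rows = [
    {"age": 31, "city": "Montreal", "plan_int": 1, "plan_str": "1"},
    {"age": float("nan"), "city": "Paris", "plan_int": 2, "plan_str": "2"},
]
expanded = [expand_row(r) for r in rows]
for virtual_row in expanded:
    print(virtual_row)
```

Note how `plan_int` ends up as a single real-valued column, while `plan_str` produces one indicator column per category, which is why strings are the better encoding for categorical data.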

Truncation

The SVD computed is a truncated SVD. This means that it calculates a limited set of singular values and singular vectors that represent the input matrix as closely as possible. Typically between a few tens and a thousand or so singular values will be used.
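To make the idea of truncation concrete, here is a minimal pure-Python sketch that keeps only one singular value. It uses plain power iteration, which is a stand-in technique chosen for brevity and not the solver this procedure uses; the function name and the sample matrix are hypothetical.

```python
import math

def top_singular_triplet(a, iterations=200):
    """Approximate the largest singular value of `a` and its singular vectors.

    Uses power iteration on A^T A. Assumes `a` is non-zero and that the start
    vector is not orthogonal to the leading right-singular vector.
    """
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0] + [0.0] * (n_cols - 1)  # arbitrary unit start vector
    for _ in range(iterations):
        u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]  # u = A v
        w = [sum(a[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v

a = [[3.0, 1.0],
     [1.0, 3.0],
     [0.0, 0.0]]
sigma, u, v = top_singular_triplet(a)

# Rank-1 truncated reconstruction: sigma * u * v^T
rank_one = [[sigma * u[i] * v[j] for j in range(len(v))] for i in range(len(u))]
print(sigma)
print(rank_one)
```

For this matrix the exact singular values are 4 and 2, so `sigma` should come out very close to 4, and the rank-1 reconstruction keeps the part of the matrix shared by both columns while dropping the part in which they differ.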

Abbreviation

The SVD is computed on an abbreviated subspace of a representative sample of columns from the initial column space. Currently, the numDenseBasisVectors least sparse columns are included in the subspace. This parameter can be used to control the runtime of the algorithm when there are lots of sparse columns.

The embeddings of all columns are calculated, even if they are not one of the dense basis vectors.
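Selecting the least sparse columns can be pictured as counting how many rows each column appears in and keeping the top few. The sketch below assumes rows are given as dictionaries with missing values simply absent; the function name and the tie-breaking by column name are assumptions made for the example, not documented behaviour.

```python
def least_sparse_columns(rows, num_dense_basis_vectors):
    """Return the names of the columns present in the most rows."""
    counts = {}
    for row in rows:
        for column in row:
            counts[column] = counts.get(column, 0) + 1
    # Highest count first; ties broken by name (assumption, for determinism only).
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return ranked[:num_dense_basis_vectors]

# Hypothetical sparse rows: "a" is in 3 rows, "b" in 2, "c" and "d" in 1 each.
rows = [
    {"a": 1.0, "b": 1.0},
    {"a": 2.0, "c": 1.0},
    {"a": 3.0, "b": 1.0, "d": 1.0},
]
basis = least_sparse_columns(rows, 2)
print(basis)
```

Columns `c` and `d` are left out of the basis in this example, but as noted above they would still receive an embedding.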

Format of the output

The SVD algorithm produces three outputs:

An SVD model file, which can be loaded by the svd.embedRow function type. This is a JSON formatted file, and so can be inspected or imported into another system. It contains a columnIndex map showing how to map sparse column values onto expanded columns, a columns array with the singular vector and scaling information for each column, and a singularValues array giving the singular values.
A dataset (if the rowOutputDataset section is filled in) containing the singular vectors for each row in the input dataset. This will be a dense matrix with the same number of rows as the input dataset, and columns with names prefixed with the outputColumn and a 4-digit number for each of the singular values.
A dataset (if the columnOutputDataset section is filled in) containing the singular vectors for each column in the input dataset. This will be a dense matrix with a row for each of the virtual columns created as part of the preprocessing, and the same columns as the rowOutputDataset dataset. Note that this dataset is transposed with respect to the input dataset.

The names of the rows in the columnOutputDataset depend upon the type of the virtual column:
- A virtual column for the numeric value of a column has a row name the same as the column name with the suffix .numericValue;
- A virtual column for a string column has a row name that is the column name with the suffix .equalString. followed by the string value.

Configuration

A new procedure of type svd.train named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "svd.train",
    "params": {
        "trainingData": <InputQuery>,
        "columnOutputDataset": <OutputDatasetSpec (Optional)>,
        "rowOutputDataset": <OutputDatasetSpec (Optional)>,
        "numSingularValues": <int>,
        "numDenseBasisVectors": <int>,
        "outputColumn": <PathElement>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

| Field | Type | Default | Description |
|---|---|---|---|
| trainingData | InputQuery | | Specification of the data for input to the SVD Procedure. This should be organized as an embedding, with each selected row containing the same set of columns with numeric values to be used as coordinates. The select statement does not support GROUP BY and HAVING clauses. |
| columnOutputDataset | OutputDatasetSpec (Optional) | `{"type":"embedding"}` | Output dataset for embedding (column singular vectors go here). |
| rowOutputDataset | OutputDatasetSpec (Optional) | `{"type":"embedding"}` | Output dataset for embedding (row singular vectors go here). |
| numSingularValues | int | `100` | Maximum number of singular values to work with. If there are not enough degrees of freedom in the dataset (it is rank-deficient), then fewer than this number may be used. |
| numDenseBasisVectors | int | `2000` | Maximum number of dense basis vectors to use for the SVD. This parameter gives the number of dimensions into which the projection is made. Higher values may allow the SVD to model slightly more diverse behaviour. The runtime goes up with the square of this parameter, in other words 10 times as many takes 100 times as long to run. |
| outputColumn | PathElement | `"embedding"` | Base name of the column that will be written by the SVD. It will be an embedding with numSingularValues elements. |
| modelFileUrl | Url | | URL where the model file (with extension '.svd') should be saved. This file can be loaded by the `svd.embedRow` function type. This parameter is optional unless the `functionName` parameter is used. |
| functionName | string | | If specified, an instance of the `svd.embedRow` function type of this name will be created using the trained model. Note that to use this parameter, the `modelFileUrl` must also be provided. |
| runOnCreation | bool | `true` | If true, the procedure will be run immediately. The response will contain an extra field called `firstRun` pointing to the URL of the run. |
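As a concrete illustration of the template and parameters above, the following sketch builds a filled-in configuration as a Python dictionary and checks the one cross-parameter rule stated in the table (functionName requires modelFileUrl). The helper name `build_svd_train_config`, the dataset ids, the procedure id, the model URL and the SQL string are all hypothetical; passing the query as a SQL string and giving output datasets an `id` key are assumptions about how InputQuery and OutputDatasetSpec are written.

```python
import json

def build_svd_train_config(training_data, **params):
    """Assemble an svd.train configuration and check parameter dependencies."""
    merged = {"trainingData": training_data, **params}
    if "functionName" in merged and "modelFileUrl" not in merged:
        # Rule taken from the parameter descriptions above.
        raise ValueError("functionName requires modelFileUrl to be provided")
    return {"type": "svd.train", "params": merged}

config = build_svd_train_config(
    "SELECT * FROM example_dataset",  # hypothetical query (assumed SQL string form)
    columnOutputDataset={"id": "example_svd_columns", "type": "embedding"},
    rowOutputDataset={"id": "example_svd_rows", "type": "embedding"},
    numSingularValues=100,       # documented default
    numDenseBasisVectors=2000,   # documented default
    outputColumn="embedding",    # documented default
    modelFileUrl="file://example_model.svd",  # hypothetical location
    functionName="example_svd_embed",         # hypothetical function name
    runOnCreation=True,
)
print(json.dumps(config, indent=4))

# With an MLDB connection object named `mldb`, as in the template above,
# the configuration would then be submitted along these lines:
# mldb.put("/v1/procedures/example_svd", config)
```

Checking the dependency on the client side is a design convenience only: it surfaces the mistake before any request is sent.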

Restrictions

The SVD algorithm as implemented is designed for the embedding of high dimensional datasets into a lower dimensional space. In the case of rank-constrained input datasets (i.e., very simple datasets with only a few variables and clear relationships between them), it will not necessarily give good results. For those, it may be better to skip the embedding step altogether.
Columns with a mixture of strings and numbers are currently not accepted. This will be rectified in a future release.
Columns that contain infinite values are currently not accepted. This will be rectified in a future release.

Troubleshooting

If the SVD produces all-zero vectors for rows or columns, it may be one of the following:

A column which is present in only a single row with no other columns will have a zero embedding vector.
A row which is present only in a single column with no other rows will have a zero embedding vector.
If all of the values in the columns for a given row are zero, the row will have a zero embedding vector.
If all of the values in the rows for a given column are zero, the column will have a zero embedding vector.
An empty row or empty column will have a zero embedding vector.

(For an SVD to produce meaningful results, it needs to be able to determine how a given column varies with other columns, which means the column needs to be present in two or more rows, each of which contains columns present in two or more rows. The same holds in the other direction, for rows.)
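The cases listed above can be checked on the input before training. The sketch below is a hypothetical diagnostic helper, not part of MLDB: it assumes the data has been pulled into a dictionary of rows, with missing values absent, and flags rows and columns that match one of the listed conditions. A column that is absent from every row cannot be represented in this form, so that case is not covered.

```python
def zero_embedding_suspects(rows):
    """Return (row names, column names) likely to receive all-zero embeddings.

    `rows` maps row name -> {column name: value}; missing values are absent.
    """
    present_in = {}  # column name -> set of rows in which it is present
    for row_name, cells in rows.items():
        for column in cells:
            present_in.setdefault(column, set()).add(row_name)

    suspect_rows, suspect_columns = set(), set()
    for row_name, cells in rows.items():
        # Empty row, or a row whose values are all zero.
        if all(value == 0 for value in cells.values()):
            suspect_rows.add(row_name)
        # Isolated cell: the row's only column appears in no other row.
        if len(cells) == 1:
            (column,) = cells
            if present_in[column] == {row_name}:
                suspect_rows.add(row_name)
                suspect_columns.add(column)
    for column, row_names in present_in.items():
        # Column whose values are all zero.
        if all(rows[r][column] == 0 for r in row_names):
            suspect_columns.add(column)
    return suspect_rows, suspect_columns

# Hypothetical data exercising each case.
rows = {
    "r1": {"a": 1.0, "b": 2.0},
    "r2": {"a": 3.0, "b": 0.0, "z": 0.0},
    "r3": {"lonely": 5.0},  # isolated row/column pair
    "r4": {},               # empty row
    "r5": {"z": 0.0},       # all-zero row; "z" is an all-zero column
}
suspect_rows, suspect_columns = zero_embedding_suspects(rows)
print(sorted(suspect_rows), sorted(suspect_columns))
```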

Examples

The Recommending Movies demo notebook
The Mapping Reddit demo notebook
The Visualizing StackOverflow Tags demo notebook