This procedure trains a truncated singular value decomposition (SVD) model over a dataset. It stores the model file to disk and/or applies the model to the input data to produce an embedded output dataset for columns and/or rows.

A singular value decomposition is a way of more compactly representing a dataset, by taking advantage of the known relationships and redundancy between the different columns in the dataset. It is most often used to create an unsupervised embedding of a dataset. Mathematically, it looks like this:

\[ A \approx U \Delta V^T \]

where \(A\) is the input matrix (dataset), \(U\) is an orthonormal matrix known as the left-singular vectors, \(V\) is an orthonormal matrix known as the right-singular vectors, and \(\Delta\) is a diagonal matrix whose diagonal entries are known as the singular values.

Intuitively, \(U\) tells us about how to represent a row as a coordinate in a vector space, and \(V\) tells us how to represent a column as a coordinate in the same vector space. \( \Delta \) tells us how important each of those coordinates is in the reconstruction.
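As a concrete illustration (a sketch in numpy rather than MLDB, with made-up values), a truncated SVD can reconstruct a low-rank matrix exactly from only its nonzero singular values and the corresponding vectors:

```python
import numpy as np

# A small "dataset": 4 rows, 3 columns (hypothetical values, rank 2)
A = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 0.0, 6.0]])

# Full SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate to the top k singular values
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A has rank 2, so the rank-2 reconstruction is exact
print(np.allclose(A, A_k))  # True
```

With a full-rank input, the truncated reconstruction is instead the closest rank-\(k\) approximation to \(A\).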

As input, it takes a single dataset \(A\). The algorithm is designed for the case where the input is tall and narrow, in other words where the number of rows is higher or much higher than the number of columns. It can work reasonably well up to hundreds of millions of rows and a few million columns.

The SVD algorithm is designed to be robust against real-world datasets, and so preprocessing is done on the columns before the algorithm is applied. Each input column is converted into one or more virtual columns as follows:

- Real valued, dense columns are used as-is. Note that infinite values are not currently accepted. Not a Number ("NaN") values are assumed to be missing.
- String valued columns are converted into multiple columns, with one column per string value. These sparse columns have the value `1` where the string has the given value, and are missing elsewhere.
- Mixed-value columns are currently not accepted.
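The string-column expansion can be sketched as follows (a hypothetical illustration in plain Python; the column name `color` and its values are made up, and MLDB leaves non-matching entries missing rather than storing `0`):

```python
# Each distinct string value becomes its own sparse 0/1 virtual column,
# named with the `.equalString.` suffix described later in this page.
rows = ["red", "blue", "red", "green"]
values = sorted(set(rows))  # ['blue', 'green', 'red']
virtual = {f"color.equalString.{v}": [1.0 if r == v else 0.0 for r in rows]
           for v in values}
print(virtual["color.equalString.red"])  # [1.0, 0.0, 1.0, 0.0]
```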

Columns that have multiple values for the same row have undefined results: they may use an average, or take an arbitrary one of the values, etc.

The following preprocessing is not performed, and should be done manually if needed:

- String values are not converted to bag-of-words representations or anything like that. That preprocessing needs to be performed manually.
- In the case of categorical columns encoded as integers, the algorithm will treat them as real values. For categorical columns, it's best to use strings, even if the string just encodes the number (`'1'` versus `1`).
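The integer-versus-string distinction can be made concrete with a small sketch (values are made up):

```python
# An integer-encoded category is treated as a single real-valued column,
# so the SVD would see 3 as "three times as much" as 1.
numeric_category = [1, 2, 1, 3]

# Encoding the same values as strings makes them categorical: each
# distinct value expands to its own sparse virtual column instead.
string_category = [str(v) for v in numeric_category]
print(sorted(set(string_category)))  # ['1', '2', '3']
```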

The SVD computed is a *truncated* SVD. This means that it calculates a limited set
of singular values and singular vectors that represent the input matrix as closely
as possible. Typically between a few tens and a thousand or so singular values will
be used.

The SVD is computed on an abbreviated subspace of a representative sample of
columns from the initial column space. Currently, the `numDenseBasisVectors`
least sparse columns are included in the subspace. This parameter can be used to
control the runtime of the algorithm when there are lots of sparse columns.

The embeddings of all columns are calculated, even if they are not one of the dense basis vectors.
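The relationship that allows embeddings to be produced even for columns outside the dense basis can be sketched with numpy (made-up values): since \(A = U \Delta V^T\), the right singular vectors follow from the left side as \(V^T = \Delta^{-1} U^T A\), so any column's embedding can be recovered by projecting it against the row-side factors.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [2.0, 2.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Column embeddings follow from the row side: V^T = diag(1/s) @ U^T @ A
Vt_recovered = np.diag(1.0 / s) @ U.T @ A
print(np.allclose(Vt, Vt_recovered))  # True
```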

The SVD algorithm produces three outputs:

- An SVD model file, which can be loaded by the `svd.embedRow` function type. This is a JSON formatted file, and so can be inspected or imported into another system. It contains a `columnIndex` map showing how to map sparse column values onto expanded columns, a `columns` array with the singular vector and scaling information for each column, and a `singularValues` array giving the singular values of each column.
- A dataset (if the `rowOutputDataset` section is filled in) containing the singular vectors for each row in the input dataset. This will be a dense matrix with the same number of rows as the input dataset, and columns with names prefixed with the `outputColumn` and a 4 digit number for each of the singular values.
- A dataset (if the `columnOutputDataset` section is filled in) containing the singular vectors for each column in the input dataset. This will be a dense matrix with a row for each of the virtual columns created as part of the preprocessing, and the same columns as the `rowOutputDataset` dataset. Note that this dataset is transposed with respect to the input dataset. The names of the rows in this output depend upon the type of the virtual columns:
  - A virtual column for the numeric value of a column has a row name the same as the column name with the suffix `.numericValue`;
  - A virtual column for a string column has a row name that is the column name with the suffix `.equalString.` followed by the string value.
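Conceptually, embedding a row with the trained model (as the `svd.embedRow` function does) amounts to projecting the row onto the truncated right singular vectors and rescaling by the singular values. A numpy sketch with made-up values:

```python
import numpy as np

# Train a truncated SVD, then embed a row: embedding = x @ V_k @ diag(1/s_k)
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 2.0],
              [4.0, 0.0, 2.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
V_k, s_k = Vt[:k].T, s[:k]

x = np.array([2.0, 1.0, 2.0])  # same values as training row 3
embedding = x @ V_k / s_k
print(np.allclose(embedding, U[2, :k]))  # matches that row's singular vector
```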


A new procedure of type `svd.train` named `<id>` can be created as follows:

```
mldb.put("/v1/procedures/"+<id>, {
"type": "svd.train",
"params": {
"trainingData": <InputQuery>,
"columnOutputDataset": <OutputDatasetSpec (Optional)>,
"rowOutputDataset": <OutputDatasetSpec (Optional)>,
"numSingularValues": <int>,
"numDenseBasisVectors": <int>,
"outputColumn": <PathElement>,
"modelFileUrl": <Url>,
"functionName": <string>,
"runOnCreation": <bool>
}
})
```

with the following key-value definitions for `params`:

| Field, Type | Description |
|---|---|
| `trainingData` (InputQuery) | Specification of the data for input to the SVD Procedure. This should be organized as an embedding, with each selected row containing the same set of columns with numeric values to be used as coordinates. The select statement does not support groupby and having clauses. |
| `columnOutputDataset` (OutputDatasetSpec, Optional) | Output dataset for embedding (column singular vectors go here) |
| `rowOutputDataset` (OutputDatasetSpec, Optional) | Output dataset for embedding (row singular vectors go here) |
| `numSingularValues` (int) | Maximum number of singular values to work with. If there are not enough degrees of freedom in the dataset (it is rank-deficient), then less than this number may be used |
| `numDenseBasisVectors` (int) | Maximum number of dense basis vectors to use for the SVD. This parameter gives the number of dimensions into which the projection is made. Higher values may allow the SVD to model slightly more diverse behaviour. The runtime goes up with the square of this parameter, in other words 10 times as many is 100 times as long to run. |
| `outputColumn` (PathElement) | Base name of the column that will be written by the SVD. It will be an embedding with numSingularValues elements. |
| `modelFileUrl` (Url) | URL where the model file (with extension '.svd') should be saved. This file can be loaded by the `svd.embedRow` function type. |
| `functionName` (string) | If specified, an instance of the `svd.embedRow` function of this name will be created using the trained model. |
| `runOnCreation` (bool) | If true, the procedure will be run immediately. The response will contain an extra field pointing to the first run of the procedure. |
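A concrete configuration might look like the following (a hypothetical sketch: the dataset name, output names, file URL, and parameter values are all made up, not defaults):

```python
# Hypothetical svd.train configuration, built as a plain dict
config = {
    "type": "svd.train",
    "params": {
        "trainingData": "SELECT * FROM user_behaviour",
        "columnOutputDataset": {"id": "behaviour_svd_cols", "type": "embedding"},
        "rowOutputDataset": {"id": "behaviour_svd_rows", "type": "embedding"},
        "numSingularValues": 100,
        "numDenseBasisVectors": 2000,
        "outputColumn": "embedding",
        "modelFileUrl": "file://behaviour.svd",
        "runOnCreation": True,
    },
}
# Inside MLDB this would be posted as:
# mldb.put("/v1/procedures/behaviour_svd", config)
```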

- The SVD algorithm as implemented is designed for the embedding of high dimensional datasets into a lower dimensional space. In the case of rank-constrained input datasets (i.e., very simple datasets with only a few variables and clear relationships between them), it will not necessarily give good results. For those, it may be better to skip the embedding step altogether.
- Columns with a mixture of strings and numbers are currently not accepted. This will be rectified in a future release.
- Columns that contain infinite values are currently not accepted. This will be rectified in a future release.

If the SVD produces all-zero vectors for rows or columns, it may be one of the following:

- A column which is present in only a single row with no other columns will have a zero embedding vector
- A row which is present only in a single column with no other rows will have a zero embedding vector
- If all of the values in the columns for a given row are zero, the row will have a zero embedding vector
- If all of the values in the rows for a given column are zero, the column will have a zero embedding vector
- An empty row or empty column will have a zero embedding vector

(For an SVD to produce meaningful results, it needs to be able to determine how a given column varies with other columns, which means the column needs to be present in two or more rows, each of which contains columns present in two or more rows. The same holds in the other direction.)
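The all-zero-column case can be checked directly with a numpy sketch (made-up values):

```python
import numpy as np

# A column whose values are all zero gets a zero embedding vector
A = np.array([[1.0, 2.0, 0.0],
              [3.0, 1.0, 0.0],
              [2.0, 2.0, 0.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
col_embeddings = Vt[:k].T * s[:k]  # one embedding row per column
print(np.allclose(col_embeddings[2], 0.0))  # True: the all-zero column
```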

- The Recommending Movies demo notebook
- The Mapping Reddit demo notebook
- The Visualizing StackOverflow Tags demo notebook