This procedure trains a truncated, abbreviated singular value decomposition (SVD) over a dataset. It stores the model file to disk and/or applies the model to the input data to produce an embedded output dataset for columns and/or rows.
A singular value decomposition is a way of more compactly representing a dataset, by taking advantage of the known relationships and redundancy between the different columns in the dataset. It is most often used to create an unsupervised embedding of a dataset. Mathematically, it looks like this:
\[ A \approx U \Delta V^T \]
where \(A\) is the input matrix (dataset), \(U\) is an orthonormal matrix whose columns are known as the left-singular vectors, \(V\) is an orthonormal matrix whose columns are known as the right-singular vectors, and \(\Delta\) is a diagonal matrix whose diagonal entries are known as the singular values.
Intuitively, \(U\) tells us about how to represent a row as a coordinate in a vector space, and \(V\) tells us how to represent a column as a coordinate in the same vector space. \( \Delta \) tells us how important each of those coordinates is in the reconstruction.
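To make the shapes concrete, the truncated form can be written out as below. The symbols \(n\), \(m\) and \(k\) are introduced here purely for illustration (number of rows, number of columns, and number of retained singular values); they are not parameter names of the procedure.
\[ \underbrace{A}_{n \times m} \approx \underbrace{U}_{n \times k} \, \underbrace{\Delta}_{k \times k} \, \underbrace{V^T}_{k \times m}, \qquad A_{ij} \approx \sum_{l=1}^{k} U_{il} \, \Delta_{ll} \, V_{jl} \]
Under this reading, row \(i\) of \(U\) is the embedding of input row \(i\), row \(j\) of \(V\) is the embedding of input column \(j\), and \(\Delta_{ll}\) weights the \(l\)-th coordinate in the reconstruction.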
As input, the procedure takes a single dataset \(A\). The algorithm is designed for the case where the input is tall and narrow, in other words where the number of rows is higher or much higher than the number of columns. It can work reasonably well up to hundreds of millions of rows and a few million columns.
The SVD algorithm is designed to be robust against real-world datasets, and so preprocessing is done on the columns before the algorithm is applied. Each input column is converted into one or more virtual columns as follows:
- String values are converted into sparse virtual columns that have the value `1` where the string has the given value, and are missing elsewhere.

Columns that have multiple values for the same row have undefined results: they may use an average, or take an arbitrary one of the values, etc.
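
The expansion into virtual columns can be sketched as follows. This is an illustration only, not MLDB's implementation: the function name `expand_row` and the sample rows are hypothetical, and the sketch assumes that a numeric value maps to a single virtual column with the `.numericValue` suffix, following the output naming described later on this page.

```python
from typing import Dict, Union

Value = Union[int, float, str]


def expand_row(row: Dict[str, Value]) -> Dict[str, float]:
    """Expand one input row into virtual columns (illustrative only)."""
    virtual: Dict[str, float] = {}
    for column, value in row.items():
        if isinstance(value, str):
            # One sparse virtual column per distinct string value, set to 1.
            virtual[f"{column}.equalString.{value}"] = 1.0
        else:
            # Assumption: a numeric value becomes a single virtual column.
            virtual[f"{column}.numericValue"] = float(value)
    return virtual


rows = [
    {"age": 31, "country": "CA"},
    {"age": 45, "country": "FR"},
    {"country": "CA"},  # "age" is missing for this row
]
expanded = [expand_row(r) for r in rows]
for e in expanded:
    print(e)
```

Virtual columns that do not apply to a row are simply absent from its dictionary, which mirrors the "missing elsewhere" behaviour for string values.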
The following preprocessing is not performed, and should be done manually if needed:
- Conversion of strings to numbers (for example, `'1'` versus `1`).

The SVD computed is a truncated SVD. This means that it calculates a limited set of singular values and singular vectors that represent the input matrix as closely as possible. Typically between a few tens and a thousand or so singular values will be used.
The SVD is computed on an abbreviated subspace of a representative sample of columns from the initial column space. Currently, the `numDenseBasisVectors` least sparse columns are included in the subspace. This parameter can be used to control the runtime of the algorithm when there are lots of sparse columns.
The embeddings of all columns are calculated, even if they are not one of the dense basis vectors.
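
As a rough sketch of what "least sparse" means here, the snippet below ranks columns by how many rows have a value for them and keeps the top few. It is an assumption about the selection rule, not MLDB's actual code; the function name `least_sparse_columns` and the tie-breaking by name are hypothetical choices made for the example.

```python
from typing import Dict, List


def least_sparse_columns(rows: List[Dict[str, float]],
                         num_dense_basis_vectors: int) -> List[str]:
    """Return the names of the columns with the most non-missing values."""
    counts: Dict[str, int] = {}
    for row in rows:
        for column in row:
            counts[column] = counts.get(column, 0) + 1
    # Sort by descending count; break ties by name so the result is deterministic.
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return ranked[:num_dense_basis_vectors]


rows = [
    {"a": 1.0, "b": 2.0},
    {"a": 3.0, "c": 1.0},
    {"a": 0.5, "b": 1.5, "d": 1.0},
]
basis = least_sparse_columns(rows, 2)
print(basis)
```

Columns outside the selected set (here `c` and `d`) would still receive embeddings, as noted above; they just do not contribute to the basis.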
The SVD algorithm produces three outputs:
- A model file, which can be loaded by the `svd.embedRow` function type. This is a JSON formatted file, and so can be inspected or imported into another system. It contains a `columnIndex` map showing how to map sparse column values onto expanded columns, a `columns` array with the singular vector and scaling information for each column, and a `singularValues` array giving the singular values.
- A dataset (if the `rowOutputDataset` section is filled in) containing the singular vectors for each row in the input dataset. This will be a dense matrix with the same number of rows as the input dataset, and columns with names prefixed with the `outputColumn` and a 4 digit number for each of the singular values.
- A dataset (if the `columnOutputDataset` section is filled in) containing the singular vectors for each column in the input dataset. This will be a dense matrix with a row for each of the virtual columns created as part of the preprocessing, and the same columns as the `rowOutputDataset` dataset. Note that this dataset is transposed with respect to the input dataset.

The names of the columns in the output depend upon the type of the virtual columns:
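- numeric virtual columns are named with the suffix `.numericValue`;
- string-valued virtual columns are named with the suffix `.equalString.` followed by the string value.

A new procedure of type `svd.train` named `<id>` can be created as follows: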
```python
mldb.put("/v1/procedures/"+<id>, {
    "type": "svd.train",
    "params": {
        "trainingData": <InputQuery>,
        "columnOutputDataset": <OutputDatasetSpec (Optional)>,
        "rowOutputDataset": <OutputDatasetSpec (Optional)>,
        "numSingularValues": <int>,
        "numDenseBasisVectors": <int>,
        "outputColumn": <PathElement>,
        "modelFileUrl": <Url>,
        "functionName": <string>,
        "runOnCreation": <bool>
    }
})
```
with the following key-value definitions for `params`:
Field, Type, Default | Description |
---|---|
trainingData | Specification of the data for input to the SVD Procedure. This should be organized as an embedding, with each selected row containing the same set of columns with numeric values to be used as coordinates. The select statement does not support groupby and having clauses. |
columnOutputDataset | Output dataset for embedding (column singular vectors go here) |
rowOutputDataset | Output dataset for embedding (row singular vectors go here) |
numSingularValues | Maximum number of singular values to work with. If there are not enough degrees of freedom in the dataset (it is rank-deficient), then fewer than this number may be used. |
numDenseBasisVectors | Maximum number of dense basis vectors to use for the SVD. This parameter gives the number of dimensions into which the projection is made. Higher values may allow the SVD to model slightly more diverse behaviour. The runtime goes up with the square of this parameter: using 10 times as many basis vectors takes roughly 100 times as long to run. |
outputColumn | Base name of the column that will be written by the SVD. It will be an embedding with numSingularValues elements. |
modelFileUrl | URL where the model file (with extension '.svd') should be saved. This file can be loaded by the |
functionName | If specified, an instance of the |
runOnCreation | If true, the procedure will be run immediately. The response will contain an extra field called |
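
For illustration, the template above might be filled in along the following lines. Everything concrete here is an assumption made for the example: the dataset name `example_dataset`, the output dataset ids, the model file URL, the choice to pass `trainingData` as a SQL string and the output datasets as objects with an `id`, and the helper name `svd_train_config`. The snippet only builds and prints the configuration; submitting it requires a running MLDB instance.

```python
import json


def svd_train_config(input_query, model_file_url, num_singular_values=100):
    """Build an svd.train procedure configuration (all values are placeholders)."""
    return {
        "type": "svd.train",
        "params": {
            "trainingData": input_query,
            # Hypothetical output dataset ids; the spec shape is an assumption.
            "rowOutputDataset": {"id": "example_svd_rows"},
            "columnOutputDataset": {"id": "example_svd_columns"},
            "numSingularValues": num_singular_values,
            "outputColumn": "svd",
            "modelFileUrl": model_file_url,
            "runOnCreation": True,
        },
    }


config = svd_train_config("select * from example_dataset",
                          "file://example_model.svd")
print(json.dumps(config, indent=2))
# With a client object named `mldb`, as in the template above, this would be
# submitted as: mldb.put("/v1/procedures/example_svd", config)
```

Building the configuration as a plain dictionary first makes it easy to inspect or log before the procedure is created.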
If the SVD produces all-zero vectors for rows or columns, it may be one of the following:
(For an SVD to produce meaningful results, it needs to be able to determine how a given column varies with other columns, which means the column needs to be present in two or more rows, each of which contains columns present in two or more rows. The same holds in the other direction.)
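
A first-pass check for this situation can be sketched as below. It only tests the outermost condition (columns present in fewer than two rows, and rows containing fewer than two columns), not the full nested requirement, and it is not part of MLDB; the function name `weakly_connected` and the sample data are hypothetical.

```python
from typing import Dict, List, Set


def weakly_connected(rows: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Flag columns seen in fewer than two rows and rows with fewer than two columns."""
    rows_per_column: Dict[str, int] = {}
    for columns in rows.values():
        for column in columns:
            rows_per_column[column] = rows_per_column.get(column, 0) + 1
    rare_columns = sorted(c for c, n in rows_per_column.items() if n < 2)
    thin_rows = sorted(r for r, columns in rows.items() if len(columns) < 2)
    return {"columns": rare_columns, "rows": thin_rows}


# Each row name maps to the set of columns that have a value in that row.
rows = {
    "r1": {"a", "b"},
    "r2": {"a", "b"},
    "r3": {"c"},
}
report = weakly_connected(rows)
print(report)
```

Rows and columns flagged this way are candidates for all-zero vectors, since the SVD has nothing to relate them to.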