Transform Dataset Procedure

This procedure runs a SQL query on a dataset and records the output in another dataset. It is frequently used to reduce, reshape, and reindex datasets.

It is particularly useful for generating training datasets for machine learning algorithms, which require a pre-indexed dataset with all of the features in place.

Configuration

A new procedure of type transform named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "transform",
    "params": {
        "inputData": <InputQuery>,
        "outputDataset": <OutputDatasetSpec>,
        "skipEmptyRows": <bool>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

inputData
InputQuery

A SQL statement that selects the rows to be transformed from a dataset. It supports all of MLDB's SQL expressions, including but not limited to where, when, order by, and group by clauses, which can be used to refine the set of rows to transform.

outputDataset
OutputDatasetSpec
{"type":"sparse.mutable"}

Output dataset configuration. This may refer either to an existing dataset or to a fully specified dataset that does not yet exist, in which case the procedure creates it.

skipEmptyRows
bool
false

Skip rows from the input dataset for which no values are selected.

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.
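
To make the skipEmptyRows option more concrete, here is a small pure-Python model of the behaviour described above. It is only an illustration, not MLDB code: the function select_rows and the sample rows are hypothetical, and each row is modelled as a sparse dictionary that holds only the columns it has values for.

```python
def select_rows(rows, columns, skip_empty_rows=False):
    """Project each sparse row onto `columns`.

    Hypothetical helper: when `skip_empty_rows` is true, rows for which
    nothing was selected are left out of the output, mirroring the
    documented meaning of skipEmptyRows.
    """
    output = {}
    for row_name, values in rows.items():
        selected = {c: v for c, v in values.items() if c in columns}
        if skip_empty_rows and not selected:
            continue  # no values were selected for this row
        output[row_name] = selected
    return output


# Made-up sample data: row "r2" has no value for column "a".
rows = {
    "r1": {"a": 1, "b": 2},
    "r2": {"b": 3},
    "r3": {"a": 4},
}

kept_all = select_rows(rows, {"a"})
kept_non_empty = select_rows(rows, {"a"}, skip_empty_rows=True)
```

With the default (false), the model keeps r2 as an empty row; with skip_empty_rows=True, r2 is left out of the output.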

Examples
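
The following is a minimal sketch of a filled-in configuration, under several assumptions: the dataset names customers and adult_customers, the procedure name adults_only, and the helper make_transform_config are all hypothetical. The sketch also assumes that inputData accepts a SQL string and that the output dataset is identified with an "id" key. The final call is left as a comment because it only works where an mldb object is available.

```python
def make_transform_config(input_query, output_dataset_id,
                          skip_empty_rows=False, run_on_creation=True):
    """Build a transform procedure configuration (hypothetical helper).

    The keyword defaults match the documented defaults for
    skipEmptyRows and runOnCreation.
    """
    return {
        "type": "transform",
        "params": {
            "inputData": input_query,
            # Assumption: the output dataset is identified by an "id" key.
            "outputDataset": {"id": output_dataset_id, "type": "sparse.mutable"},
            "skipEmptyRows": skip_empty_rows,
            "runOnCreation": run_on_creation,
        },
    }


# Hypothetical query: keep two columns for the rows matching a where clause.
config = make_transform_config(
    "select age, income from customers where age >= 18",
    "adult_customers",
)

# Where an `mldb` object is available, the procedure could be created with:
# mldb.put("/v1/procedures/adults_only", config)
```

Because runOnCreation is left at its default of true, the procedure would run as soon as it is created.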

See also