Transform Dataset Procedure

This procedure runs an SQL query on a dataset, and records the output in another dataset. It is frequently used to reduce, reshape and reindex datasets.

It is particularly useful for generating a training dataset for machine learning algorithms, which require a pre-indexed dataset with all of the features in place.

Configuration

A new procedure of type transform named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "transform",
    "params": {
        "inputData": <InputQuery>,
        "outputDataset": <OutputDatasetSpec>,
        "skipEmptyRows": <bool>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default	Description
inputData InputQuery	An SQL statement that selects the rows to be transformed from a dataset. All of MLDB's SQL expressions are supported, including but not limited to where, when, order by and group by clauses. These can be used to refine which rows are transformed.
outputDataset OutputDatasetSpec `{"type":"sparse.mutable"}`	Output dataset configuration. This may refer either to an existing dataset, or to a fully specified dataset that does not yet exist and will be created by the procedure.
skipEmptyRows bool `false`	If true, skip rows from the input dataset for which no values are selected.
runOnCreation bool `true`	If true, the procedure will be run immediately. The response will contain an extra field called `firstRun` pointing to the URL of the run.
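
As an illustration of how these fields fit together, the following is a minimal sketch that builds the request body in Python. The helper `transform_config`, the dataset names `customers` and `adult_customers`, the column names, and the procedure id `adults_only` are all hypothetical and chosen only for this example. The query is passed as a plain SQL string, and the output dataset is given as an object with an `id` and a `type`; both forms are assumptions about what `InputQuery` and `OutputDatasetSpec` accept.

```python
import json


def transform_config(input_query, output_id, output_type="sparse.mutable",
                     skip_empty_rows=False, run_on_creation=True):
    """Build the request body for a procedure of type "transform".

    Hypothetical helper: it only assembles a dictionary from the four
    documented params and does not contact an MLDB server.
    """
    return {
        "type": "transform",
        "params": {
            "inputData": input_query,
            # Assumed shape: an id plus the documented default type.
            "outputDataset": {"id": output_id, "type": output_type},
            "skipEmptyRows": skip_empty_rows,
            "runOnCreation": run_on_creation,
        },
    }


# Hypothetical dataset and column names, for illustration only.
config = transform_config(
    "select age, income from customers where age >= 18 order by age",
    output_id="adult_customers",
    skip_empty_rows=True,
)

print(json.dumps(config, indent=4))

# With an `mldb` connection object such as the one used in the snippet
# above, the procedure could then be created with:
#   mldb.put("/v1/procedures/adults_only", config)
```

Because `runOnCreation` is left at its default of `true` here, the procedure would run as soon as it is created, and the response would carry the `firstRun` field described in the table.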

Examples

The Loading Data Tutorial notebook
