Tabular Dataset

The Tabular Dataset represents dense datasets with more rows than columns. It is well suited to storing data loaded from text files such as Comma-Separated Values (CSV) files.

The dataset learns the available columns from the first data recorded into it and is optimized to record many rows with the same columns. (When unknownColumns is set to add, the dataset is not limited to the columns from the first data, but those are the ones for which fast storage is pre-allocated.)

On a fast machine, it is capable of recording several million rows per second from a 10-column CSV file.

Configuration

A new dataset of type tabular named <id> can be created as follows:

mldb.put("/v1/datasets/"+<id>, {
    "type": "tabular",
    "params": {
        "unknownColumns": <UnknownColumnAction>
    }
})

with the following key-value definitions for params:

Field: unknownColumns
Type: UnknownColumnAction
Default: "error"
Description: Action to take on unknown columns. Values are 'ignore', 'error' (default), or 'add', which allows an unlimited number of sparse columns to be added.
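
As a minimal sketch of the configuration above, the request body can be built and checked in Python before it is sent. The helper name tabular_dataset_config and the dataset id "example" are hypothetical and used only for illustration; the mldb object is assumed to be the same connection object used in the request shown earlier.

```python
# Allowed values of UnknownColumnAction, as listed in this document.
UNKNOWN_COLUMN_ACTIONS = ("ignore", "error", "add")


def tabular_dataset_config(unknown_columns="error"):
    """Build the JSON body for creating a tabular dataset.

    Hypothetical helper, not part of MLDB: it only assembles the
    dictionary shown in the configuration section and rejects values
    that are not in the UnknownColumnAction enumeration.
    """
    if unknown_columns not in UNKNOWN_COLUMN_ACTIONS:
        raise ValueError(
            "unknownColumns must be one of %r, got %r"
            % (UNKNOWN_COLUMN_ACTIONS, unknown_columns)
        )
    return {
        "type": "tabular",
        "params": {"unknownColumns": unknown_columns},
    }


config = tabular_dataset_config("add")

# With an MLDB connection object named `mldb` (an assumption here),
# the dataset would then be created with:
# mldb.put("/v1/datasets/example", config)
```

Validating the value on the client side is a design choice of this sketch; it simply reports a bad value before any request is made.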

Storing non-uniform data

The tabular dataset supports storing non-uniform data, such as that which comes from a JSON file with varying fields. This is enabled by setting the unknownColumns field to add. All possible values are listed below:

Enumeration UnknownColumnAction

Value    Description
ignore   Unknown columns will be ignored
error    Unknown columns will result in an error
add      Unknown columns will be added as a sparse column
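
The three actions can be illustrated with a toy in-memory model. This is only a sketch of the behaviour described on this page, not MLDB's implementation: the class ToyTabular and its record_row method are hypothetical, and the model assumes that the known columns are exactly those of the first recorded row.

```python
class ToyTabular:
    """Toy model of the unknownColumns behaviour (not MLDB code)."""

    def __init__(self, unknown_columns="error"):
        if unknown_columns not in ("ignore", "error", "add"):
            raise ValueError("invalid unknownColumns: %r" % (unknown_columns,))
        self.unknown_columns = unknown_columns
        self.known = None   # columns learned from the first row
        self.extra = []     # columns added later when the action is "add"
        self.rows = []

    def record_row(self, row):
        # The first row fixes the set of known columns.
        if self.known is None:
            self.known = list(row)
        stored = {}
        for name, value in row.items():
            if name in self.known or name in self.extra:
                stored[name] = value
            elif self.unknown_columns == "add":
                # Keep the value and remember the new column.
                self.extra.append(name)
                stored[name] = value
            elif self.unknown_columns == "error":
                # Reject the whole row; nothing is stored.
                raise KeyError("unknown column: %s" % name)
            # "ignore": the unknown column is silently dropped.
        self.rows.append(stored)


ds = ToyTabular("add")
ds.record_row({"a": 1, "b": 2})   # learns columns a and b
ds.record_row({"a": 3, "c": 4})   # c is unknown and gets added
```

With "ignore" the second row would be stored as {"a": 3}, and with "error" the second call would raise instead of storing the row.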

Limitations

The tabular dataset has the following limitations: