The Tabular Dataset is used to represent dense datasets with more rows than columns. It is ideal for storing text files such as Comma-Separated Values (CSV) files.
The dataset learns its available columns from the first data recorded into it, and is optimized for recording many rows with the same columns. (With `unknownColumns` set to `add`, it is not limited to storing just the columns from the first data, but those are the only ones for which fast storage is pre-allocated.)
On a fast machine, it can record several million rows per second from a 10-column CSV file.
A new dataset of type `tabular` named `<id>` can be created as follows:
    mldb.put("/v1/datasets/"+<id>, {
        "type": "tabular",
        "params": {
            "unknownColumns": <UnknownColumnAction>
        }
    })
with the following key-value definitions for `params`:
Field, Type, Default | Description |
---|---|
unknownColumns, UnknownColumnAction, "error" | Action to take on unknown columns. Values are `ignore`, `error` (default), or `add`, which allows an unlimited number of sparse columns to be added. |
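As a minimal sketch of the request above, the helper below builds the `params` payload and rejects values outside the three documented actions before anything is sent. The names `tabular_dataset_config` and `create_tabular_dataset` are hypothetical, and `mldb` is assumed to be a connection object exposing the same `put(path, body)` call used in this page; nothing here is part of MLDB itself.

```python
# Documented values for the unknownColumns parameter.
VALID_ACTIONS = ("ignore", "error", "add")


def tabular_dataset_config(unknown_columns="error"):
    """Build the request body for a tabular dataset (hypothetical helper)."""
    if unknown_columns not in VALID_ACTIONS:
        raise ValueError(
            "unknownColumns must be one of %r, got %r"
            % (VALID_ACTIONS, unknown_columns)
        )
    return {
        "type": "tabular",
        "params": {"unknownColumns": unknown_columns},
    }


def create_tabular_dataset(mldb, dataset_id, unknown_columns="error"):
    """Issue the PUT shown above.

    Assumption: `mldb` exposes put(path, body), as in the example on this page.
    """
    return mldb.put("/v1/datasets/" + dataset_id,
                    tabular_dataset_config(unknown_columns))
```

A call such as `create_tabular_dataset(mldb, "example", "add")` would then send the same body as the snippet above with `unknownColumns` set to `add`; the dataset name `example` is a placeholder.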
The tabular dataset supports storing non-uniform data, such as that which comes from a JSON file with varying fields. This is enabled by setting the `unknownColumns` field to `add`. The possible values are listed below:
UnknownColumnAction
Value | Description |
---|---|
ignore | Unknown columns will be ignored |
error | Unknown columns will result in an error |
add | Unknown columns will be added as a sparse column |
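The three actions in the table can be illustrated with a toy model in plain Python. This is only a sketch of the documented semantics, not MLDB's implementation: the function `store_row`, the set of known columns, and the dict-based row are all assumptions made for illustration.

```python
def store_row(known_columns, row, action="error"):
    """Return the row as it would be kept under the given action (toy model).

    known_columns: set of column names learned from the first data.
    row: dict mapping column name to value.
    """
    unknown = [name for name in row if name not in known_columns]
    if action == "ignore":
        # Unknown columns are dropped; known ones are kept.
        return {name: value for name, value in row.items()
                if name in known_columns}
    if action == "error":
        # Any unknown column rejects the row.
        if unknown:
            raise KeyError("unknown columns: %s" % ", ".join(sorted(unknown)))
        return dict(row)
    if action == "add":
        # Unknown columns are kept (in MLDB, as sparse columns).
        return dict(row)
    raise ValueError("unsupported action: %r" % (action,))


known = {"a", "b"}
row = {"a": 1, "c": 3}
print(store_row(known, row, "ignore"))  # {'a': 1}
print(store_row(known, row, "add"))     # {'a': 1, 'c': 3}
```

With `action="error"`, the same row raises an exception in this model, since `c` is not among the known columns.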
The tabular dataset has the following limitations:
`csv.export` procedure type.