The SQLite Sparse Dataset is a dataset that stores its data on local files inside the MLDB container using the popular SQLite3 database.
This dataset has the following characteristics:
The main disadvantage of the SQLite Sparse Dataset is that it is slow to
record data in to it and it uses quite a bit of disk space. Typically, no
more than a few thousand rows can be recorded per second (although this may
be higher on systems with mldb_data
mounted on an SSD) and the size of the
file on disk will be significantly larger than a text file containing the
data inserted.
It is used internally in MLDB to store internal operational data.
This dataset cannot be used to access or manipulate tables in SQLite databases that it did not create.
A new dataset of type sqliteSparse
named <id>
can be created as follows:
mldb.put("/v1/datasets/"+<id>, {
"type": "sqliteSparse",
"params": {
"dataFileUrl": <Url>
}
})
with the following key-value definitions for params
:
Field, Type, Default | Description |
---|---|
dataFileUrl | URI (must be file://) under which the database data lives |
The dataFileUrl
parameter must be a file://
URL, as SQLite only works
with local files. If the parameter is empty, then a temporary, in-memory
database will be used that is not persisted to disk.
The database includes three tables: vals
, rows
and cols
.
The rows
table has three columns: rowNum
, rowHash
and rowName
. Each
row in this table describes a row in the sparse dataset.
The cols
table has three columns: colNum
, colHash
and colName
. Each
row in this table describes a column in the sparse dataset.
The vals
table contains the actual data inserted into the table. Each value
recorded is a separate row. The columns are rowNum
(a small integer giving
the index into the rows
table), colNum
(a small integer giving the index
into the cols
table), val
(a JSON-encoded string of the value in the cell)
and ts
(timestamp, the integral number of milliseconds since the eopch).
The rows
and cols
tables exist as an optimization, to avoid recording the
same row and column names over and over.