The Sparse Matrix Dataset is used to store high-cardinality sparse data. It provides reasonable baseline performance across all data types, cardinalities, and data shapes.
It has the following characteristics:

- This dataset type is mutable, and keeps its data only in memory. Saving and loading will come in a future iteration.
- The dataset is transactional. Each row or set of rows atomically becomes visible on commit.
- The dataset is fully indexed. It can efficiently perform both row-based and column-based operations, and it can be transposed.
- The dataset can store atomic types. Rows will be flattened upon storage.
- Each string-based value is stored only once, so longer values such as strings can be stored efficiently, and the same value may appear in many columns at once.
- This is an experimental dataset. It is not guaranteed to remain available or compatible across releases.
A new dataset of type `sparse.mutable` named `<id>` can be created as follows:
```python
mldb.put("/v1/datasets/" + <id>, {
    "type": "sparse.mutable",
    "params": {
        "timeQuantumSeconds": <float>,
        "consistencyLevel": <WriteTransactionLevel>,
        "favor": <TransactionFavor>
    }
})
```
with the following key-value definitions for `params`:

| Field | Type | Description |
|---|---|---|
| timeQuantumSeconds | float | Controls the resolution of timestamps stored in the dataset, in seconds: 1 means one second, 0.001 means one millisecond, 60 means one minute. Higher resolution requires more memory to store timestamps. |
| consistencyLevel | WriteTransactionLevel | Transaction level for reading written values. In the default level, `consistentAfterCommit`, a written value is only guaranteed to be readable after a successful commit. See the table of values below. |
| favor | TransactionFavor | Whether to favor reads or writes. Only has an effect when `consistencyLevel` is set to `consistentAfterWrite`. See the table of values below. |
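For example, here is a sketch of a creation call with concrete parameter values; the dataset name `example` and the chosen values are illustrative, not defaults:

```python
# Create a sparse.mutable dataset tuned for bulk loading (illustrative values).
mldb.put("/v1/datasets/example", {
    "type": "sparse.mutable",
    "params": {
        "timeQuantumSeconds": 0.001,                  # millisecond timestamp resolution
        "consistencyLevel": "consistentAfterCommit",  # faster bulk writes
        "favor": "favorReads"                         # only used with consistentAfterWrite
    }
})
```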
The `WriteTransactionLevel` values accepted by `consistencyLevel` are as follows:
| Value | Description |
|---|---|
| consistentAfterWrite | A value written will be available immediately after writing. This provides the most consistency, as operations are serializable, at the expense of slower writes and reads. |
| consistentAfterCommit | A value written will only be guaranteed to be available after a `commit()` call has returned successfully, and may not be readable until that point. This provides much faster write performance and should be used in any batch-insertion scenario. |
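To illustrate the difference, here is a hedged sketch of read-after-write behavior under `consistentAfterCommit`, using the standard row-recording and commit REST routes; the dataset name `example` and the row data are made up:

```python
# Record a single row; under consistentAfterCommit it may not yet be
# visible to queries at this point.
mldb.post("/v1/datasets/example/rows", {
    "rowName": "row1",
    "columns": [["x", 1, 0]]   # [column, value, timestamp]
})

# After a successful commit, the row is guaranteed to be readable.
# Under consistentAfterWrite, it would have been readable immediately.
mldb.post("/v1/datasets/example/commit")
```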
The `TransactionFavor` values accepted by `favor` are as follows:
| Value | Description |
|---|---|
| favorReads | Values will be written in an indexed manner that favors read speed over write speed. This slows down writes but makes reads fast. |
| favorWrites | Values will be written quickly in a non-indexed manner that favors write speed over read speed. Values written will still be readable, but reads may take longer, as no indexes are maintained on recent writes. |
The dataset is transactional, which means that each record operation atomically becomes visible once it completes.
It is best to record to the dataset in chunks of 1,000 to 100,000 rows at a time, to avoid accumulating too many separate, individually visible chunks of data at any one time; see the sketch after the next paragraph.
The `commit` operation will cause the dataset to optimize its internal storage for maximum query speed. This should be done once the entire dataset has been recorded, or infrequently during recording. Note that the commit operation can take several seconds on a large dataset and will block all writes (but not reads) while it is taking place; the blocked writes will complete once the commit operation is done.
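As a hedged sketch of this loading pattern, assuming the Python plugin API's `mldb.create_dataset` and the dataset object's `record_rows` and `commit` methods; the dataset id, chunk size, and data are illustrative:

```python
import time

# Create a mutable sparse dataset through the Python plugin API.
ds = mldb.create_dataset({"id": "bulk_example", "type": "sparse.mutable"})

ts = time.time()   # timestamp in epoch seconds
CHUNK = 10000      # within the suggested 1,000 to 100,000 row range

rows = []
for i in range(100000):
    # Each entry is [rowName, [[column, value, timestamp], ...]]
    rows.append(["row%d" % i, [["value", i, ts]]])
    if len(rows) == CHUNK:
        ds.record_rows(rows)  # one atomic, individually visible record operation
        rows = []

if rows:
    ds.record_rows(rows)      # record the final partial chunk

ds.commit()  # optimize internal storage once the whole dataset is recorded
```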