Mutable Sparse Matrix Dataset

The Sparse Matrix Dataset is used to store high-cardinality sparse data. It provides reasonable baseline performance across all data types, cardinalities and data shapes.

It is designed for the following situations:

This dataset type is mutable, and only keeps its data in memory. Saving and loading will come in a future iteration.

The dataset is transactional. Each row or set of rows will atomically become visible on commit.

The dataset is fully indexed. It can efficiently perform both row and column based operations, as well as be transposed.

The dataset can store atomic types. Rows will be flattened upon storage.

Each string-based value is stored only once, so longer values such as strings can be stored, and the same value may appear in many columns at once.

This is an experimental dataset. It is not guaranteed to remain available or compatible across releases.

Configuration

A new dataset of type sparse.mutable named <id> can be created as follows:

mldb.put("/v1/datasets/"+<id>, {
    "type": "sparse.mutable",
    "params": {
        "timeQuantumSeconds": <float>,
        "consistencyLevel": <WriteTransactionLevel>,
        "favor": <TransactionFavor>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

timeQuantumSeconds
float
1.0

A number that controls the resolution of timestamps stored in the dataset, in seconds. 1 means one second, 0.001 means one millisecond, 60 means one minute. Higher resolution requires more memory to store timestamps.

consistencyLevel
WriteTransactionLevel
"consistentAfterCommit"

Transaction level for reading of written values. In the default level, which is consistentAfterCommit, a value is only guaranteed to be readable after a commit (so it may seem like data is being lost if read before a commit) but writes are fast. With the consistentAfterWrite level, a written value can immediately be read back but writes are slower.

favor
TransactionFavor
"favorReads"

Whether to favor reads or writes. Only has an effect when consistencyLevel is set to consistentAfterWrite.
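
As a rough illustration of how these parameters fit together, the sketch below builds the configuration payload in plain Python and rejects unknown enumeration values before anything is sent. The helper name `sparse_mutable_config` and the dataset name `example` are hypothetical and not part of MLDB; the defaults are the ones listed above.

```python
# Hypothetical helper (not part of MLDB): builds a "sparse.mutable" payload.
CONSISTENCY_LEVELS = {"consistentAfterWrite", "consistentAfterCommit"}
FAVORS = {"favorReads", "favorWrites"}


def sparse_mutable_config(time_quantum_seconds=1.0,
                          consistency_level="consistentAfterCommit",
                          favor="favorReads"):
    """Return a dataset configuration dict using the documented defaults."""
    if consistency_level not in CONSISTENCY_LEVELS:
        raise ValueError("unknown consistencyLevel: %r" % (consistency_level,))
    if favor not in FAVORS:
        raise ValueError("unknown favor: %r" % (favor,))
    return {
        "type": "sparse.mutable",
        "params": {
            "timeQuantumSeconds": float(time_quantum_seconds),
            "consistencyLevel": consistency_level,
            "favor": favor,
        },
    }


# Millisecond timestamps, values readable right after writing, write-optimised.
config = sparse_mutable_config(0.001, "consistentAfterWrite", "favorWrites")

# Inside MLDB the payload could then be passed on, for example:
#   mldb.put("/v1/datasets/example", config)
```

Validating the two enumerations locally is only a convenience; the server remains the authority on which values it accepts.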

The consistencyLevel values are as follows:

Enumeration WriteTransactionLevel

Value, Description
consistentAfterWrite

A value written will be available immediately after writing. This provides the most consistency as operations are serializable, at the expense of slower writes and reads.

consistentAfterCommit

A value written will only be guaranteed to be available after a `commit()` call has returned successfully, and may not be readable until that point. This provides much faster write performance and should be used in any batch insertion scenario.
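
The difference between the two levels can be pictured with a small toy model. This is only a sketch of the visibility rules described above, not how MLDB is implemented: the class `ToyDataset` and its methods are invented for illustration, and the toy always hides uncommitted rows, which is the worst case that consistentAfterCommit allows.

```python
class ToyDataset:
    """Toy model of write visibility; NOT MLDB's actual implementation."""

    def __init__(self, consistency_level="consistentAfterCommit"):
        self.consistency_level = consistency_level
        self.committed = {}  # rows visible to readers
        self.pending = {}    # rows written but not yet committed

    def record_row(self, name, columns):
        if self.consistency_level == "consistentAfterWrite":
            self.committed[name] = columns  # readable immediately
        else:
            self.pending[name] = columns    # readable only after commit()

    def commit(self):
        # Publish every pending row in one step.
        self.committed.update(self.pending)
        self.pending.clear()

    def get_row(self, name):
        return self.committed.get(name)


batch = ToyDataset("consistentAfterCommit")
batch.record_row("row1", {"x": 1})
before = batch.get_row("row1")  # not yet visible in this toy model
batch.commit()
after = batch.get_row("row1")   # visible once commit() has returned

strict = ToyDataset("consistentAfterWrite")
strict.record_row("row1", {"x": 1})
immediate = strict.get_row("row1")  # visible without any commit
```

In the toy model, a read that comes back empty before `commit()` is not lost data; the row is merely not yet published.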

The favor values are as follows:

Enumeration TransactionFavor

Value, Description
favorReads

Values will be written in an indexed manner that favors read speed over write speed. This will reduce write throughput, but make reads fast.

favorWrites

Values will be written quickly in a non-indexed manner that favors write speed over read speed. Values written will still be readable, but reads may take longer as there are no indexes maintained on recent writes.

Committing

The dataset is transactional, which means that each record operation will atomically become visible once it's completed.

It is better to record to the dataset in chunks of 1,000 to 100,000 rows at a time, to avoid having too many separate, individually visible chunks of data available at any one time.

The commit operation will cause the dataset to optimize its internal storage for maximum query speed. It should be called once the entire dataset has been recorded, or infrequently during recording. Note that the commit operation can take several seconds on a large dataset and will block all writes (but not reads) while it is taking place. Blocked writes will complete once the commit operation is done.
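
The recording pattern described above can be sketched as follows: rows are sent in fixed-size chunks and a single commit is issued at the end. The function `record_in_chunks` is hypothetical, and it takes a `post` callable so it can be tried without a server. Inside MLDB that callable is assumed to be something like `mldb.post`, and the `/multirows` and `/commit` routes and the row layout (`rowName` plus `[column, value, timestamp]` triples) are assumptions to check against the dataset REST documentation.

```python
def record_in_chunks(post, dataset_id, rows, chunk_size=10000):
    """Send rows in chunks, then commit once. Returns the number of chunks sent.

    `post` is any callable taking (route, payload); it is an assumption
    standing in for the real client call.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    base = "/v1/datasets/" + dataset_id
    chunk = []
    sent = 0
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            post(base + "/multirows", chunk)
            sent += 1
            chunk = []
    if chunk:
        # Flush the final, possibly smaller, chunk.
        post(base + "/multirows", chunk)
        sent += 1
    # One commit after everything has been recorded.
    post(base + "/commit", {})
    return sent


# Dry run with a fake `post` that only logs the route and payload size.
calls = []
rows = [{"rowName": "row%d" % i, "columns": [["x", i, 0]]} for i in range(2500)]
n_chunks = record_in_chunks(
    lambda route, payload: calls.append((route, len(payload))),
    "example", rows, chunk_size=1000)
```

Committing once at the end keeps the blocking commit off the write path during recording; for very long loads, an occasional intermediate commit is the other option the text mentions.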

See also