Intro to Datasets

Datasets are schema-less, append-only named sets of data points, which are contained in cells, which sit at the intersection of rows and columns. Data points are composed of a value and a timestamp. Each data point can thus be represented as a (row, column, timestamp, value) tuple, and datasets can be thought of as sparse 3-dimensional matrices. The rows and columns of a dataset can be interchanged by transposing the matrix. Datasets can contain billions of data points arranged in millions of rows with millions of columns.

Rows and columns have structured identifiers, called paths. A path is an array of UTF-8 strings without null characters called path elements (e.g. ['a','b']) and can be represented as strings called names with dots between the path elements (e.g. 'a.b'). Dots can appear in path elements, causing their string representation to be surrounded by double-quote characters, which can themselves appear in path elements, and are doubled up in the string representation: the path ['a.b', '"hello"'] is stringified as "a.b"."""hello""". These rules make it possible to manipulate hierarchically-structured (i.e. JSON) data within SQL queries.

Datasets can be created and data can be appended to them via the REST API and they can also loaded from or saved to files via Procedures.

Matrix View

The following diagram shows this as a 3-dimensional matrix, and shows what happens when we "slice" the matrix at a particular point in time.

Sliced Dataset

Ignoring the time dimension, you can imagine that the data looks something like this:

ColumnA

ColumnB

ColumnC

ColumnD

ColumnE

RowA

RowB

123

7634.2

RowC

"hello"

RowD

RowE

Events View

Another way of looking at the data is as a series of events, as follows:

Timestamp

Row Name

Column Name

Value

2013-04-20 10:02:01

User123

First Name

"Bob"

2013-04-20 10:02:01

User123

Test Score 1

0.78

2013-04-20 10:03:33

User456

First Name

"Jill"

2013-04-20 10:03:33

User456

Test Score 1

0.45

2013-04-22 11:10:22

User123

Test Score 1

0.0

2013-04-22 11:10:22

User123

Revision Reason

"Cheating"

...

...

...

...

The ability to record multiple states of the world at different times is a fundamental part of the design of MLDB as a database for machine learning. The two versions of Bob's score are both recorded, along with the time at which it changed. This allows MLDB to reconstruct what was known at a given point in time, making historical data far more useful to learn from.

Available Dataset Types

Datasets are created via a REST API call with one of the following types:

TypeDescriptionDoc
behMemory-mappable dataset type to efficiently store behavioural data[doc]
beh.binaryMemory-mappable dataset type to efficiently store binary valued behavioural data[doc]
beh.binary.mutableRecordable dataset designed to store binary valued behavioral data[doc]
beh.mutableRecordable dataset designed to store behavioral data[doc]
embeddingDataset to record a set of coordinates per row[doc]
mergedMerges together several datasets into one virtual dataset[doc]
mongodb.datasetDataset type that forwards records to a MongoDB database[doc]
mongodb.recordDataset type that forwards records to a MongoDB database[doc]
sampledDataset that samples another dataset[doc]
sparse.mutableSparse dataset which can be recorded to[doc]
sqliteSparseSQLite-backed fully consistent, persistent, mutable sparse database[doc]
tabularDense dataset which can be recorded to[doc]
transposedDataset that interchanges rows and columns[doc]
unionUnify together several datasets[doc]

See also