This page describes the type system MLDB uses to store and process data.
MLDB's atomic types are the following:
null
(case-insensitive)null
, unless it is impossible for its value to be unknown. Cases of note:
null = null
will evaluate to null
as neither value is known, so the equality is unknown. The way to check for null values is to use the IS
operator: null IS NULL
will evaluate to true
.false AND x
cannot evaluate to true
for any value of x
so this will evaluate to false
even if x
is null
.true OR x
cannot evaluate to false
for any value of x
so this will evaluate to true
even if x
is null
.1
or -3
true
is a synonym for 1
and false
is a synonym for 0
12.2
, 1.22e1
, inf
, nan
(case-insensitive)'
). Single-quote characters within string literals must be doubled (i.e. ''
). Strings are always encoded as UTF-8 Unicode characters.TIMESTAMP
keyword, for example TIMESTAMP '2010-01-02T23:45:33Z'
or TIMESTAMP 1234
. Strings are parsed as ISO8601 strings and require a timezone offset and numbers are treated as seconds since '1970-01-01T00:00:00Z'.INTERVAL
keyword and a string containing a sequence of values
followed by one of the supported time units: second, minute, hour, day, week, month, and year, e.g. INTERVAL '2 day 37 minute'
The time units can be wholly capilalized (e.g. YEAR
), and are always singular.
The following abreviations also work in upper and lower case: 's' for second, 'm' for minute, 'h' for hour, 'd' for day,
'w' for week, and 'y' for years
A single leading -
in the string will reverse the direction of the interval, e.g. INTERVAL '-3 month 2 week'
.base64_decode
function
with a string as its argument.3
inside
{ "x": [ 1, 2, 3 ] }
will be represented by the path x.2
, ie the structure
x
with the 3rd (zero-based) element. Paths are in some ways similar to
strings, but are internally structured as arrays of elements. See the
Intro to Datasets documentation for more
information about names and paths. A path may contain any UTF-8 or ASCII
characters, including control characters, apart from the null character \0
.Every data point stored or manipulated by MLDB has an associated timestamp, even if the data point is itself of type timestamp. Data point timestamps are specified on row creation or as inputs to procedures which create datasets. Literals appearing in queries have a timestamp of -inf
but any value's timestamp can be modified in a query with the special @
operator (which takes the same right-hand value as the TIMESTAMP
keyword). For example, 1 @ '2010-01-02T23:45:33Z'
will have a finite timestamp in 2010.
Within SQL expressions and functions, the type system is more sophisticated. In addition to the types mentioned above, the following are permitted:
{ }
). For example {1 as x, 'a' as y}
or the equivalent {x: 1, y: 'a'}
.Embeddings, which model coordinates in a multi dimensional space. Each is addressed by one or more coordinates, starting at zero and counting up by one.
An embedding can be:
Embeddings can only contain atomic types as elements, not complex types.
Embeddings can appear as literals in queries as comma-delimited
Value Expressions surrounded by square brackets
(i.e. [ ]
). For example [1, 2, 3]
for a vector or
[ [x,0], [0,x] ]
for a diagonal matrix that depends upon the value
of the variable x
(which must be an atomic type).
When comparing rows, MLDB first sorts the columns by name and performs a lexicographical comparison of the column's names and values. To illustrate this, consider these rows:
id | unflattened_row |
---|---|
id_1 | { python : 1, java : 1, c++ : 3 } |
id_2 | { scala : 4, java : 3, c++ : 1 } |
id_3 | { python : 1, ada : 2 } |
The rows are ordered in this way by MLDB when doing comparison:
id | unflattened_row |
---|---|
id_3 | { ada : 2, python : 1 } |
id_2 | { c++ : 1, java : 3, scala : 4 } |
id_1 | { c++ : 3, java : 1, python : 1 } |
Similarly, MLDB uses embedding's values to lexicographical order embeddings.
When a complex type is returned as part of an SQL query result or stored in a dataset, it is flattened into a set of columns with atomic values.
Rows are flattened column by column into the parent row with their existing name, either unprefixed if using the query syntax as *
or prefixed with <prefix>.
if using the query syntax as <prefix>
.
For example, select {x: 1, y: 2} as output, {x: 3, y: 4} as *
yields
output.x | output.y | x | y |
---|---|---|---|
1 | 2 | 3 | 4 |
Embeddings are flattened by creating one column per value, with
the name being an incrementing integer from 0 upwards, prefixed with <prefix>.
if using the query syntax as <prefix>
, otherwise a prefix will be automatically generated.
For example, select [1,2] as x
yields
x.0 | x.1 |
---|---|
4 | 6 |
As a result, sorting by column names where there are more than 9 columns may give strange results, with 10 sorting before 2. This can be addressed at the application level.