SQL Value Expressions

An SQL Value Expression is the basic building block of an SQL query: it is an expression that evaluates to a single value. A value expression is either a constant, a column reference, or the result of applying operators and/or functions to constants and/or column references.

A value expression can include literal representations of constants, such as numbers and strings.

Column references

To refer to a column, you use its name, which is the string representation of its path, as explained in the Intro to Datasets documentation, following the quoting rules. So to read the value of the column x and add one, use x + 1.

Unlike in conventional SQL, references to non-existent columns are always evaluated as NULL. A common mistake is to use double-quotes to represent a string, which usually results in a reference to a non-existent column, and therefore NULL.
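
A minimal illustration of the two quoting styles (assuming no column named hello exists in the dataset):

```sql
SELECT 'hello'   -- the string constant hello
SELECT "hello"   -- a reference to a column named hello: NULL if it does not exist
```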

Referring to columns by dataset

When querying a single dataset, it is allowed but unnecessary to specify the name of that dataset when referring to a column. However, in some cases, the query will run on several datasets (see From Expression). In those cases, you need to specify the name or alias of the dataset the desired column is from, using the . operator. For example, to refer to the column y from a dataset whose name or alias is x, you must use x.y in the value expression.

If the dataset has been aliased (e.g. FROM dataset AS x), you must use the alias x instead of the original name dataset.


The following standard SQL operators are supported by MLDB. An operator with lower precedence binds tighter than one with higher precedence, so for example x + y * z is the same as x + (y * z). Expressions at the same precedence level are always left-associative, that is, the expression x / y % z is evaluated as (x / y) % z.

| Operator | Type | Precedence |
|---|---|---|
| . | indirection | 0 |
| @ | timestamp association | 0 |
| ~ | unary arithmetic | 1 |
| * , / , % | binary arithmetic | 2 |
| + , - | unary arithmetic | 3 |
| + , - | binary arithmetic | 3 |
| & , \| , ^ | binary bitwise | 3 |
| = , != , > , < , >= , <= | binary comparison | 4 |
| NOT | unary boolean | 5 |
| AND | binary boolean | 6 |
| OR | binary boolean | 7 |

Operators on time values

Timestamps and time intervals have specific rules when using binary operators. Here are the supported operators and the types that will result from each operation:

| Operator | Left hand value | Right hand value | Resulting type |
|---|---|---|---|
| + , - | Timestamp | Number* | Timestamp |
| + , - | Timestamp | Time Interval | Timestamp |
| + , - | Time Interval | Number* | Time Interval |
| + , - | Time Interval | Time Interval | Time Interval |
| * , / | Time Interval | Number | Time Interval |

*When used in conjunction with Timestamps or Time Intervals, Numbers implicitly represent days.

Note that the operators + and * are commutative in all cases.
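
For example, since numbers represent days in this context, adding 1 to a timestamp advances it by one day (the CAST syntax for producing a timestamp from a string is covered in the CAST expressions section):

```sql
SELECT CAST('2024-01-01T00:00:00Z' AS timestamp) + 1     -- one day later
SELECT CAST('2024-01-01T00:00:00Z' AS timestamp) + 1.5   -- one and a half days later
```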

BETWEEN expressions

SQL BETWEEN expressions are a shorthand way of testing for membership in a closed interval. The expression x BETWEEN y AND z is the same as x >= y AND x <= z, except that the x expression will only be evaluated once. It has the same precedence as the binary comparisons (= , != , > , < , >= , <=).
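
For example:

```sql
SELECT 5 BETWEEN 1 AND 10    -- true: equivalent to 5 >= 1 AND 5 <= 10
SELECT 11 BETWEEN 1 AND 10   -- false
SELECT 10 BETWEEN 1 AND 10   -- true: the interval is closed at both ends
```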

CASE expressions

SQL CASE expressions are used to return different expressions depending upon the value or truth of an expression. There are two flavors:

Simple case statements, which look like

CASE expr
  WHEN val1 THEN result1
  WHEN val2 THEN result2
  ELSE result3
END

for example,

CASE x % 2
  WHEN 0 THEN 'even'
  ELSE 'odd'
END

Matched case statements, which look like

CASE
  WHEN boolean1 THEN result1
  WHEN boolean2 THEN result2
  ELSE result3
END

for example,

CASE
  WHEN x % 15 = 0 THEN 'multiple of 5 and 3'
  WHEN x % 5 = 0 THEN 'multiple of 5'
  WHEN x % 3 = 0 THEN 'multiple of 3'
  ELSE 'very approximately prime'
END

In both cases, there are an arbitrary number of WHEN clauses and the ELSE clauses are optional. If no ELSE clause is present and no WHEN clause matches, the result is null.

CAST expressions

SQL CAST expressions allow the type of an expression to be coerced into another type. The main use is to convert between strings and numbers. See also the MLDB Type System.

The syntax is

CAST (expression AS type)

where expression is any SQL value expression, and type is one of the following:

The integer, number and boolean conversions will work with strings and other numbers.

The timestamp conversions will work with strings, which MUST be ISO 8601 strings, and numbers, which are assumed to represent seconds since the 1st of January, 1970, GMT.

A NULL value will always cast to a NULL value. In addition, if it is not possible to convert a value, then a NULL will be returned.
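
For example, following the conversion rules above:

```sql
SELECT CAST ('3' AS integer) + 1        -- 4
SELECT CAST ('abc' AS integer)          -- NULL: the string cannot be converted
SELECT CAST (1420070400 AS timestamp)   -- 2015-01-01T00:00:00Z (seconds since epoch)
```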

IS [NOT] expressions

These expressions are used to test the type or value of an expression. They bind tightly, that is to say that x + 1 IS NOT NULL would be interpreted as x + (1 IS NOT NULL), which is probably not what was intended. See also the MLDB Type System.
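
For example, to test whether the result of an addition is non-null, parentheses are needed:

```sql
SELECT x + 1 IS NOT NULL     -- parsed as x + (1 IS NOT NULL): probably not intended
SELECT (x + 1) IS NOT NULL   -- tests the whole expression
```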

[NOT] IN expression

This expression tests if the value in the left hand side is (or is not) included in a set of values on the right hand side. There are four ways to specify the set on the right hand side:

  1. As a sub-select (x IN (SELECT ...))
  2. As an explicit tuple (x IN (val1, val2, ...))
  3. As the keys of a row expression (x IN (KEYS OF expr))
  4. As the values of a row expression (x IN (VALUES OF expr))

The first two are standard SQL; the second two are MLDB extensions and are made possible by MLDB's sparse data model. It has the same precedence as the unary not (NOT).

IN expression with sub-select

The right hand side can be the result of a sub SELECT statement. For example, expr IN (SELECT x FROM dataset) will test whether the value of expr is equal to any of the values in the x column of the dataset. If the SELECT statement returns more than a single column, they will all be tested (unlike standard SQL, which ignores all but the first column; this extension is made possible by MLDB's sparse column model).

IN expression with explicit tuple expression

For example: expr IN (3,5,7,11)

IN (KEYS OF ...) expression

For example: expr IN (KEYS OF tokenize('sentence'))

That will evaluate to true if expr is a word within the given sentence.

IN (VALUES OF ...) expression

For example: expr IN (VALUES OF [3, 5, 7, 11])

is equivalent to expr IN (3, 5, 7, 11), but allows a full row expression to be used to construct the set, rather than enumerating tuple elements.

[NOT] LIKE expression

This expression tests if a string on the left-hand side matches an SQL wildcard pattern on the right hand side.

The % character will substitute for 0 or more characters. For example: x LIKE 'abc%' will test if x is a string that starts with abc.

The _ character will substitute for a single character. For example: x LIKE 'a_a' will test if x is a three-character string that starts and ends with a.
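
For example:

```sql
SELECT 'abcdef' LIKE 'abc%'   -- true
SELECT 'aba' LIKE 'a_a'       -- true
SELECT 'abca' LIKE 'a_a'      -- false: _ matches exactly one character
```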

For more intricate patterns, you can use the regex_match function.

This expression has the same precedence as the unary not (NOT).

Calling Functions

Built-in functions (see below for a list) accept multiple arguments of any type and return a single value of any type. They are applied by name, with parameters in parentheses, for example:

built_in_function(1, 'a')

User-defined functions are applied in the same way except that they always accept a single row-valued input value as an argument and return a single row-valued output, for example:

user_defined_function( {some_number: 1, some_string: 'a'} )

A user-defined function can also accept the row returned from another user-defined function, for example:

user_defined_function_a(user_defined_function_b( {some_number: 1, some_string: 'a'} ))

Furthermore, since it is frequently necessary to access a subset of the columns from the output of a user-defined function, their application can be followed by an accessor in square brackets, for example:

user_defined_function( {some_number: 1, some_string: 'a'} )[ <accessor> ]

Let's look at a hypothetical user-defined function named example whose type defines the following input and output values:

Accessing the sum_scaled_y output value would look like:

example( {x: 10, y: [1, 2, 3]} )[sum_scaled_y]

Accessing a row containing only the sum_scaled_y and input_length output values would look like:

example( {x: 10, y: [1, 2, 3]} )[ {sum_scaled_y, input_length} ]

Note that this syntax is not part of SQL, it is an MLDB extension.

List of Built-in Functions

Dataset-provided functions

These functions are always available when processing rows from a dataset, and will change values on each row under consideration. See the Intro to Datasets documentation for more information about names and paths.

Path manipulation functions

See the Intro to Datasets documentation for more information about names and paths.

Encoding and decoding functions

For example, applying the parse_json function to the following JSON document:

{
  "a": "b",
  "c": {"d": "e"},
  "f": ["g","h"],
  "i": [ {"j":"k"}, {"l":"m"} ]
}

With {arrays: 'parse'} the output will be:

| a | c.d | f.0 | f.1 | i.0.j | i.1.l |
|---|---|---|---|---|---|
| 'b' | 'e' | 'g' | 'h' | 'k' | 'm' |

With {arrays: 'encode'} the output will be:

| a | c.d | f.g | f.h | i.0 | i.1 |
|---|---|---|---|---|---|
| 'b' | 'e' | 1 | 1 | '{"j":"k"}' | '{"l":"m"}' |

The full set of options to the parse_json function is as follows:

| Field | Type | Default | Description |
|---|---|---|---|
| ignoreErrors | boolean | false | If true, elements with JSON errors are silently ignored. If false (the default), a JSON format error will cause the function to fail with an exception. |
| arrays | JsonArrayHandling | 'parse' | Describes how arrays are encoded in the JSON output. For 'parse' (the default), arrays become structured values. For 'encode', arrays containing atoms are sparsified, with the values representing one-hot keys and boolean true values. |

and the possible values for the arrays field are:

Enumeration JsonArrayHandling

| Value | Description |
|---|---|
| 'parse' | Arrays will be parsed into nested values |
| 'encode' | Arrays will be encoded as one-hot values |

Numeric functions

| Expression | Result |
|---|---|
| quantize(2.17, 0.001) | 2.17 |
| quantize(2.17, 0.01) | 2.17 |
| quantize(2.17, 0.1) | 2.2 |
| quantize(2.17, 1) | 2 |
| quantize(2.17, 10) | 0 |
| quantize(-0.1, 1) | 0 |
| quantize(0, 10000) | 0 |
| quantize(217, 0.1) | 217 |
| quantize(217, 1) | 217 |
| quantize(217, 10) | 220 |
| quantize(217, 100) | 200 |
| quantize(-217, 100) | -200 |

More details can be found on the Binomial proportion confidence interval Wikipedia page.

Constant functions

The following functions return numerical constants:

String functions

Timestamp functions

Set operation functions

Vector space functions

Geographical functions

The following functions operate on latitudes and longitudes and can be used to perform calculations involving locations on Earth:

Signal processing functions

The following functions provide digital signal processing capabilities:

Image processing functions

The following functions provide image processing capabilities:

The metadata values returned include:

- Camera Make
- Camera Model
- Bits Per Sample
- Image Width
- Image Height
- Image Description
- Image Orientation
- Image Copyright
- Image date/time
- Image original date/time
- Image digitized date/time
- Subsecond time
- Exposure time t: 1/t s
- F-stop x: f/x
- ISO Speed
- Subject Distance in meters
- Exposure bias in EV
- Flash used
- Metering mode
- Lens focal length
- 35mm focal length
- GPS latitude
- GPS longitude
- GPS altitude in meters
- GPS Precision (DOP)
- Lens min focal length in mm
- Lens max focal length in mm
- Lens f-stop min x: f/x
- Lens f-stop max x: f/x
- Lens make
- Lens model
- Focal plane XRes
- Focal plane YRes

Blob functions

The following functions are specific to blob data:

Web data functions

The following functions are used to extract and process web data.


Fetches resources from a given file or URL. It acts as the default instance of the fetcher function type. It returns two output columns: content, a blob containing the fetched data, and error, a string containing an error message if the fetch failed.


The following query will use fetcher to return the country code from an IP address from an external web service.

SELECT CAST (fetcher('http://www.geoplugin.net/json.gp?ip=')[content] AS STRING)


extract_domain(str, {removeSubdomain: false})

Extracts the domain name from a URL. Setting the option removeSubdomain to true will return only the domain without the subdomain. Note that the string passed in must be a complete and valid URL. If a scheme (http://, etc) is not present, an error will be thrown.

The full set of options to the extract_domain function is as follows:

| Field | Type | Default | Description |
|---|---|---|---|
| removeSubdomain | boolean | false | Flag to specify whether or not the subdomain is kept. |

See also the http.useragent function type that can be used to parse a user agent string.

Data import functions

Parameters to tokenize and token_extract are as follows:

| Field | Type | Description |
|---|---|---|
| splitChars | string | Characters to split on in the tokenization. |
| quoteChar | string | A single character to delimit tokens which may contain the splitChars, so by default tokenize('a,"b,c"', {quoteChar:'"'}) will return the row {'a':1,'b,c':1}. By default no quoting character is used. |
| offset | int | Skip the first offset tokens of the output (default 0). |
| limit | int | Only generate limit tokens in the output (default is -1, which means generate all). |
| value | any | If set (not null), this value will be used instead of token counts for the values of the columns in the output row. |
| minTokenLength | int | Minimum number of characters in a token for it to be output or included as part of an ngram. |
| ngramRange | TUPLE [ int, int ] | Specifies the complexity of n-grams to return; the first element is the minimum length and the second the maximum length. [1, 1] will return only unigrams, while [2, 3] will return bigrams and trigrams, with tokens joined by underscores. For example, tokenize('Good day world', {splitChars:' ', ngramRange:[2,3]}) will return the row {'Good_day': 1, 'Good_day_world': 1, 'day_world': 1}. |

Aggregate Functions

The following standard SQL aggregation functions are supported. They may only be used in SELECT and HAVING clauses. If an aggregation function appears in the SELECT clause and no GROUP BY clause is used, an empty GROUP BY clause will be inferred.

The following useful non-standard aggregation functions are also supported:

Aggregates of rows

Every aggregate function can operate on single columns, just like in standard SQL, but it can also operate on multiple columns at once when passed a row-valued argument. This has the effect of calculating a separate aggregate for each column in the input, and returns a row-valued result. For example, to calculate the total count of each column in a dataset, the following would suffice:

SELECT count({*})

which would return a row with one count for each column in the dataset. This functionality is useful for writing generic queries that operate without prior knowledge of the column names, and for making queries feasible on datasets with thousands or millions of columns.
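
In the same spirit, other aggregates can be applied across all columns at once. A sketch, assuming a dataset named example and the standard min and max aggregates:

```sql
-- Per-column minima and maxima in a single pass over the dataset:
SELECT min({*}) AS min, max({*}) AS max FROM example
```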

Vertical, Horizontal and Temporal Aggregation

The standard SQL aggregation functions operate 'vertically' down columns. MLDB datasets are transposable matrices, so MLDB also supports 'horizontal' aggregation. In addition, MLDB supports a third, temporal dimension, so 'temporal' aggregation is also supported:
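
A sketch of the three directions, assuming built-ins named horizontal_sum and temporal_min and a dataset named example with columns x and y:

```sql
SELECT sum(x) FROM example                  -- vertical: aggregates the column x down the rows
SELECT horizontal_sum({x, y}) FROM example  -- horizontal: sums across the columns of each row
SELECT temporal_min({x}) FROM example       -- temporal: aggregates over the values a cell has had over time
```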

Evaluating a JavaScript function from SQL

The SQL function jseval allows for the inline definition of functions using Javascript. This function takes the following arguments:

  1. A text string containing the text of the function to be evaluated. This must be a valid Javascript function body, which returns its result with the return statement. For example, return x + y. This must be a constant string; it cannot be an expression that is evaluated at run time.
  2. A text string containing the names of all of the parameters that will be passed to the function, as they are referred to within the function. For example, x,y. This must be a constant string, it cannot be an expression that is evaluated at run time.
  3. As many arguments as are listed in part 2, in the same order. These can be any SQL expressions and will be bound to the parameters passed in to the function.

The result of the function will be the result of calling the function on the supplied arguments. This will be converted into a result as follows:

In all cases, the timestamp on the output will be equal to the latest of the timestamps on the arguments passed in to the function.

As an example, to calculate the Fibonacci numbers from SQL (somewhat inefficiently), one could write

SELECT jseval('
function fib(x) {
    if (x == 1) return 1;
    if (x == 2) return 1;
    return fib(x - 1) + fib(x - 2);
}
return fib(i);
', 'i', i)

or to parse a comma separated list of 'key=value' attributes into a row, one could write

SELECT jseval('
var fields = csv.split(",");
var result = {};
for (var i = 0;  i < fields.length;  ++i) {
    var field = fields[i];
    var kv = field.split("=");
    result[kv[0]] = kv[1];
}
return result;
', 'csv', expression_to_generate_csv)

The mldb Javascript object is available from the function; it can notably be used to log to the console to aid debugging. Documentation for this object can be found with the javascript plugin type documentation.

You can also take a look at the Executing JavaScript Code Directly in SQL Queries Using the jseval Function Tutorial for examples of how to use the jseval function.

Handling errors line by line

When an error occurs while processing a query, the whole query fails and no result is returned, even if only a single row caused the error. The try function is meant to handle this type of situation. The first argument is the expression to try to apply. The optional second argument is what will be returned if an error is encountered; it can be any value expression, including other functions and other try functions. If no second argument is given, the error message is returned as a string. The try function is analogous to a try/catch block in other programming languages.

Example usage

SELECT try(parse_json('foo'), 'err')

Here, parse_json('foo') will fail. Since the second argument is provided, the value "err" will be returned.

SELECT try(parse_json('foo'))

Again, parse_json('foo') will fail. Since the second argument was left blank, the error message generated by MLDB will be returned.
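
Since the fallback may itself be any value expression, try calls can be nested. A sketch, where col1 and col2 are hypothetical columns:

```sql
SELECT try(parse_json(col1), try(parse_json(col2), 'neither column held valid JSON'))
```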

If the result of the try function is expected to be a row expression, then both arguments supplied must return row expressions, like in the following example:

SELECT try(parse_json('foo'), {}) AS *

As a counter example, the following two calls will both fail when an error is encountered, because the function will return a string, and strings cannot be used with AS *.

SELECT try(parse_json('foo')) AS *
SELECT try(parse_json('foo'), 'err') AS *

Note that the try function only applies to runtime exceptions, not to syntax errors or bind-time failures.