SQL Value Expressions

An SQL Value Expression is the basic building block of an SQL query: it is an expression that evaluates to a single value. A value expression is either a constant, a column reference, or the result of applying operators and/or functions to constants and/or column references.

Constants

A value expression can include literal representations of constants such as numbers and strings.

Column references

To refer to a column, you use its name, which is the string representation of its path, as explained in the Intro to Datasets, and which must be written in accordance with the quoting rules. So to read the value of the column x and add one, use x + 1.

Unlike in conventional SQL, references to non-existent columns are always evaluated as NULL. A common mistake is to use double-quotes to represent a string, which usually results in a reference to a non-existent column, and therefore NULL.

Referring to columns by dataset

When querying a single dataset, it is allowed but unnecessary to specify the name of that dataset when referring to a column. However, in some cases the query will run on several datasets (see From Expression). In those cases, you need to specify the name or alias of the dataset the desired column is from, using the . operator. For example, to refer to the column y from a dataset whose name or alias is x, you must use x.y in the value expression.

If the dataset has been aliased (e.g. FROM dataset AS x), you must use the alias x instead of the original name dataset.

Operators

The following standard SQL operators are supported by MLDB. An operator with a lower precedence value binds tighter than one with a higher precedence value, so for example x + y * z is the same as x + (y * z). Expressions at the same precedence level are always left-associative, that is, the expression x / y % z is evaluated as (x / y) % z.

Operator Type Precedence
. indirection 0
@ timestamp association 0
~ unary arithmetic 1
* , / , % binary arithmetic 2
+ , - unary arithmetic 3
+ , - binary arithmetic 3
& , | , ^ binary bitwise 3
= , !=, > , < , >= , <= binary comparison 4
NOT unary boolean 5
AND binary boolean 6
OR binary boolean 7
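The precedence and associativity rules above are shared with standard SQL for these operators, so they can be illustrated with sqlite3 from Python's standard library (a stand-in for demonstration only; this is not MLDB itself):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# * (precedence 2) binds tighter than + (precedence 3),
# so 2 + 3 * 4 is evaluated as 2 + (3 * 4)
mul_first, = con.execute("SELECT 2 + 3 * 4").fetchone()

# Same-precedence operators associate left to right:
# 8 / 4 % 3 is evaluated as (8 / 4) % 3, not 8 / (4 % 3)
left_assoc, = con.execute("SELECT 8 / 4 % 3").fetchone()

print(mul_first)   # 14
print(left_assoc)  # 2
```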

Operators on time values

Timestamps and time intervals have specific rules when using binary operators. Here are the supported operators and the types that will result from each operation:

Operator Left hand Value Right Hand Value Resulting type
+ , - Timestamp Number* Timestamp
+ , - Timestamp Time Interval Timestamp
+ , - Time Interval Number* Time Interval
+ , - Time Interval Time Interval Time Interval
* , / Time Interval Number Time Interval

*When used in conjunction with Timestamps or Time Intervals, Numbers implicitly represent days.

Note that the operators + and * are commutative in all cases.
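The semantics in the table, including the numbers-as-days convention, can be sketched with Python's datetime module (an illustration of the rules described above, not MLDB syntax):

```python
from datetime import datetime, timedelta

# When a Number meets a Timestamp or Time Interval, it implicitly
# represents a count of days.
def days(n):
    return timedelta(days=n)

ts = datetime(2024, 1, 10)

# Timestamp + Number -> Timestamp (number treated as days)
shifted = ts + days(2)
print(shifted)  # 2024-01-12 00:00:00

# Timestamp + Time Interval -> Timestamp
print(ts + timedelta(hours=6))  # 2024-01-10 06:00:00

# Time Interval * Number -> Time Interval (plain scaling, no day conversion)
print(timedelta(hours=2) * 3)   # 6:00:00
```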

BETWEEN expressions

SQL BETWEEN expressions are a shorthand way of testing for membership in a closed interval. The expression x BETWEEN y AND z is the same as x >= y AND x <= z except that the x expression will only be evaluated once. It has the same precedence as the binary comparisons (= , !=, > , < , >= , <=).
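Because x BETWEEN y AND z expands to x >= y AND x <= z, both endpoints are included. This is standard SQL behavior, shown here with sqlite3 rather than MLDB:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Both endpoints of the interval match; a value outside it does not.
lo_edge, hi_edge, outside = con.execute(
    "SELECT 1 BETWEEN 1 AND 10, 10 BETWEEN 1 AND 10, 11 BETWEEN 1 AND 10"
).fetchone()
print(lo_edge, hi_edge, outside)  # 1 1 0
```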

CASE expressions

SQL CASE expressions are used to return different expressions depending upon the value or truth of an expression. There are two flavors:

Simple CASE expressions, which look like

CASE expr
  WHEN val1 THEN result1
  WHEN val2 THEN result2
  ELSE result3
END

for example,

CASE x % 2
  WHEN 0 THEN 'even'
  ELSE 'odd'
END

Matched CASE expressions, which look like

CASE
  WHEN boolean1 THEN result1
  WHEN boolean2 THEN result2
  ELSE result3
END

for example,

CASE
  WHEN x % 15 = 0 THEN 'multiple of 5 and 3'
  WHEN x % 5 = 0 THEN 'multiple of 5'
  WHEN x % 3 = 0 THEN 'multiple of 3'
  ELSE 'very approximately prime'
END

In both cases, there can be an arbitrary number of WHEN clauses, and the ELSE clause is optional. If no ELSE clause is present and no WHEN clause matches, the result is null.

CAST expressions

SQL CAST expressions allow the type of an expression to be coerced into another type. The main use is to convert between strings and numbers. See also the MLDB Type System.

The syntax is

CAST (expression AS type)

where expression is any SQL value expression, and type is one of the following:

The integer, number and boolean conversions will work with strings and other numbers.

The timestamp conversions will work with strings, which MUST be ISO 8601 strings, and numbers, which are assumed to represent seconds since the 1st of January, 1970, GMT.

A NULL value will always cast to a NULL value. In addition, if it is not possible to convert a value, then a NULL will be returned.
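The CAST rules described above (NULL in gives NULL out, failed conversions give NULL, and numeric timestamps count seconds since the epoch) can be sketched in Python. This is a hypothetical illustration of the described semantics, not MLDB's implementation:

```python
from datetime import datetime, timezone

def cast_integer(value):
    """Sketch of CAST (x AS integer): NULL propagates, failures yield NULL."""
    if value is None:
        return None
    try:
        return int(value)
    except (TypeError, ValueError):
        return None  # impossible conversions produce NULL

def cast_timestamp(value):
    """Sketch of CAST (x AS timestamp)."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        # Numbers represent seconds since the 1st of January, 1970, GMT
        return datetime.fromtimestamp(value, tz=timezone.utc)
    try:
        return datetime.fromisoformat(value)  # strings must be ISO 8601
    except (TypeError, ValueError):
        return None

print(cast_integer("42"))            # 42
print(cast_integer("not a number"))  # None
print(cast_timestamp(0).year)        # 1970
```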

IS [NOT] expressions

These expressions are used to test the type or value of an expression. They bind tightly, that is to say that x + 1 IS NOT NULL would be interpreted as x + (1 IS NOT NULL), which is probably not what was intended. See also the MLDB Type System.

[NOT] IN expression

This expression tests if the value in the left hand side is (or is not) included in a set of values on the right hand side. There are four ways to specify the set on the right hand side:

  1. As a sub-select (x IN (SELECT ...))
  2. As an explicit tuple (x IN (val1, val2, ...))
  3. As the keys of a row expression (x IN (KEYS OF expr))
  4. As the values of a row expression (x IN (VALUES OF expr))

The first two are standard SQL; the second two are MLDB extensions and are made possible by MLDB's sparse data model. It has the same precedence as the unary not (NOT).
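The two standard forms can be demonstrated with sqlite3 (the KEYS OF and VALUES OF forms are MLDB extensions and are not available there):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dataset (x INTEGER)")
con.executemany("INSERT INTO dataset VALUES (?)", [(3,), (5,), (7,)])

# Form 1: membership in the result of a sub-select
in_select, = con.execute("SELECT 5 IN (SELECT x FROM dataset)").fetchone()

# Form 2: membership in an explicit tuple
in_tuple, = con.execute("SELECT 4 IN (3, 5, 7, 11)").fetchone()

print(in_select, in_tuple)  # 1 0
```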

IN expression with sub-select

The right hand side can be the result of a sub SELECT statement. For example expr IN (SELECT x FROM dataset) will test if the value expressed by expr is equal to any of the values in the x column of the dataset. If the SELECT statement returns more than a single column, they will all be tested (this differs from standard SQL, which ignores all but the first column; the difference is due to MLDB's sparse column model).

IN expression with explicit tuple expression

For example: expr IN (3,5,7,11)

IN (KEYS OF ...) expression

For example: expr IN (KEYS OF tokenize('sentence'))

That will evaluate to true if expr is a word within the given sentence.

IN (VALUES OF ...) expression

For example: expr IN (VALUES OF [3, 5, 7, 11])

is equivalent to expr IN (3, 5, 7, 11), but allows a full row expression to be used to construct the set, rather than enumerating tuple elements.

[NOT] LIKE expression

This expression tests if a string on the left-hand side matches an SQL wildcard pattern on the right hand side.

The % character will substitute for 0 or more characters. For example: x LIKE 'abc%' will test if x is a string that starts with abc.

The _ character will substitute for a single character. For example: x LIKE 'a_a' will test if x is a three-character string that starts and ends with a.

For more intricate patterns, you can use the regex_match function.

This expression has the same precedence as the unary not (NOT).
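The two wildcards follow standard SQL LIKE semantics, shown here with sqlite3 rather than MLDB:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# % matches 0 or more characters; _ matches exactly one character.
prefix, single, no_match = con.execute(
    "SELECT 'abcdef' LIKE 'abc%', 'aba' LIKE 'a_a', 'abba' LIKE 'a_a'"
).fetchone()
print(prefix, single, no_match)  # 1 1 0
```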

Calling Functions

Built-in functions (see below for a list) accept multiple arguments of any type, return a single value of any type, and are applied by name with parameters in parentheses, for example:

built_in_function(1, 'a')

User-defined functions are applied in the same way except that they always accept a single row-valued input value as an argument and return a single row-valued output, for example:

user_defined_function( {some_number: 1, some_string: 'a'} )

A user-defined function can also accept the row returned by another user-defined function, for example:

user_defined_function_a(user_defined_function_b( {some_number: 1, some_string: 'a'} ))

Furthermore, since it is frequently necessary to access a subset of the columns from the output of a user-defined function, their application can be followed by an accessor in square brackets, for example:

user_defined_function( {some_number: 1, some_string: 'a'} )[ <accessor> ]

Let's look at a hypothetical user-defined function with name example whose type defined the following input and output values:

Accessing the sum_scaled_y output value would look like:

example( {x: 10, y: [1, 2, 3]} )[sum_scaled_y]

Accessing a row containing only the sum_scaled_y and input_length output values would look like:

example( {x: 10, y: [1, 2, 3]} )[ {sum_scaled_y, input_length} ]

Note that this syntax is not part of SQL, it is an MLDB extension.

List of Built-in Functions

Dataset-provided functions

These functions are always available when processing rows from a dataset, and will change values on each row under consideration. See the Intro to Datasets documentation for more information about names and paths.

Path manipulation functions

See the Intro to Datasets documentation for more information about names and paths.

Encoding and decoding functions

{
  "a": "b", 
  "c": {"d": "e"}, 
  "f": ["g","h"], 
  "i": [ {"j":"k"}, {"l":"m"} ] 
}

With {arrays: 'parse'} the output will be:

a     c.d   f.0   f.1   i.0.j   i.1.l
'b'   'e'   'g'   'h'   'k'     'm'

With {arrays: 'encode'} the output will be:

a c.d f.g f.h i.0 i.1
'b' 'e' 1 1 '{"j":"k"}' '{"l":"m"}'
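The {arrays: 'parse'} flattening can be sketched in Python: nested objects and arrays become dotted column paths, exactly reproducing the first table above (this is a simplification for illustration, not MLDB's parse_json implementation):

```python
import json

def flatten(value, prefix=""):
    """Flatten nested JSON into dotted column names, arrays by index."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = ((str(i), v) for i, v in enumerate(value))
    else:
        return {prefix: value}
    cols = {}
    for key, v in items:
        path = f"{prefix}.{key}" if prefix else key
        cols.update(flatten(v, path))
    return cols

doc = json.loads('{"a":"b","c":{"d":"e"},"f":["g","h"],"i":[{"j":"k"},{"l":"m"}]}')
print(flatten(doc))
# {'a': 'b', 'c.d': 'e', 'f.0': 'g', 'f.1': 'h', 'i.0.j': 'k', 'i.1.l': 'm'}
```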

The full set of options to the parse_json function are as follows:

Field, Type, Default, Description

ignoreErrors
bool
false

If true, errors in the JSON are ignored and the element with an error will be silently ignored. If false (the default), a JSON format error will lead to the function failing with an exception.

arrays
JsonArrayHandling
"parse"

Describes how arrays are encoded in the JSON output. For 'parse' (default), the arrays become structured values. For 'encode', arrays containing atoms are sparsified, with the values representing one-hot keys and boolean true values.

and the possible values for the arrays field are:

Enumeration JsonArrayHandling

Value, Description
parse

Arrays will be parsed into nested values

encode

Arrays will be encoded as one-hot values

Numeric functions

expression result
quantize(2.17, 0.001) 2.17
quantize(2.17, 0.01) 2.17
quantize(2.17, 0.1) 2.2
quantize(2.17, 1) 2
quantize(2.17, 10) 0
quantize(-0.1, 1) 0
quantize(0, 10000) 0
quantize(217, 0.1) 217
quantize(217, 1) 217
quantize(217, 10) 220
quantize(217, 100) 200
quantize(-217, 100) -200
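One plausible implementation consistent with the table above is rounding to the nearest multiple of the granularity (this is a sketch inferred from the examples, not MLDB's actual code):

```python
def quantize(x, granularity):
    """Round x to the nearest multiple of granularity."""
    return round(x / granularity) * granularity

print(quantize(217, 10))    # 220
print(quantize(-217, 100))  # -200
print(quantize(2.17, 10))   # 0
```

Note that for fractional granularities the result carries ordinary floating-point error (quantize(2.17, 0.1) is 2.2 only to within rounding tolerance).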

More details on the Binomial proportion confidence interval Wikipedia page.

Constant functions

The following functions return numerical constants:

String functions

Timestamp functions

Set operation functions

Vector space functions

Geographical functions

The following functions operate on latitudes and longitudes and can be used for calculations involving locations on Earth:

Signal processing functions

The following functions provide digital signal processing capabilities:

Image processing functions

The following functions provide image processing capabilities:

Field, Type, Default, Description

Every field below has type ExpressionValue and default value [null,"-Inf"]:

Field                 Description
cameraMake            Camera Make
cameraModel           Camera Model
software              Software
bitsPerSample         Bits Per Sample
imageWidth            Image Width
imageHeight           Image Height
imageDescription      Image Description
imageOrientation      Image Orientation
imageCopyright        Image Copyright
imageDateTime         Image date/time
originalDateTime      Image original date/time
digitizedDateTime     Image digitized date/time
subsecondTime         Subsecond time
exposureTime          Exposure time t: 1/t s
fStop                 F-stop x: f/x
isoSpeed              ISO Speed
subjectDistance       Subject Distance in meters
exposureBias          Exposure bias in EV
flashUsed             Flash used
meteringMode          Metering mode
lensFocalLength       Lens focal length
focalLength35mm       35mm focal length
gpsLat                GPS latitude
gpsLon                GPS longitude
gpsAltitude           GPS altitude in meters
gpsPrecision          GPS Precision (DOP)
lensMinFocalLength    Lens min focal length in mm
lensMaxFocalLength    Lens max focal length in mm
lensFstopMin          Lens f-stop min x: f/x
lensFstopMax          Lens f-stop max x: f/x
lensMake              Lens make
lensModel             Lens model
focalPlaneXres        Focal plane XRes
focalPlaneYres        Focal plane YRes

Blob functions

The following functions are specific to blob data:

Web data functions

The following functions are used to extract and process web data.

fetcher(str)

Fetches resources from a given file or URL. It acts as the default version of function fetcher. It returns two output columns:

Example

The following query will use fetcher to return the country code from an IP address from an external web service.

SELECT CAST (fetcher('http://www.geoplugin.net/json.gp?ip=158.245.13.123')[content] AS STRING)

Limitations

extract_domain(str, {removeSubdomain: false})

Extracts the domain name from a URL. Setting the option removeSubdomain to true will return only the domain without the subdomain. Note that the string passed in must be a complete and valid URL. If a scheme (http://, etc) is not present, an error will be thrown.
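A naive sketch of this behavior using Python's urllib.parse (not MLDB's implementation; in particular, the removeSubdomain branch here keeps only the last two labels, which is wrong for suffixes like .co.uk, so a real implementation needs a public-suffix list):

```python
from urllib.parse import urlparse

def extract_domain(url, remove_subdomain=False):
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        # Mirrors the documented behavior: a missing scheme is an error
        raise ValueError("a complete URL with a scheme is required")
    host = parsed.hostname
    if remove_subdomain:
        # Naive heuristic: keep only the last two labels
        return ".".join(host.split(".")[-2:])
    return host

print(extract_domain("http://blog.example.com/post"))        # blog.example.com
print(extract_domain("http://blog.example.com/post", True))  # example.com
```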

The full set of options to the extract_domain function are as follows:

Field, Type, Default, Description

removeSubdomain
bool
false

Flag to specify whether or not the subdomain is kept.

See also the http.useragent function type that can be used to parse a user agent string.

Data import functions

Parameters to tokenize and token_extract are as follows:

Field, Type, Default, Description

splitChars
string
","

Characters to split on in the tokenization.

quoteChar
string

A single character used to delimit tokens which may contain the splitChars; for example tokenize('a,"b,c"', {quoteChar:'"'}) will return the row {'a':1,'b,c':1}. By default no quoting character is used.

offset
int
0

Skip the first offset tokens of the output (default 0).

limit
int
-1

Only generate limit tokens in the output (default is -1, which means generate all).

value
CellValue

value (if not set to null) will be used instead of token counts for the values of the columns in the output row.

minTokenLength
int
1

Minimum number of characters in a token for it to be output or included as part of an ngram

ngramRange
TUPLE [ int, int ]
[1,1]

Specifies the complexity of n-grams to return, with the first element corresponding to minimum length and the second to maximum length. [1, 1] will return only unigrams, while [2, 3] will return bigrams and trigrams, where tokens are joined by underscores. For example, tokenize('Good day world', {splitChars:' ', ngramRange:[2,3]}) will return the row {'Good_day': 1, 'Good_day_world': 1, 'day_world': 1}.
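The core behavior of tokenize (token counts, minTokenLength and ngramRange) can be sketched in Python; quoteChar, offset, limit and value are omitted for brevity, and this is an illustration rather than MLDB's implementation:

```python
import re
from collections import Counter

def tokenize(text, split_chars=",", min_token_length=1, ngram_range=(1, 1)):
    """Split text on any of split_chars and return token (or n-gram) counts."""
    pattern = "[" + re.escape(split_chars) + "]"
    tokens = [t for t in re.split(pattern, text)
              if len(t) >= min_token_length]
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            counts["_".join(tokens[i:i + n])] += 1
    return dict(counts)

print(tokenize("a,b,a"))
# {'a': 2, 'b': 1}
print(tokenize("Good day world", split_chars=" ", ngram_range=(2, 3)))
# {'Good_day': 1, 'day_world': 1, 'Good_day_world': 1}
```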

Aggregate Functions

The following standard SQL aggregation functions are supported. They may only be used in SELECT and HAVING clauses. If an aggregation function appears in the SELECT clause and no GROUP BY clause is used, an empty GROUP BY clause will be inferred.

The following useful non-standard aggregation functions are also supported:

Aggregates of rows

Every aggregate function can operate on single columns, just like in standard SQL, but they can also operate on multiple columns via complex types like rows and scalars. This has the effect of calculating a separate aggregate for each column in the input, and returns a row-valued result. For example, to calculate the total count of each column in a dataset, the following would suffice:

SELECT count({*})

which would return a row with one count for each column in the dataset. This functionality is useful for writing generic queries that operate without prior knowledge of the column names, and for making queries feasible on datasets with thousands or millions of columns.
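What count({*}) computes can be sketched on a small sparse dataset represented as a list of row dicts: one count per column, skipping missing and NULL values (an illustration only, not MLDB internals):

```python
from collections import Counter

rows = [
    {"x": 1, "y": 2},
    {"x": 3},             # sparse row: no value for y
    {"y": 5, "z": None},  # NULL values are not counted
]

counts = Counter()
for row in rows:
    for col, val in row.items():
        if val is not None:
            counts[col] += 1

print(dict(counts))  # {'x': 2, 'y': 2}
```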

Vertical, Horizontal and Temporal Aggregation

The standard SQL aggregation functions operate 'vertically' down columns. MLDB datasets are transposable matrices, so MLDB also supports 'horizontal' aggregation. In addition, MLDB supports a third, temporal dimension, so 'temporal' aggregation is also supported:
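The vertical/horizontal distinction can be illustrated on a tiny dense matrix: vertical sums run down each column, horizontal sums run across each row (temporal aggregation, which runs along each cell's timeline of values, is omitted here):

```python
matrix = [
    [1, 2, 3],
    [4, 5, 6],
]

# Vertical: aggregate down each column
vertical_sums = [sum(col) for col in zip(*matrix)]

# Horizontal: aggregate across each row
horizontal_sums = [sum(row) for row in matrix]

print(vertical_sums)    # [5, 7, 9]
print(horizontal_sums)  # [6, 15]
```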

Evaluating a JavaScript function from SQL

The SQL function jseval allows for the inline definition of functions using Javascript. This function takes the following arguments:

  1. A text string containing the text of the function to be evaluated. This must be a valid Javascript function body, which returns its result with a return statement, for example return x + y. This must be a constant string, not an expression that is evaluated at run time.
  2. A text string containing the names of all of the parameters that will be passed to the function, as they are referred to within the function, for example x,y. This must also be a constant string.
  3. As many arguments as are listed in part 2, in the same order. These can be any SQL expressions and will be bound to the parameters passed in to the function.

The result of the function will be the result of calling the function on the supplied arguments. This will be converted into a result as follows:

In all cases, the timestamp on the output will be equal to the latest of the timestamps on the arguments passed in to the function.

As an example, to calculate the Fibonacci numbers from SQL (somewhat inefficiently), one could write

SELECT jseval('
function fib(x) {
    if (x == 1) return 1;
    if (x == 2) return 1;
    return fib(x - 1) + fib(x - 2);
}
return fib(i);
', 'i', i)

or to parse a comma separated list of 'key=value' attributes into a row, one could write

SELECT jseval('
var fields = csv.split(",");
var result = {};
for (var i = 0;  i < fields.length;  ++i) {
    var field = fields[i];
    var kv = field.split("=");
    result[kv[0]] = kv[1];
}
return result;
', 'csv', expression_to_generate_csv)

The mldb Javascript object is available from the function; this can notably be used to log to the console to aid debugging. Documentation for this object can be found with the javascript plugin type documentation.

You can also take a look at the Executing JavaScript Code Directly in SQL Queries Using the jseval Function Tutorial for examples of how to use the jseval function.

Handling errors line by line

When an error occurs while processing a query, the whole query fails and no result is returned, even if only a single line caused the error. The try function is meant to handle this type of situation. The first argument is the expression to try to apply. The optional second argument is what will be returned if an error is encountered. It can be any value expression, including other functions and other try functions. If no second argument is given, the error is returned as a string. The try function is analogous to a try/catch block in other programming languages.

Example usage

SELECT try(parse_json('foo'), 'err')

Here, parse_json('foo') will fail. Since the second argument is provided, the value "err" will be returned.

SELECT try(parse_json('foo'))

Again, parse_json('foo') will fail. Since the second argument was left blank, the error message generated by MLDB will be returned.

If the result of the try function is expected to be a row expression, then both arguments supplied must return row expressions, like in the following example:

SELECT try(parse_json('foo'), {}) AS *

As a counter-example, the following two calls will both fail when an error is encountered, because the function will return a string, and strings cannot be used with AS *.

SELECT try(parse_json('foo')) AS *
SELECT try(parse_json('foo'), 'err') AS *

Note that the try function only applies to runtime exceptions, not to syntax errors or bind-time failures.
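Since the try function is analogous to a try/catch block, its contract can be sketched in Python (an illustration of the described behavior, not MLDB's implementation):

```python
import json

def sql_try(thunk, *fallback):
    """Evaluate thunk(); on error return the fallback if given,
    else the error message as a string."""
    try:
        return thunk()
    except Exception as exc:
        if fallback:
            return fallback[0]
        return str(exc)

# With a second argument, the fallback is returned on error
print(sql_try(lambda: json.loads("foo"), "err"))  # err

# On success, the expression's own value is returned
print(sql_try(lambda: json.loads('{"a": 1}')))    # {'a': 1}
```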