SQL Select Expressions

An SQL Select expression is used to transform a set of input rows into a set of output rows, by applying expressions across the set of columns in the rows. The input and output columns are both named and there are separate parts of the expression to choose the column name and to choose the value of the column.

Since MLDB is a sparse database designed for datasets with up to millions of columns, there is extra syntax which helps you to manipulate groups of columns together.

Basic Syntax

The following basic syntax (as supported in standard SQL) can be used as normal:

In addition, non-SQL-standard <name>: <value-expr> are also accepted in Select Expressions.

There are also extensions that make it easier to work with millions of columns. The following are accepted within the comma-delimited list of clauses in a select expression:

Selecting columns programatically via a column expression

MLDB is different from traditional SQL databases in that it supports millions of columns, and columns may be very sparse as there is no enforced schema around the row.

In some instances, it may be advantageous to select columns based upon an SQL expression rather than based on pattern matching on the name. This is possible using a column expression, which effectively creates a table with all columns in it and allows a SELECT to run over that table to choose the ones required.

The syntax looks like this:

COLUMN EXPR ( AS <name> WHERE <value-expr> ORDER BY <order-by-expr> OFFSET <int> LIMIT <int>)

The components of the expression are as follows:

As an example, to select up to 1,000 columns with the most rows in them, but none that has less than 100, you would use

COLUMN EXPR (WHERE rowCount() > 100 ORDER BY rowCount() DESC, columnName() LIMIT 1000)

Note that this syntax is not part of SQL, it is an MLDB extension.

Built-in functions available in column expressions

The following functions are available in the context of a column expression:

Selecting structured columns

When you have structured data, by default COLUMN EXPR will process every atomic column in the flattened representation. If you want to process only the top-most columns in the structured data, you can add the STRUCTURED keyword to the column expression. For example:

SELECT [[2,3],[4,5]])

is a structured two-dimensional array. So the expression

COLUMN EXPR (SELECT 1) FROM (SELECT [[2,3],[4,5]])

will return 4 columns while the expression

COLUMN EXPR STRUCTURED (SELECT 1) FROM (SELECT [[2,3],[4,5]])

will return a single column. This is useful when you want to pass structured data to functions. For example:

SELECT COLUMN EXPR STRUCTURED (SELECT norm(value(), 2)) FROM (SELECT [2,3],[4,5])

will select the L2 norm of both vectors.

Filtering duplicated rows based on an expression

It is possible to filter out rows based on the value of an expression using the DISTINCT ON optional clause. The syntax is as follows:

SELECT DISTINCT ON (<value-expr>) <value-expr>, <value-expr>, [...] FROM <from-expression> ORDER BY <order-by-expr>

The value expression of the DISTINCT ON clause must match the left-most clause of the ORDER BY expression. This will filter out rows so that no two rows have the same value for the DISTINCT ON clause. For example:

SELECT DISTINCT ON (x) x,y FROM dataset ORDER BY x,y

will return the values x and y of each row in dataset, but will only return the first row for each unique value of x.

See also