An SQL Select expression is used to transform a set of input rows into a set of output rows, by applying expressions across the set of columns in the rows. The input and output columns are both named and there are separate parts of the expression to choose the column name and to choose the value of the column.
Since MLDB is a sparse database designed for datasets with up to millions of columns, there is extra syntax which helps you to manipulate groups of columns together.
The following basic syntax (as supported in standard SQL) can be used as normal:
<value-expr> AS <name>clauses will select each expression, and create a matching column in the output with the corresponding name.
value-expris a Value Expression
namemust be a valid column name: it must be in double-quotes (
") if it clashes with a reserved word or doesn't start with a letter and contain only ASCII letters, numbers and underscore characters.
AS <name>portion can be omitted, in which case a name will be automatically generated
In addition, non-SQL-standard
<name>: <value-expr> are also accepted in Select Expressions.
There are also extensions that make it easier to work with millions of columns. The following are accepted within the comma-delimited list of clauses in a select expression:
*will select all columns and copy them into the output.
* EXCLUDING (col1)will select and copy all columns except for the one called
col1to the output.
* EXCLUDING (col1, col2)will select and copy all but
y*will select all columns starting with a
* EXCLUDING (col1, x*)will select and copy all, apart from those called
col1or those whose name starts with an
y* AS z*will select all columns starting with a
y, and change the
zin the output column name.
y* EXCLUDING (yellow) AS z*will do the same thing, but will exclude the column
yellowfrom the output.
MLDB is different from traditional SQL databases in that it supports millions of columns, and columns may be very sparse as there is no enforced schema around the row.
In some instances, it may be advantageous to select columns based upon an
SQL expression rather than based on pattern matching on the name. This is
possible using a column expression, which effectively creates a table with
all columns in it and allows a
SELECT to run over that table to choose the
The syntax looks like this:
COLUMN EXPR ( AS <name> WHERE <value-expr> ORDER BY <order-by-expr> OFFSET <int> LIMIT <int>)
The components of the expression are as follows:
ASclause is a string-valued expression that gives the name of the resulting column. If not specified, it defaults to columnName().
WHEREclause is a boolean-coerced Value Expression that is used to decide if a column will be selected or not. If not specified, it defaults to true.
ORDER BYclause is an Order-By Expression (which only makes sense if
LIMITare used) ranks the columns in the given order, presumably to only take a subset of them.
OFFSET nclause will skip the top
ncolumns that came through the
ORDER BYclause. By default,
nis zero: no rows are skipped.
LIMIT mclause will select only the top
mcolumns. By default
mis infinity: there is no limit.
As an example, to select up to 1,000 columns with the most rows in them, but none that has less than 100, you would use
COLUMN EXPR (WHERE rowCount() > 100 ORDER BY rowCount() DESC, columnName() LIMIT 1000)
Note that this syntax is not part of SQL, it is an MLDB extension.
The following functions are available in the context of a column expression:
columnName()is the name of the column under consideration. It is the same as the
columnPath()elements concatenated with a
.character. Note that using
ORDER BYclause is only useful when combined with
OFFSET. It does not order the columns of the output.
columnPath()is the structured path to the column under consideration.
columnPathElement(n)is the nth element of the column path of the column under consideration. Negative indexing is supported, meaning that if n is less than zero, it will be a distance from the end (for example, -1 is the last element, -2 is the second to last element). For a columnName of
columnPathElement(2)is equivalent to
columnPathElement(-1)which will be
2. If n is bigger than the number of elements in the column path, NULL will be returned which results in an error since column names cannot be of type NULL.
columnPathLength()is the number of elements in the column path.
value()is the value of the column.
rowCount()is the number of rows that have a value for this column, including explicit NULLs.
When you have structured data, by default COLUMN EXPR will process every atomic column in the flattened
representation. If you want to process only the top-most columns in the structured data, you can add the
STRUCTURED keyword to the column expression. For example:
is a structured two-dimensional array. So the expression
COLUMN EXPR (SELECT 1) FROM (SELECT [[2,3],[4,5]])
will return 4 columns while the expression
COLUMN EXPR STRUCTURED (SELECT 1) FROM (SELECT [[2,3],[4,5]])
will return a single column. This is useful when you want to pass structured data to functions. For example:
SELECT COLUMN EXPR STRUCTURED (SELECT norm(value(), 2)) FROM (SELECT [2,3],[4,5])
will select the L2 norm of both vectors.
It is possible to filter out rows based on the value of an expression using the
DISTINCT ON optional
clause. The syntax is as follows:
SELECT DISTINCT ON (<value-expr>) <value-expr>, <value-expr>, [...] FROM <from-expression> ORDER BY <order-by-expr>
The value expression of the
DISTINCT ON clause must match the left-most clause of the
ORDER BY expression. This will
filter out rows so that no two rows have the same value for the
DISTINCT ON clause. For example:
SELECT DISTINCT ON (x) x,y FROM dataset ORDER BY x,y
will return the values
y of each row in
dataset, but will only return the first row for each unique value of