An SQL Select expression is used to transform a set of input rows into a set of output rows, by applying expressions across the set of columns in the rows. The input and output columns are both named and there are separate parts of the expression to choose the column name and to choose the value of the column.
Since MLDB is a sparse database designed for datasets with up to millions of columns, there is extra syntax which helps you to manipulate groups of columns together.
The following basic syntax (as supported in standard SQL) can be used as normal:
<value-expr> AS <name> clauses will select each expression, and create a matching column in the output with the corresponding name.
value-expr is a Value Expressionname must be a valid column name: it must be in double-quotes (") if it clashes with a reserved word or doesn't start with a letter and contain only ASCII letters, numbers and underscore characters.AS <name> portion can be omitted, in which case a name will be automatically generatedIn addition, non-SQL-standard <name>: <value-expr> are also accepted in Select Expressions.
There are also extensions that make it easier to work with millions of columns. The following are accepted within the comma-delimited list of clauses in a select expression:
* will select all columns and copy them into the output.* EXCLUDING (col1) will select and copy all columns except for the
one called col1 to the output.* EXCLUDING (col1, col2) will select and copy all but col1 and
col2.y* will select all columns starting with a y* EXCLUDING (col1, x*) will select and copy all, apart from those
called col1 or those whose name starts with an xy* AS z* will select all columns starting with a y, and change
the y to a z in the output column name.y* EXCLUDING (yellow) AS z* will do the same thing, but will
exclude the column yellow from the output.MLDB is different from traditional SQL databases in that it supports millions of columns, and columns may be very sparse as there is no enforced schema around the row.
In some instances, it may be advantageous to select columns based upon an
SQL expression rather than based on pattern matching on the name. This is
possible using a column expression, which effectively creates a table with
all columns in it and allows a SELECT to run over that table to choose the
ones required.
The syntax looks like this:
COLUMN EXPR ( AS <name> WHERE <value-expr> ORDER BY <order-by-expr> OFFSET <int> LIMIT <int>)
The components of the expression are as follows:
AS clause is a string-valued expression that gives the name of the resulting column. If not specified, it defaults to columnName().WHERE clause is a boolean-coerced Value Expression that is used to decide if a column will be selected or not. If not specified, it defaults to true.ORDER BY clause is an Order-By Expression (which only makes sense if OFFSET or LIMIT are used) ranks the columns in the given order, presumably to only take a subset of them.OFFSET n clause will skip the top n columns that came through the ORDER BY clause. By default, n is zero: no rows are skipped.LIMIT m clause will select only the top m columns. By default m is infinity: there is no limit.As an example, to select up to 1,000 columns with the most rows in them, but none that has less than 100, you would use
COLUMN EXPR (WHERE rowCount() > 100 ORDER BY rowCount() DESC, columnName() LIMIT 1000)
Note that this syntax is not part of SQL, it is an MLDB extension.
The following functions are available in the context of a column expression:
columnName() is the name of the column under consideration. It is the same
as the columnPath() elements concatenated with a . character. Note that
using columnName() in an ORDER BY clause is only useful when combined
with LIMIT and/or OFFSET. It does not order the columns of the output.columnPath() is the structured path to the column under consideration.columnPathElement(n) is the nth element of the column path of the column
under consideration. Negative indexing is supported, meaning that if n is less than
zero, it will be a distance from the end (for example, -1 is the last element, -2
is the second to last element). For a columnName of x.y.2, then columnPathElement(0)
will be x, columnPathElement(1) will be y and columnPathElement(2) is equivalent
to columnPathElement(-1) which will be 2. If n is bigger than the number
of elements in the column path, NULL will be returned which results in an error since column
names cannot be of type NULL.columnPathLength() is the number of elements in the column path.value() is the value of the column.rowCount() is the number of rows that have a value for this column, including explicit NULLs.When you have structured data, by default COLUMN EXPR will process every atomic column in the flattened
representation. If you want to process only the top-most columns in the structured data, you can add the
STRUCTURED keyword to the column expression. For example:
SELECT [[2,3],[4,5]])
is a structured two-dimensional array. So the expression
COLUMN EXPR (SELECT 1) FROM (SELECT [[2,3],[4,5]])
will return 4 columns while the expression
COLUMN EXPR STRUCTURED (SELECT 1) FROM (SELECT [[2,3],[4,5]])
will return a single column. This is useful when you want to pass structured data to functions. For example:
SELECT COLUMN EXPR STRUCTURED (SELECT norm(value(), 2)) FROM (SELECT [2,3],[4,5])
will select the L2 norm of both vectors.
It is possible to filter out rows based on the value of an expression using the DISTINCT ON optional
clause. The syntax is as follows:
SELECT DISTINCT ON (<value-expr>) <value-expr>, <value-expr>, [...] FROM <from-expression> ORDER BY <order-by-expr>
The value expression of the DISTINCT ON clause must match the left-most clause of the ORDER BY expression. This will
filter out rows so that no two rows have the same value for the DISTINCT ON clause. For example:
SELECT DISTINCT ON (x) x,y FROM dataset ORDER BY x,y
will return the values x and y of each row in dataset, but will only return the first row for each unique value of
x.