This page is part of the documentation for the Machine Learning Database.

It is a static snapshot of a Notebook which you can play with interactively by trying MLDB online now.
It's free and takes 30 seconds to get going.

Selecting Columns Programmatically Using Column Expressions Tutorial¶

MLDB provides a complete implementation of the SQL SELECT statement. Most of the functions you are accustomed to using are available in your queries.

MLDB is different from traditional SQL databases in that there is no enforced schema on rows, allowing you to work with millions of columns of sparse data. This makes it easy to load and manipulate sparse datasets, even when there are millions of columns. To reduce the size of your dataset or use only specific variables, we may need to select columns based on specific critera. Column Expressions is an MLDB extension that provides additional control over your column selection. With a column expression, you can programmatically return specific columns with a SQL SELECT statement.

In this tutorial, we will provide examples of COLUMN EXPR within SELECT statements. This tutorial assumes familiarity with Procedures and Datasets. We suggest going through the Procedures and Functions Tutorial and the Loading Data Tutorial beforehand.

Setting up¶

The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.

from pymldb import Connection
mldb = Connection()

Basic usage example¶

Let's begin by loading and visualizing our data. We will be using the dataset from the Virtual Manipulation of Datasets Tutorial. We had chosen the tokenize function to count the number of words in the Wikipedia descriptions of several Machine Learning concepts (please check out the tutorial for more details).

print mldb.put("/v1/procedures/import_ML_concepts", {
    "type":"import.text",
    "params": {
        "dataFileUrl":"http://public.mldb.ai/datasets/MachineLearningConcepts.csv",
        "outputDataset":{
            "id":"ml_concepts",
            "type": "sparse.mutable"
        },
        "named": "Concepts",
        "select": """ 
            tokenize(
                lower(Text), 
                {splitChars: ' -''"?!;:/[]*,().',  
                minTokenLength: 4}) AS *
        """,
        "runOnCreation": True
    }
})

<Response [201]>

Each word is represented by a column and each Machine Learning concept by a row. We can run a simple SELECT query to take a quick look at the first 5 rows of our dataset.

mldb.query("SELECT * FROM ml_concepts LIMIT 5")

There are 286 columns, some of which may or may not be useful to the data analysis at hand. For example, we may want to rebuild a dataset with:

verbs and adverbs that end with "ing"
words that appear at least twice in each of the descriptions of the Machine Learning concepts.

This can be done in a few queries as you will see below.

Using column expressions to keep columns that end with "ing"¶

Column Expressions provide efficient ways of picking and choosing our columns. For example, we can only choose verbs and adverbs that end with "ing" to understand the overall meaning of a description.

We use the columnName column expression function along with the LIKE SQL expression, as you will see below.

mldb.query("""
    SELECT COLUMN EXPR (WHERE columnName() LIKE '%ing')
    FROM ml_concepts
    LIMIT 5
""")

This is very powerful because the LIKE statement in Standard SQL is typically found in row operations and more rarely in column operations. MLDB makes it simple to use such SQL expressions on columns.

Using column expressions to keep columns that appear in multiple descriptions¶

With Column Expressions, we can select columns based on specific row selection criteria. COLUMN EXPR</code> will allow us for example to choose words that appear in multiple descriptions. In this case, we filter on words that show up at least 4 times.


To achieve the desired outcome, we use a Built-in Function available in column expressions called rowCount. rowCount iterates through each column and returns the number of rows that have a value for the specific column.

mldb.query("""
    SELECT COLUMN EXPR (WHERE rowCount() > 4)
    FROM ml_concepts
""")

The results make sense. The words that we found above in the columns are common in Machine Learning concept descriptions. With a plain SQL statement and the rowCount function, we reduced our dataset to include words that appear at least 4 times.

Nested JSON example¶

Nested JSON objects can have complex schemas, often involving multi-level and multidimensional data structures. In this section we will create a more complex dataset to illustrate ways to simplify data structures and column selection with Built-in Function and Column Expression.

Let's first create an empty dataset called 'toy_example'.

# create dataset
print mldb.put('/v1/datasets/toy_example', { "type":"sparse.mutable" })

<Response [201]>

We will now create one row in the 'toy_example' dataset with the 'row1' JSON object below.

import json

row1 = {
  "name": "Bob", 
  "address": {"city": "Montreal", "street": "Stanley"}, 
  "sports": ["soccer","hockey"], 
  "friends": [{"name": "Mich", "age": 25}, {"name": "Jean", "age": 28}]
}

# update dataset by adding a row
mldb.post('/v1/datasets/toy_example/rows', {
    "rowName": "row1",
    "columns": [["data", json.dumps(row1), 0]]
})
# save changes
mldb.post("/v1/datasets/toy_example/commit")

We will check out our data with a SELECT query.

mldb.query("SELECT * FROM toy_example")

There are many elements within the cell above. We will need to better structure elements within the nested JSON object.

Working with nested JSON objects with built-in functions and column expressions¶

To understand and query nested JSON objects, we will be using a Built-in Function called parse_json</code> and a Column Expression columnPathElement</code>.

This is where the parse_json function comes in handy. It will help us turn a multidimensional JSON object into a 2D dataset.

mldb.query("""
        SELECT parse_json(data, {arrays: 'parse'}) AS * 
        FROM toy_example
""")

parse_json is a powerful feature since we can create 2D representations out of multidimensional data. We can read all of the elements of the JSON object on one line. It is also easier to SQL as we will see below.

columnPathElement makes it convenient to navigate specific parts of the data structure. In the next block of code, we will do the following:

use parse_json to parse each data element of the object on one row (same as above)
select specific cells using columnPathElement where the the column path name at index = 2 is 'name' (note that 'friends' is at index = 0)

mldb.query("""
    SELECT COLUMN EXPR (WHERE columnPathElement(2) = 'name') 
    FROM (
        SELECT parse_json(data, {arrays: 'parse'}) AS * NAMED rowPath() FROM toy_example
    )
""")

We now know the name of Bob's two friends... As you may have noticed, this is very practical if we want to query a specific attribute of a nested object. The columnPathElement Column Expression allows us to easily query specific JSON data levels or dimensions.

Column operations such as the ones shown in this tutorial can be difficult without column expressions. Column Expressions offer a compact and flexible way to programmatically select columns. It is a great tool to carve out the data that is most needed for your analysis.

In this tutorial, we covered three Column Expressions:

columnName which returns the name of the columns inside our dataset
rowCount which returns the number of non-empty rows for each column
columnPathElement which allows us to chose columns at specific sub-levels

Where to next?¶

Check out the other Tutorials and Demos.

	addition	algorithm	algorithms	also	analysis	analyze	applications	approach	assigns	associated	...	popularized	provide	rather	recurrent	serve	stored	systems	threshold	understanding	wrong
_rowName
Support vector machine	1	2	1	1	1	1	1	1	1	1	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Logistic regression	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Deep belief network	NaN	NaN	NaN	1	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Restricted boltzmann machines	NaN	1	2	NaN	NaN	NaN	1	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Hopfield network	NaN	NaN	NaN	1	NaN	NaN	NaN	NaN	NaN	NaN	...	1	1	1	1	1	1	1	1	1	1

	belonging	clustering	learning	making	mapping	performing	preprocessing	training	using	estimating	being	starting	depending	filtering	modelling	understanding
_rowName
Support vector machine	1	3	5	1	1	1	1	2	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Logistic regression	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3	1	NaN	NaN	NaN	NaN	NaN	NaN
Deep belief network	NaN	NaN	2	NaN	NaN	NaN	NaN	2	NaN	NaN	1	1	NaN	NaN	NaN	NaN
Restricted boltzmann machines	NaN	NaN	2	NaN	NaN	NaN	NaN	1	NaN	NaN	NaN	NaN	1	1	1	NaN
Hopfield network	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1

	learning	machine	machines	model	training	used	with	network	neural
_rowName
Support vector machine	5	1	2	2	2	2	1	NaN	NaN
Logistic regression	NaN	NaN	NaN	4	NaN	1	1	NaN	NaN
Deep belief network	2	1	1	1	2	NaN	1	4	1
Restricted boltzmann machines	2	1	3	NaN	1	NaN	1	1	1
Hopfield network	NaN	NaN	NaN	1	NaN	NaN	1	2	1
Naive bayes classifier	2	1	1	1	1	1	4	NaN	NaN
Boltzmann machine	3	2	2	NaN	1	NaN	1	2	2
Autoencoder	2	NaN	NaN	NaN	NaN	2	NaN	2	1
Artificial neural network	1	1	NaN	NaN	NaN	1	NaN	NaN	3

	data
_rowName
row1	{"sports": ["soccer", "hockey"], "friends": [{...

	address.city	address.street	friends.0.age	friends.0.name	friends.1.age	friends.1.name	name	sports.0	sports.1
_rowName
row1	Montreal	Stanley	25	Mich	28	Jean	Bob	soccer	hockey

	friends.0.name	friends.1.name
_rowName
row1	Mich	Jean