This page is part of the documentation for the Machine Learning Database.
It is a static snapshot of a Notebook which you can play with interactively by trying MLDB online now.
It's free and takes 30 seconds to get going.
This demo will show how to do simple data exploration using MLDB. We will use the the Offshore Leaks Database from ICIJ. It contains data about offshore entities from the Panama Papers and the Offshore Leaks investigations. We will be showing off the capabilities of MLDB's SQL engine and how easy it is to create simple charts using a few external python libraries. Note that even though the data is coming from our servers, no transformation has been made to the original data: all the parsing and transformations are done here using MLDB.
WARNING: This demo requires 4GB of memory to run. If you are running this demo on VirtualBox, make sure enough memory is allocated to the virtual machine or the notebook may not run successfully.
from pymldb import Connection
mldb = Connection()
# loading some plotting libraries and setting some defaults
%matplotlib inline
import matplotlib.pyplot as plt
from pylab import rcParams
plt.style.use('fivethirtyeight')
rcParams['figure.figsize'] = 15, 5
rcParams['figure.facecolor'] = 'white'
rcParams['axes.facecolor'] = 'white'
from bokeh.charts import Line, show
from bokeh.models import HoverTool
from bokeh.io import output_notebook
output_notebook()
When using real-life data to do data science, a lot of time is spend getting the data in the right format. In this example, the original date in the data is not formatted correctly to be parsed into a timestamp. This is something we need to easily extract the year from it later on. One very powerful feature of MLDB is the jseval
function, that allows us to execute arbitrary JavaScript code inline in SQL. This allows us to use JavaScript's Date object to easily parse the date for us. The other columns are straightforwardly loaded by MLDB.
data_dir = 'http://public.mldb.ai/datasets/offshore_leaks/'
print mldb.post("/v1/procedures", {
"type": 'import.text',
"params": {
"dataFileUrl": data_dir + 'Entities.csv.gz',
"delimiter": ',',
"encoding": 'latin1',
"runOnCreation": True,
"outputDataset": 'entities_raw',
"ignoreBadLines": True,
}
})
print mldb.put('/v1/functions/parse_date', {
'type': 'sql.expression',
'params': {
'expression': "jseval('return d ? new Date(d) : null;', 'd', date) as parsed_date"
}
})
print mldb.post('/v1/procedures', {
'type': 'transform',
'params': {
'inputData': """
SELECT
* EXCLUDING(incorporation_date, inactivation_date, struck_off_date, dorm_date),
parse_date({date: incorporation_date})[parsed_date] AS incorporation_date,
parse_date({date: inactivation_date})[parsed_date] AS inactivation_date,
parse_date({date: struck_off_date})[parsed_date] AS struck_off_date,
parse_date({date: dorm_date})[parsed_date] AS dorm_date
FROM entities_raw
""",
'outputDataset': 'entities',
'runOnCreation': True
}
})
This is what the created dataset looks like:
mldb.query("SELECT * FROM entities LIMIT 5")
We can now count the number of offshore companies per jurisdiction (the country when they are legally located).
def bar_plot(dataframe, title='', print_table=True):
""" plot the result of a query in a bar chart """
if print_table:
print dataframe
ax = dataframe.plot(kind='bar')
for tick in ax.get_xticklabels():
tick.set_rotation(70)
ax.set_title(title)
ax.set_xlabel('')
return ax
data = mldb.query("""
SELECT count(*) AS count
NAMED jurisdiction_description
FROM entities
WHERE jurisdiction != 'XXX'
GROUP BY jurisdiction_description
ORDER BY count DESC
""")
bar_plot(data, 'Nb of entities per jurisdiction')
We can also look at the number of offshore companies created each year. Here we use the date_part
function to extract the year, and then we GROUP BY
on that year and count
the number of companies.
bar_plot(mldb.query("""
SELECT count(*) as count
NAMED year
FROM (
SELECT date_part('year', incorporation_date) AS year
FROM entities
WHERE incorporation_date IS NOT NULL
)
WHERE year BETWEEN 1977 AND 2016
GROUP BY year
ORDER BY year
"""), title='nb of incorporated offshore companies in the panama papers', print_table=False)
We can do something similar but also group on the country, and see how many offshore companies were created per year per country. We only keep the top countries using a IN
statement.
data = mldb.query("""
SELECT
year, country, count(*) AS count
FROM (
SELECT date_part('year', incorporation_date) AS year,
jurisdiction_description AS country
FROM entities
WHERE incorporation_date IS NOT NULL
AND (jurisdiction_description IN (
SELECT jurisdiction_description
FROM entities
WHERE jurisdiction != 'XXX'
GROUP BY jurisdiction_description
ORDER BY count(*) DESC
LIMIT 20
)
)
)
WHERE year BETWEEN 1977 AND 2016
GROUP BY country, year
ORDER BY country, year
""")
hover = HoverTool(tooltips=[("year", "$x"),("count", "$y"),("country", "@country")])
line = Line(data, x='year', y='count', color='country', legend='top_left', width=1000, tools=[hover])
show(line)
On thing that pops out is that the British Virgin Islands is a very popular choice! Also, we can see that in the beginning of the 2000s, the number of offshore companies created in Niue and the Bahamas dropped in favour of Panama, Samoa and the Seychelles.
Now let's have a look at intermediaries (banks, law firms, etc.) helping in the process of creating an offshore compagny.
print mldb.post("/v1/procedures", {
"type": 'import.text',
"params": {
"dataFileUrl": data_dir + 'Intermediaries.csv.gz',
"delimiter": ',',
"encoding": 'latin1',
"runOnCreation": True,
"outputDataset": 'intermediaries',
"ignoreBadLines": True,
}
})
mldb.query("SELECT * FROM intermediaries LIMIT 5")
Let's see how they are distributed geographically. First we need split the countries, using the tokenize
function, because some of them are linked to multiple countries (separated by a semicolon in the countries
column). Then we transpose
the dataset, because we want the countries to be the rows. Finally, for each row (i.e. each country), we calculate the horizontal_sum
of all the intermediaries.
data = mldb.query("""
SELECT horizontal_sum({*}) as nb_intermediaries
FROM transpose((
SELECT tokenize(countries, {splitChars: ';'}) AS *
FROM intermediaries
WHERE countries IS NOT NULL
))
WHERE horizontal_sum({*}) > 100
ORDER BY nb_intermediaries DESC
""")
bar_plot(data, 'Nb of intermediaries per country')
Now let's have a look at the Officers' data, the owners being the offshore entities.
print mldb.post("/v1/procedures", {
"type": 'import.text',
"params": {
"dataFileUrl": data_dir + 'Officers.csv.gz',
"delimiter": ',',
"encoding": 'latin1',
"runOnCreation": True,
"outputDataset": 'officers',
"ignoreBadLines": True
}
})
mldb.query('SELECT * FROM officers LIMIT 5')
And we also need all the links between those officers. The next dataset actually contains links between everything we have loaded so far. They are called "edges" as edges in a graph, and our entities/intermediaries/officers are the nodes of the graph.
print mldb.post("/v1/procedures", {
"type": 'import.text',
"params": {
"dataFileUrl": data_dir + 'all_edges.csv.gz',
"delimiter": ',',
"encoding": 'latin1',
"runOnCreation": True,
"outputDataset": 'edges',
"ignoreBadLines": True,
"select": "* EXCLUDING (rel_type), lower(rel_type) AS rel_type"
}
})
mldb.query('SELECT * FROM edges LIMIT 5')
Let's explore a bit and list the main type of links in the graph.
mldb.query("""
SELECT count(*) AS count
NAMED rel_type
FROM edges
GROUP BY rel_type
ORDER BY count DESC
LIMIT 20
""")
We can then use those links to see for example what officers are shareholders of what offshore entities, using a simple JOIN
. We can then filter to keep only the "the bearer" (or "el portador" in Spanish) kind of shareholders. Those are the most secretive ones, not keeping track of who is really behind the shares. Finally we group those by the country of the offshore company and by year.
data = mldb.query("""
SELECT count(*) as count, country, year
FROM (
SELECT en.jurisdiction_description as country, date_part('year', en.incorporation_date) as year, en.jurisdiction
FROM edges as e
JOIN officers as o
ON o.node_id = e.node_1
JOIN entities as en
ON en.node_id = e.node_2
WHERE ((lower(o.name) LIKE '%the bearer%') OR (lower(o.name) LIKE '%el portador%'))
AND e.rel_type = 'shareholder of'
AND (en.jurisdiction != 'XXX')
)
GROUP BY country, year""")
hover = HoverTool(tooltips=[("year", "$x"),("count", "$y"),("country", "@country")])
line = Line(data, x='year', y='count', color='country', legend='top_left', width=1000, tools=[hover],
title='Number of "bearer" shareholders per offshore jurisdictions')
show(line)
You can clearly see when the British Virgin Islands started to add more regulation on bearer shares, and when they switched to other countries.
This was a quick overview of how we can use MLDB to explore data, using its powerful yet simple SQL-like interface. You can find more advanced uses of MLDB in our other demos and tutorials.