This page is part of the documentation for the Machine Learning Database.

It is a static snapshot of a Notebook which you can play with interactively by trying MLDB online now.
It's free and takes 30 seconds to get going.

Predicting Titanic Survival

From the description of the Kaggle machine learning challenge at https://www.kaggle.com/c/titanic:

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

In this demo we will use MLDB to train a classifier to predict whether a passenger would have survived the Titanic disaster.

Initializing pymldb and other imports

In this demo, we will use pymldb to interact with the REST API: see the Using pymldb Tutorial for more details.

In [12]:
from pymldb import Connection
mldb = Connection("http://localhost")

# we'll also need these later!
import numpy as np
import pandas as pd, matplotlib.pyplot as plt, seaborn, ipywidgets
%matplotlib inline

Checking out the Titanic dataset

From https://www.kaggle.com/c/titanic

Load up the data

See the Loading Data Tutorial for more details on how to get data into MLDB.

In [13]:
mldb.put('/v1/procedures/import_titanic_raw', { 
    "type": "import.text",
    "params": { 
        "dataFileUrl": "http://public.mldb.ai/titanic_train.csv",
        "outputDataset": "titanic_raw",
        "runOnCreation": True
    } 
})
Out[13]:
PUT http://localhost/v1/procedures/import_titanic_raw
201 Created
{
  "status": {
    "firstRun": {
      "runStarted": "2016-12-16T15:50:08.7610252Z", 
      "status": {
        "numLineErrors": 0
      }, 
      "runFinished": "2016-12-16T15:50:08.804541Z", 
      "id": "2016-12-16T15:50:08.760858Z-463496b56263af05", 
      "state": "finished"
    }
  }, 
  "config": {
    "params": {
      "outputDataset": "titanic_raw", 
      "runOnCreation": true, 
      "dataFileUrl": "http://public.mldb.ai/titanic_train.csv"
    }, 
    "type": "import.text", 
    "id": "import_titanic_raw"
  }, 
  "state": "ok", 
  "type": "import.text", 
  "id": "import_titanic_raw"
}
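
As a quick sanity check on the import, we can count the rows we just loaded. A minimal sketch, assuming MLDB's standard count(*) aggregate; the Kaggle training set has 891 passengers, so we expect 891 rows (the summary statistics below agree):

# Sanity check (sketch): the Kaggle training file has one row per
# passenger, so the import above should yield 891 rows.
mldb.query("SELECT count(*) AS row_count FROM titanic_raw")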

Let's look at the data

See the Query API documentation for more details on SQL queries.

In [14]:
mldb.query("select * from titanic_raw limit 5")
Out[14]:
_rowName  Age  Embarked  Fare     Name                                                  Parch  PassengerId  Pclass  Sex     SibSp  Ticket            label  Cabin
2         22   S         7.2500   Braund, Mr. Owen Harris                               0      1            3       male    1      A/5 21171         0      None
3         38   C         71.2833  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  0      2            1       female  1      PC 17599          1      C85
4         26   S         7.9250   Heikkinen, Miss. Laina                                0      3            3       female  0      STON/O2. 3101282  1      None
5         35   S         53.1000  Futrelle, Mrs. Jacques Heath (Lily May Peel)          0      4            1       female  1      113803            1      C123
6         35   S         8.0500   Allen, Mr. William Henry                              0      5            3       male    0      373450            0      None

As a first step in the modelling process, it is often very useful to look at summary statistics to get a sense of the data. To do so, we will create a Procedure of type summary.statistics and store the results in a new dataset called titanic_summary_stats:

In [15]:
print(mldb.post("/v1/procedures", {
    "type": "summary.statistics",
    "params": {
        "inputData": "SELECT * FROM titanic_raw",
        "outputDataset": "titanic_summary_stats",
        "runOnCreation": True
    }
}))
<Response [201]>

We can take a look at numerical columns:

In [16]:
mldb.query("""
    SELECT * EXCLUDING(value.most_frequent_items*) 
    FROM titanic_summary_stats 
    WHERE value.data_type='number'
""").transpose()
Out[16]:
_rowName            Fare     SibSp     PassengerId  label     Age      Pclass    Parch
value.1st_quartile  7.8958   0         223          0         20       2         0
value.3rd_quartile  31       1         669          1         38       3         0
value.avg           32.2042  0.523008  446          0.383838  29.6991  2.30864   0.381594
value.data_type     number   number    number       number    number   number    number
value.max           512.329  8         891          1         80       3         6
value.median        14.4542  0         446          0         28       3         0
value.min           0        0         1            0         0.42     1         0
value.num_null      0        0         0            0         177      0         0
value.num_unique    248      7         891          2         88       3         7
value.stddev        49.6934  1.10274   257.354      0.486592  14.5265  0.836071  0.806057

Training a classifier

We will create another Procedure, this time of type classifier.experiment, which trains and evaluates a classifier in a single step. The configuration parameter defines a random forest: bagging over 10 decision trees, each trained on a random 30% of the features (random_feature_propn) with a maximum depth of 10.

In [17]:
result = mldb.put('/v1/procedures/titanic_train_scorer', {
    "type": "classifier.experiment",
    "params": {
        "experimentName": "titanic",
        "inputData": """
            select 
                {Sex, Age, Fare, Embarked, Parch, SibSp, Pclass} as features,
                label
            from titanic_raw
        """,
        "configuration": {
            "type": "bagging",
            "num_bags": 10,
            "validation_split": 0,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 10,
                "random_feature_propn": 0.3
            }
        },
        "kfold": 3,
        "modelFileUrlPattern": "file://models/titanic.cls",
        "keepArtifacts": True,
        "outputAccuracyDataset": True,
        "runOnCreation": True
    }
})

auc = np.mean([x["resultsTest"]["auc"] for x in result.json()["status"]["firstRun"]["status"]["folds"]])
print "\nArea under ROC curve = %0.4f\n" % auc
Area under ROC curve = 0.8311

We automatically get a REST API for predictions

The procedure above created a Function of type classifier for us; below we use the first one, titanic_scorer_0.
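
Since the function is exposed over plain REST, any HTTP client can request predictions, not just pymldb. Here is a minimal sketch using the requests library (an assumption; any HTTP client would do), passing the features as a JSON-encoded input query parameter, exactly as pymldb does for us below:

import json
import requests  # assumption: any HTTP client works; requests is just an example

features = {"Age": 40, "Embarked": "C", "Fare": 50, "Parch": 4,
            "Pclass": 2, "Sex": "male", "SibSp": 4}
# GET /v1/functions/titanic_scorer_0/application?input=<url-encoded JSON>
r = requests.get(
    "http://localhost/v1/functions/titanic_scorer_0/application",
    params={"input": json.dumps({"features": features})})
print(r.json())  # e.g. {"output": {"score": 0.29...}}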

In [18]:
@ipywidgets.interact
def score( Age=(0,80), Embarked=["C", "Q", "S"], Fare=(1,100), Parch=(0,8), Pclass=(1,3), 
            Sex=["male", "female"], SibSp=(0,8)):
    return mldb.get('/v1/functions/titanic_scorer_0/application', input={"features": locals()})
GET http://localhost/v1/functions/titanic_scorer_0/application?input=%7B%22features%22%3A+%7B%22Fare%22%3A+50%2C+%22Embarked%22%3A+%22C%22%2C+%22Age%22%3A+40%2C+%22Parch%22%3A+4%2C+%22Pclass%22%3A+2%2C+%22Sex%22%3A+%22male%22%2C+%22SibSp%22%3A+4%7D%7D
200 OK
{
  "output": {
    "score": 0.2905334234237671
  }
}

What's in a score?

Scores aren't probabilities, but they can be used to create binary classifiers by applying a cutoff threshold. MLDB's classifier.experiment procedure outputs a dataset which you can use to figure out where you want to set that threshold.

In [19]:
test_results = mldb.query("select * from titanic_results_0 order by score desc")
test_results.head()
Out[19]:
_rowName  accuracy  falseNegatives  falsePositiveRate  falsePositives  index  label  precision  recall    score     trueNegatives  truePositiveRate  truePositives  weight
601       0.620072  106             0.000000           0               1      1      1.00       0.009346  0.781576  172            0.009346          1              1
558       0.623656  105             0.000000           0               2      1      1.00       0.018692  0.768337  172            0.018692          2              1
488       0.627240  104             0.000000           0               3      1      1.00       0.028037  0.741543  172            0.028037          3              1
700       0.623656  104             0.005814           1               4      0      0.75       0.028037  0.730499  171            0.028037          3              1
617       0.627240  103             0.005814           1               5      1      0.80       0.037383  0.730317  171            0.037383          4              1
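
For example, a simple (if crude) way to choose an operating point is to take the threshold that maximizes accuracy over this dataset. A sketch using the test_results DataFrame we just queried; in practice you would weigh false positives against false negatives according to your application:

# Pick the score threshold that maximizes accuracy (sketch).
best = test_results.loc[test_results.accuracy.idxmax()]
print("threshold = %.4f, accuracy = %.4f" % (best.score, best.accuracy))
print("TPR = %.4f, FPR = %.4f" % (best.truePositiveRate, best.falsePositiveRate))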

Here's an interactive way to graphically explore the tradeoffs between the True Positive Rate and the False Positive Rate, using what's called a ROC curve.

NOTE: the interactive part of this demo only works if you're running this Notebook live, not if you're looking at a static copy on http://docs.mldb.ai. See the documentation for Running MLDB.

In [20]:
@ipywidgets.interact
def test_results_plot( threshold_index=(0, len(test_results)-1)):
    row = test_results.iloc[threshold_index]
    cols = ["trueNegatives","falsePositives","falseNegatives","truePositives",]
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    test_results.plot(ax=ax1, x="falsePositiveRate", y="truePositiveRate", 
    legend=False, title="ROC Curve, threshold=%.4f" % row.score).set_ylabel('truePositiveRate')
    ax1.plot(row.falsePositiveRate, row.truePositiveRate, 'gs')
    
    ax2.pie(row[cols], labels=cols, autopct='%1.1f%%', startangle = 90,
            colors=['lightskyblue','lightcoral','lightcoral', 'lightskyblue'])
    ax2.axis('equal')
    f.subplots_adjust(hspace=.75)
    plt.show()

But what is the model doing under the hood?

Let's create a function of type classifier.explain to help us understand what's happening here.

In [21]:
mldb.put('/v1/functions/titanic_explainer', { 
    "id": "titanic_explainer", 
    "type": "classifier.explain",
    "params": { "modelFileUrl": "file://models/titanic.cls" }
})
Out[21]:
PUT http://localhost/v1/functions/titanic_explainer
201 Created
{
  "status": {
    "mode": "regression", 
    "summary": "COMMITTEE"
  }, 
  "config": {
    "params": {
      "modelFileUrl": "file://models/titanic.cls"
    }, 
    "type": "classifier.explain", 
    "id": "titanic_explainer"
  }, 
  "state": "ok", 
  "type": "classifier.explain", 
  "id": "titanic_explainer"
}

Exploring the impact of features for a single example

NOTE: the interactive part of this demo only works if you're running this Notebook live, not if you're looking at a static copy on http://docs.mldb.ai. See the documentation for Running MLDB.
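
If you are looking at the static copy, here is a non-interactive sketch of the same idea: we ask the explainer to break down the score for a single hypothetical passenger (the feature values below are arbitrary):

# Explain one hypothetical passenger (feature values chosen arbitrarily).
features = {"Age": 30, "Embarked": "S", "Fare": 20, "Parch": 0,
            "Pclass": 3, "Sex": "female", "SibSp": 0}
output = mldb.get('/v1/functions/titanic_explainer/application',
                  input={"features": features, "label": 1}).json()["output"]
print("bias = %+.4f" % output["bias"])
for feat, (val, ts) in output["explanation"]:
    print("%s: %+.4f" % (feat, val))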

In [22]:
@ipywidgets.interact
def sliders( Age=(0,80), Embarked=["C", "Q", "S"], Fare=(1,100), Parch=(0,8), Pclass=(1,3), 
            Sex=["male", "female"], SibSp=(0,8)):
    features = locals()
    x = mldb.get('/v1/functions/titanic_explainer/application', input={"features": features, "label": 1}).json()["output"]
   
    # Build the cumulative score, feature by feature, starting from the model bias
    df = pd.DataFrame(
        {"%s=%s" % (feat, str(features[feat])): val for (feat, (val, ts)) in x["explanation"]}, 
        index=["val"]).transpose().cumsum()
    pd.DataFrame(
        {"cumulative score": [x["bias"]] + list(df.val) + [df.val.iloc[-1]]}, 
        index=['bias'] + list(df.index) + ['final']
    ).plot(kind='line', drawstyle='steps-post', legend=False, figsize=(15, 5), 
           ylim=(-1, 1), title="Score = %.4f" % df.val.iloc[-1]).set_ylabel('Cumulative Score')
    
    plt.show()

Summing up explanation values to get overall feature importance

When we sum up the explanation values in the context of the correct label, we can get an indication of how important each feature was to making a correct classification.

In [23]:
df = mldb.query("""
select label, sum(
    titanic_explainer({
        label: label, 
        features: {Sex, Age, Fare, Embarked, Parch, SibSp, Pclass}
    })[explanation]
) as *
from titanic_raw group by label
""")
df.set_index("label").transpose().plot(kind='bar', title="Feature Importance", figsize=(15, 5))
plt.xticks(rotation=0)
plt.show()

We can also load up a custom UI for this

In [24]:
mldb.put('/v1/plugins/pytanic', {
    "type":"python",
    "params": {"address": "git://github.com/datacratic/mldb-pytanic-plugin"}
})
Out[24]:
PUT http://localhost/v1/plugins/pytanic
201 Created
{
  "config": {
    "params": {
      "address": "git://github.com/datacratic/mldb-pytanic-plugin"
    }, 
    "type": "python", 
    "id": "pytanic"
  }, 
  "state": "ok", 
  "type": "python", 
  "id": "pytanic"
}

Now you can browse to the plugin UI.

NOTE: this only works if you're running this Notebook live, not if you're looking at a static copy on http://docs.mldb.ai. See the documentation for Running MLDB.

Where to next?

Check out the other Tutorials and Demos.