This page is part of the documentation for the Machine Learning Database.

It is a static snapshot of a Notebook which you can play with interactively by trying MLDB online.

Spam Filtering Using The Enron Dataset

In [1]:
from pymldb import Connection
mldb = Connection('http://localhost/')

Let's start by loading the dataset. We have already merged the different email files into a single .csv file, which we've made available online. Since this dataset is actually made up of six different datasets, we'll restrict ourselves to the first one for simplicity, using a "where" clause.

In [88]:
print mldb.post('/v1/procedures', {
    'type': 'import.text',
    'params': {
        'dataFileUrl': 'http://public.mldb.ai/datasets/enron.csv.gz',
        'outputDataset': 'enron_data',
        'named': "'enron_' + dataset + '_mail_' + index",
        'where': 'dataset = 1',
        'runOnCreation': True
        }
    })
<Response [201]>

This is what the dataset looks like.

index: order in which the emails arrived in the user's inbox
msg: actual content of the email
label: was the email legitimate (ham) or not (spam)

In [89]:
mldb.query('select index, msg, label from enron_data order by index limit 10')
Out[89]:
index label msg
_rowName
enron_1_mail_0 0 spam Subject: dobmeos with hgh my energy level has ...
enron_1_mail_1 1 spam Subject: your prescription is ready . . oxwq s...
enron_1_mail_2 2 ham Subject: christmas tree farm pictures
enron_1_mail_3 3 ham Subject: vastar resources , inc .gary , produc...
enron_1_mail_4 4 ham Subject: calpine daily gas nomination- calpine...
enron_1_mail_5 5 ham Subject: re : issuefyi - see note below - alre...
enron_1_mail_6 6 ham Subject: meter 7268 nov allocationfyi .- - - -...
enron_1_mail_7 7 spam Subject: get that new car 8434people nowthe we...
enron_1_mail_8 8 ham Subject: mcmullen gas for 11 / 99jackie ,since...
enron_1_mail_9 9 spam Subject: await your responsedear partner ,we a...

Let's create a sql.expression that will simply tokenize the emails into a bag of words. Those will be our features on which we will train a classifier.

In [96]:
print mldb.put('/v1/functions/bow', {
    'type': 'sql.expression',
    'params': {
        'expression': """
            tokenize(msg, {splitChars: ' :.-!?''"()[],', quoteChar: ''}) as bow
            """
    }
})
<Response [201]>

Then we can generate the features for the whole dataset, and write them into a new dataset, using the transform procedure.

In [97]:
print mldb.post('/v1/procedures', {
    'type': 'transform',
    'params': {
        'inputData': """
            select bow({msg})[bow] as *, label = 'spam' as message_is_spam
            from enron_data
            """,
        'outputDataset': 'enron_features',
        'runOnCreation': True
    }
})
<Response [201]>

Here is a snapshot of the sparse feature matrix:

In [98]:
mldb.query('select * from enron_features limit 10')
Out[98]:
/ 24 Subject agent apache can coming cornhusker cynergy daren ... tadalafil talkingabout them three tongue treatment two under ve viagra
_rowName
enron_1_mail_3056 1 1 1 1 3 1 1 1 2 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
enron_1_mail_2982 10 NaN 1 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
enron_1_mail_1331 NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
enron_1_mail_2763 6 NaN 1 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
enron_1_mail_388 9 NaN 1 NaN NaN 1 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
enron_1_mail_1775 2 NaN 1 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
enron_1_mail_2745 2 NaN 1 NaN NaN 1 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
enron_1_mail_1100 NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
enron_1_mail_2738 NaN NaN 1 NaN NaN 2 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
enron_1_mail_3067 6 NaN 1 NaN NaN 2 NaN NaN NaN NaN ... 1 1 1 1 1 1 1 1 1 1

10 rows × 391 columns

Finally, let's train a very simple classifier, using half of the messages for training and the other half for testing. The classifier assigns a score to every email, and we can then choose a threshold: everything above the threshold is classified as spam, and everything below as ham.
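
To make the thresholding idea concrete, here is a minimal pure-Python sketch (toy scores, not the actual MLDB output): every message scoring above the cutoff goes to the junkmail folder, everything else stays in the inbox.

```python
# Toy classifier scores for four hypothetical messages (higher = more spammy).
scored = [('mail_0', 0.92), ('mail_1', 0.15), ('mail_2', 0.78), ('mail_3', 0.03)]
threshold = 0.5

# Route each message based on the chosen threshold.
junkmail = [name for name, score in scored if score > threshold]
inbox    = [name for name, score in scored if score <= threshold]

print(junkmail)  # ['mail_0', 'mail_2']
print(inbox)     # ['mail_1', 'mail_3']
```

Moving the threshold up or down trades spam in the inbox against ham in the junkmail folder, which is exactly the trade-off explored below.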

In [99]:
res = mldb.post('/v1/procedures', {
    'type': 'classifier.experiment',
    'params': {
        'experimentName': 'enron_experiment1',
        'inputData': '''
            select 
                {* excluding(message_is_spam)} as features, 
                message_is_spam as label 
            from enron_features''',
        'modelFileUrlPattern': 'file://enron_model_$runid.cls',
        'algorithm': 'bbdt',
        'runOnCreation': True
    }
})

print 'AUC =', res.json()['status']['firstRun']['status']['folds'][0]['resultsTest']['auc']
AUC = 0.992623778886

This is an impressive-looking AUC!

But the AUC score of a classifier is only a very generic measure of performance. For a specific problem like spam filtering, we're better off using a performance metric that truly matches our intuition about what a good spam filter ought to be. Namely, a good spam filtering algorithm should almost never flag a legitimate email as spam, while keeping your inbox as spam-free as possible. This is what should be used to choose the threshold for the classifier, and then to measure its performance.

So instead of the AUC (which doesn't pick a specific threshold but takes all of them into account), let's use as our performance metric the best $F_{0.05}$ score, which gives 20 times more importance to precision than to recall. In other words, this metric encodes the fact that classifying as spam only what is really spam is 20 times more important than catching all the spam.
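
As a quick sanity check of the formula used by the enron_score function below, here is the general $F_\beta$ score in pure Python, $F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 P + R}$, evaluated on illustrative precision/recall values (not numbers from the experiment):

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta < 1 weights precision more heavily than recall."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# With beta = 0.05, perfect precision compensates for poor recall...
print(round(f_beta(1.0, 0.5, 0.05), 3))  # 0.998

# ...but perfect recall cannot compensate for poor precision.
print(round(f_beta(0.5, 1.0, 0.05), 3))  # 0.501
```

This asymmetry is exactly what we want for spam filtering: losing a legitimate email hurts far more than letting an occasional spam through.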

Let's see how we are doing with that metric.

In [94]:
print mldb.put('/v1/functions/enron_score', {
    'type': 'sql.expression',
    'params': {
        'expression': """
            (1 + pow(ratio, 2)) * (precision * recall) / (precision * pow(ratio, 2) + recall) as enron_score
            """
    }
})
<Response [201]>
In [100]:
mldb.query("""
    select 
        "falseNegatives" as spam_in_inbox, 
        "trueNegatives"  as ham_in_inbox,
        "falsePositives" as ham_in_junkmail, 
        "truePositives"  as spam_in_junkmail, 
        enron_score({precision, recall, ratio:0.05}) as *
    named 'best_score'
    from enron_experiment1_results_0
    order by enron_score desc
    limit 1
""")
Out[100]:
enron_score ham_in_inbox ham_in_junkmail spam_in_inbox spam_in_junkmail
_rowName
best_score 0.994153 1880 1 416 346

As you can see, in order to maximize our score (i.e. to get very few ham messages in the junkmail folder), we have to accept a very high proportion of spam in our inbox. So even though we have a very impressive-looking AUC score, our spam filter isn't actually very good!

You can read more about the dangers of relying too much on AUC and the benefits of using a problem-specific measure in our Machine Learning Meets Economics series of blog posts.

Where to next?

Check out the other Tutorials and Demos.
