This page is part of the documentation for the Machine Learning Database.

It is a static snapshot of a Notebook which you can play with interactively by trying MLDB online now.
It's free and takes 30 seconds to get going.

Recommending Movies¶

The MovieLens 20M dataset contains 20 million user ratings from 1 to 5 of thousands of movies. In this demo we'll build a simple recommendation system which will use this data to suggest 25 movies based on a seed movie you provide.

The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.

from pymldb import Connection
mldb = Connection()

Download the MovieLens 20M data¶

We'll start by using some command-line tools to download and decompress the data.

%%bash
mkdir -p /mldb_data/data
curl "http://public.mldb.ai/ml-20m.zip" 2>/dev/null  > /mldb_data/data/ml-20m.zip
unzip /mldb_data/data/ml-20m.zip -d /mldb_data/data

Archive:  /mldb_data/data/ml-20m.zip

replace /mldb_data/data/ml-20m/links.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename:  NULL
(EOF or read error, treating as "[N]one" ...)

%%bash
head /mldb_data/data/ml-20m/README.txt

Summary
=======

This dataset (ml-20m) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on March 31, 2015.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in four files, `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.

%%bash
head /mldb_data/data/ml-20m/ratings.csv

userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
1,112,3.5,1094785740
1,151,4.0,1094785734
1,223,4.0,1112485573
1,253,4.0,1112484940

Load the data into MLDB¶

See the Loading Data Tutorial guide for more details on how to get data into MLDB.

Here we load a text file and use the pivot aggregator to create a sparse matrix representation of the ratings.

%%time

print mldb.put('/v1/procedures/import_mvlns', {
    "type": "import.text", 
    "params": {
        "dataFileUrl":"file:///mldb_data/data/ml-20m/ratings.csv",
        "outputDataset": "mvlns_ratings_csv",
        "runOnCreation": True
    }
})

print mldb.put('/v1/procedures/process_mvlns', {
    "type": "transform",
    "params": {
        "inputData": """
            select pivot(movieId, rating) as *
            named userId 
            from mvlns_ratings_csv
            group by userId
            """,
        "outputDataset": "mvlns_ratings",
        "runOnCreation": True
    }
})

<Response [201]>
<Response [201]>
CPU times: user 18.4 ms, sys: 4.22 ms, total: 22.7 ms
Wall time: 1min 19s

Take a peek at the dataset¶

We'll use the Query API. Each row is a user, each column is a movie, and the cell value is the rating the user gave the movie.

mldb.query("select * from mvlns_ratings limit 3")

Singular Value Decomposition (SVD)¶

We will create and run a Procedure of type svd.train. This creates an embedding dataset where each row is a movie and the columns represent coordinates in a 100-dimensional space. Similar movies end up closer to each other than dissimilar movies.

print mldb.put('/v1/procedures/mvlns_svd', {
    "type" : "svd.train",
    "params" : {
        "trainingData" : "select COLUMN EXPR (where rowCount() > 3) from mvlns_ratings",
        "columnOutputDataset" : "mvlns_svd_embedding",
        "modelFileUrl": "file://models/mvlns.svd",
        "functionName": "mvlns_svd_embedder",
        "runOnCreation": True
    }
})

<Response [201]>

Explore the results!¶

Our dataset has movieIds but humans think about movie names so we'll load up the movie names in a dataset.

from ipywidgets import interact, interact_manual
from uuid import uuid4

print mldb.put('/v1/procedures/import_movies', {
    "type": "import.text", 
    "params": {
        "dataFileUrl":"file:///mldb_data/data/ml-20m/movies.csv",
        "outputDataset": "movies",
        "select": "title, movieId",
        "named": "movieId",
        "runOnCreation": True
    }
})

<Response [201]>

A simple search function to find all movies (and corresponding movieIds) whose names contain a string.

@interact
def movie_search(x = "toy story"):
    return mldb.query("select title from movies where regex_match(lower(title), '.*%s.*')" % x.strip().lower())

Now let's create a dataset to hold user preferences, and a simple function to simulate a user rating movies they like and movies they dislike, based on the movie_search function above.

print mldb.put("/v1/datasets/mvlns_user_prefs", {"type": "sparse.mutable"})

print mldb.put("/v1/functions/preferences", {
    "type": "sql.query",
    "params": {
        "query": "select {*} as p from mvlns_user_prefs where rowName()=$user"
    }
})

def save_prefs(user_id, likes, dislikes):
    for rating, search_terms in zip([5,1],[likes, dislikes]):
        for x in search_terms.split(","):
            if len(x) > 3:
                mldb.post("/v1/datasets/mvlns_user_prefs/rows", {
                     "rowName":user_id, 
                     "columns": [[str(m), rating, 0] for m in movie_search(x).index]
                })
    mldb.post("/v1/datasets/mvlns_user_prefs/commit", {})

save_prefs("janedoe", "Toy Story", "Terminator")
mldb.query("select preferences({ user: 'janedoe' })[p] as *")

<Response [201]>
<Response [201]>

With all that done, we can now build a recommendation engine out of a simple SQL query by mapping a user's preferences into the same space as the movie embeddings (i.e. embedding the user's preferences) and looking for the nearest movies.

print mldb.put("/v1/functions/nearest_movies", {
    "type": "embedding.neighbors",
    "params": {
        "dataset": "mvlns_svd_embedding",
        "defaultNumNeighbors": 25,
        "columnName": "embedding"
    }
})

print mldb.put("/v1/functions/recommendations", {
    "type": "sql.query",
    "params": {
        "query": """
            select nearest_movies({ 
                    coords: mvlns_svd_embedder({
                        row: preferences({ user: $user })[p]
                    })[embedding] 
                })[distances] as r
        """
    }
})

<Response [201]>
<Response [201]>

Here's a simple function which lets you simulate the results of liking and disliking certain movies and getting back the resulting recommendations.

def recommend(likes="Toy Story, Terminator", dislikes="Star Trek"):
    
    # here we simulate a new user saving these preferences
    user_id = str(uuid4())
    save_prefs(user_id, likes, dislikes)
    
    # we can then run an SQL query to:
    #   - retrieve recommendations
    #   - transpose and join them to movies to get titles
    #   - exclude the already-rated movies from the result
    return mldb.query("""
        select m.title 
        named m.movieId
        from 
            transpose(( select recommendations({ user: '%(user)s' }) )) as r 
            join movies as m on r.rowPathElement(2) = m.rowPathElement(0)
        where m.movieId not in (keys of preferences({ user: '%(user)s' })[p])
        order by r.result
        """ % dict(user=user_id))

recommend(likes="Toy Story, Terminator", dislikes="Star Trek")

Here's an interactive form that lets you play with this function to see if you agree with the recommendations!

NOTE: the interactive part of this demo only works if you're running this Notebook live, not if you're looking at a static copy on http://docs.mldb.ai. See the documentation for Running MLDB.

interact_manual(recommend)

Where to next?¶

Check out the other Tutorials and Demos.

	21	58	69	163	246	357	441	471	858	908	...	3255	3260	3386	3418	3425	3654	3713	3865	3879	3897
_rowName
138492	3	4.5	4.5	3.5	4	4.5	5	4.5	5	4	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
138491	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
138490	3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	4	4	3	4	2	4	4	1	3	4

	title
_rowName
115875	Toy Story Toons: Hawaiian Vacation (2011)
3114	Toy Story 2 (1999)
120474	Toy Story That Time Forgot (2014)
106022	Toy Story of Terror (2013)
78499	Toy Story 3 (2010)
1	Toy Story (1995)
120468	Toy Story Toons: Partysaurus Rex (2012)
115879	Toy Story Toons: Small Fry (2011)

	1	589	1240	3114	4934	6537	68791	78499	102425	106022	115875	115879	120468	120474
_rowName
result	5	1	1	5	1	1	1	5	1	5	5	5	5	5

	m.title
_rowName
316	Stargate (1994)
1682	Truman Show, The (1998)
2617	Mummy, The (1999)
1222	Full Metal Jacket (1987)
2355	Bug's Life, A (1998)
6365	Matrix Reloaded, The (2003)
3527	Predator (1987)
2329	American History X (1998)
292	Outbreak (1995)
145	Bad Boys (1995)
4886	Monsters, Inc. (2001)
2000	Lethal Weapon (1987)
1784	As Good as It Gets (1997)
434	Cliffhanger (1993)
1961	Rain Man (1988)
349	Clear and Present Danger (1994)
2115	Indiana Jones and the Temple of Doom (1984)
5445	Minority Report (2002)
594	Snow White and the Seven Dwarfs (1937)
1641	Full Monty, The (1997)
2003	Gremlins (1984)

	m.title
_rowName
316	Stargate (1994)
1682	Truman Show, The (1998)
2617	Mummy, The (1999)
1222	Full Metal Jacket (1987)
2355	Bug's Life, A (1998)
6365	Matrix Reloaded, The (2003)
3527	Predator (1987)
2329	American History X (1998)
292	Outbreak (1995)
145	Bad Boys (1995)
4886	Monsters, Inc. (2001)
2000	Lethal Weapon (1987)
1784	As Good as It Gets (1997)
434	Cliffhanger (1993)
1961	Rain Man (1988)
349	Clear and Present Danger (1994)
2115	Indiana Jones and the Temple of Doom (1984)
5445	Minority Report (2002)
594	Snow White and the Seven Dwarfs (1937)
1641	Full Monty, The (1997)
2003	Gremlins (1984)