This page is part of the documentation for the Machine Learning Database.

It is a static snapshot of a Notebook which you can play with interactively in MLDB.

Virtual Manipulation of Datasets Tutorial

In this tutorial, we will be using two Dataset types, namely Sampled Dataset and Transposed Dataset. This tutorial assumes familiarity with Procedures. We suggest going through the Procedures and Functions Tutorial beforehand.

To run the examples below, we created a toy dataset using the description of machine learning concepts from Wikipedia. Our dataset is made up of two columns. The first column contains the name of the machine learning concept. The second column contains the corresponding description.

The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.

In [8]:
from pymldb import Connection
mldb = Connection("http://localhost")

Loading the Dataset

A powerful feature of MLDB is the ability to execute SQL while loading data. Here, the tokenize function splits the text of each description into (lowercased) individual words, removing words shorter than 4 characters. Notice that we have chosen a dataset of type sparse.mutable, since we do not know ahead of time how many columns each row will have.

In [9]:
print mldb.put('/v1/procedures/import_ML_concepts', {
        "type":"import.text",
        "params": 
        {
            "dataFileUrl":"http://public.mldb.ai/datasets/MachineLearningConcepts.csv",
            "outputDataset":{
                "id":"ml_concepts", # can be accessed later using id
                "type": "sparse.mutable" # sparse mutable dataset is needed since we tokenize words below
            },
            "named": "Concepts", #row name expression for output dataset
            "select": 
                """ 
                    tokenize(
                        lower(Text), 
                        {splitChars: ' -''"?!;:/[]*,().',  
                        minTokenLength: 4}) AS *
                """, # within the tokenize function:
                        # lower case
                        # leave out punctuation and spaces
                        # allow only words > 4 characters
            "runOnCreation": True
        }
    }
)
<Response [201]>
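The tokenize call above can be pictured in plain Python. This is a rough sketch, not MLDB's implementation; the split_chars and min_token_length names simply mirror the tokenize parameters used in the procedure:

```python
import re

def tokenize(text, split_chars=" -'\"?!;:/[]*,().", min_token_length=4):
    # Split on any of the given characters, lowercase the text, and
    # count occurrences of each token of at least min_token_length characters.
    pattern = "[" + re.escape(split_chars) + "]+"
    tokens = [t for t in re.split(pattern, text.lower())
              if len(t) >= min_token_length]
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts

tokenize("Machine learning is the study of machine learning!")
# {'machine': 2, 'learning': 2, 'study': 1}
```

Short words like "is" and "the" are dropped, which is why they never appear as columns in the dataset below.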

A quick look at the data

We can use the Query API to get the data into a Pandas DataFrame to take a quick look at it. You will notice that certain cells have a 'NaN' value, as seen below. This is because the dataset is sparse: not every word appears in the description of every concept. Those missing values are represented as NaNs in a Pandas DataFrame.

In [10]:
mldb.query("SELECT * FROM ml_concepts LIMIT 5")
Out[10]:
addition algorithm algorithms also analysis analyze applications approach assigns associated ... popularized provide rather recurrent serve stored systems threshold understanding wrong
_rowName
Support vector machine 1 2 1 1 1 1 1 1 1 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Logistic regression NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Deep belief network NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Restricted boltzmann machines NaN 1 2 NaN NaN NaN 1 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Hopfield network NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN ... 1 1 1 1 1 1 1 1 1 1

5 rows × 286 columns

Exploring the data

Let's count the total number of words used to describe each concept.

In [11]:
mldb.query(
    """
    SELECT horizontal_sum({*}) AS count
    FROM ml_concepts
    """
)
Out[11]:
count
_rowName
Support vector machine 149
Logistic regression 99
Deep belief network 98
Restricted boltzmann machines 124
Hopfield network 53
Naive bayes classifier 165
Boltzmann machine 89
Autoencoder 31
Artificial neural network 40
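The horizontal_sum aggregation can be pictured in plain Python by treating the sparse dataset as a dict of row dicts. This is a toy illustration with made-up counts, not MLDB's implementation:

```python
def horizontal_sum(dataset):
    # Sum the values across each row, mirroring SELECT horizontal_sum({*}).
    # Absent columns simply contribute nothing, just as NaNs are skipped.
    return {row: sum(cols.values()) for row, cols in dataset.items()}

# Toy term counts; the real dataset has one column per tokenized word.
concepts = {
    "Autoencoder": {"learning": 2, "neural": 1},
    "Hopfield network": {"network": 3},
}
horizontal_sum(concepts)
# {'Autoencoder': 3, 'Hopfield network': 3}
```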

Taking the transpose of data

There are two ways to take the transpose of a dataset:

1. using a transposed dataset
2. using a transpose SQL function

In [12]:
print mldb.put("/v1/datasets/transposed_concepts", {
        "type": "transposed",
        "params": 
        {
            "dataset": {"id":"ml_concepts"}
        }
    }
)
<Response [201]>

The code above transposed the data and created the transposed_concepts dataset. Now that we have transposed our data, we can easily see which words are used most frequently in the machine learning concept descriptions. Note that below we join the results from the transposed_concepts dataset with those of the transpose inline function to compare the two outputs.

In [13]:
mldb.query(
    """
        SELECT 
            horizontal_count({a.*}) AS top_words_transp_dataset,
            horizontal_count({b.*}) AS top_words_transp_function
        NAMED a.rowName()
        FROM transposed_concepts AS a
        JOIN transpose(ml_concepts) AS b
            ON a.rowName() = b.rowName()
            ORDER BY top_words_transp_dataset DESC
            LIMIT 10
    """
)
Out[13]:
top_words_transp_dataset top_words_transp_function
_rowName
learning 7 7
with 7 7
neural 6 6
machine 6 6
training 5 5
used 5 5
machines 5 5
network 5 5
model 5 5
which 4 4

While the two methods yield identical results, the transposed dataset keeps the output available under its own id. This may be useful if you want to use it in multiple steps of your data analysis. In cases where you don't need the transpose more than once, you can simply use the inline syntax.
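The transposition itself can be pictured in plain Python by treating the dataset as a dict of row dicts. This is a toy illustration with made-up counts, not how MLDB implements its virtual transposed dataset:

```python
def transpose(dataset):
    # dataset: {row_name: {column_name: value}}; return the flipped
    # mapping, so columns become rows and vice versa.
    out = {}
    for row, cols in dataset.items():
        for col, value in cols.items():
            out.setdefault(col, {})[row] = value
    return out

concepts = {
    "Logistic regression": {"learning": 1, "model": 2},
    "Hopfield network": {"learning": 1, "neural": 3},
}
by_word = transpose(concepts)
# Each word now maps to the concepts that use it, so counting entries per
# word mirrors horizontal_count on the transposed dataset.
top_words = {word: len(rows) for word, rows in by_word.items()}
# {'learning': 2, 'model': 1, 'neural': 1}
```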

Taking a sample of data

There are two ways to take a sample of a dataset:

1. using a sampled dataset
2. using a sample SQL function

In [14]:
In [14]:
print mldb.put("/v1/datasets/sampled_tokens", {
        "type": "sampled",
        "params": 
        {
            "rows": 10,
            "withReplacement": False,
            "dataset": {"id":"transposed_concepts"},
            "seed": 0
        }
    }
)
<Response [201]>
In [15]:
mldb.query("SELECT * FROM sampled_tokens")
Out[15]:
Naive bayes classifier Artificial neural network Boltzmann machine Restricted boltzmann machines Support vector machine Deep belief network Logistic regression Hopfield network
_rowName
maximum 1 NaN NaN NaN NaN NaN NaN NaN
nervous NaN 1 NaN NaN NaN NaN NaN NaN
random NaN NaN 1 NaN NaN NaN NaN NaN
studied 1 NaN NaN NaN NaN NaN NaN NaN
neurons NaN NaN NaN 1 NaN NaN NaN NaN
symmetric NaN NaN NaN 1 NaN NaN NaN NaN
marked NaN NaN NaN NaN 1 NaN NaN NaN
such 1 NaN NaN NaN NaN 1 1 NaN
hopfield NaN NaN 1 NaN NaN NaN NaN 4
only NaN NaN NaN NaN 1 NaN NaN NaN

The code above is similar to the following query, which uses the sample SQL function:

mldb.query(
    """
    SELECT * 
    FROM sample(
        transposed_concepts, 
        {
            rows: 10,
            withReplacement: False
        }
    )                      
    """
)
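Sampling without replacement can be sketched with Python's random module. This is a toy stand-in for the sampled dataset; the sample_rows helper and its arguments are made up for illustration:

```python
import random

def sample_rows(dataset, rows, seed=0):
    # Pick `rows` distinct row names. withReplacement: False means no row
    # is picked twice; a fixed seed makes the sample reproducible.
    rng = random.Random(seed)
    return rng.sample(sorted(dataset), rows)

tokens = {"maximum": 1, "nervous": 1, "random": 1, "studied": 1}
picked = sample_rows(tokens, 2)
```

Because the seed is fixed, repeated calls return the same sample, which matches the behavior of the sampled dataset configured with "seed": 0 above.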

Again, the two methods produce the same outcome, leaving you free to choose whichever best fits your workflow.

Using datasets vs. SQL functions

As seen above, the two different ways to either transpose or sample data are equivalent. It is recommended to use Dataset types instead of SQL functions when the created table will be reused or called later in the program.

As seen in this tutorial, MLDB allows you to virtually manipulate data in multiple ways. This grants greater flexibility when constructing models and analyzing complex data.

Where to next?

Check out the other Tutorials and Demos.