MongoDB Import Procedure

The MongoDB Import Procedure type is used to import a MongoDB collection into a dataset.

Caveat

Configuration

A new procedure of type mongodb.import named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "mongodb.import",
    "params": {
        "runOnCreation": <bool>,
        "uriConnectionScheme": <string>,
        "collection": <string>,
        "outputDataset": <OutputDatasetSpec>,
        "limit": <int>,
        "offset": <int>,
        "ignoreParsingErrors": <bool>,
        "select": <SqlSelectExpression>,
        "where": <string>,
        "named": <string>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.

uriConnectionScheme
string

MongoDB connection URI, in the form mongodb://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/[database]]

collection
string

The collection to import.

outputDataset
OutputDatasetSpec
{"type":"sparse.mutable"}

Output dataset configuration. This may refer either to an existing dataset, or a fully specified but non-existing dataset which will be created by the procedure.

limit
int
0

Maximum number of records to process.

offset
int
0

Number of records to skip at the start of the collection.

ignoreParsingErrors
bool
false

If true, any record that causes an error is skipped. Any record containing a BSON regex or a BSON internal data type causes an error.

select
SqlSelectExpression
"*"

Which columns to use.

where
string
"true"

Which records to use to create rows.

named
string
"oid()"

Row name expression for output dataset. Note that each row must have a unique name and that names cannot be objects. The default value, oid(), refers to the MongoDB ObjectID.
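The parameters above can be assembled programmatically before being sent to MLDB. The following is a minimal sketch, not part of the MLDB API: the helpers mongodb_uri and import_params, the host localhost:27017, and the credentials are all hypothetical and exist only for illustration. It handles the single-host form of the connection URI and percent-encodes the credentials, since characters such as @ and : would otherwise be ambiguous in the URI.

```python
from urllib.parse import quote_plus


def mongodb_uri(host, port=None, database=None, username=None, password=None):
    """Assemble a single-host mongodb:// URI (hypothetical helper)."""
    credentials = ""
    if username is not None and password is not None:
        # Percent-encode credentials so '@' and ':' inside them stay unambiguous.
        credentials = f"{quote_plus(username)}:{quote_plus(password)}@"
    location = host if port is None else f"{host}:{port}"
    path = "" if database is None else f"/{database}"
    return f"mongodb://{credentials}{location}{path}"


def import_params(uri, collection, dataset_id, **overrides):
    """Build the "params" object of a mongodb.import procedure (hypothetical helper)."""
    params = {
        "uriConnectionScheme": uri,
        "collection": collection,
        "outputDataset": {"id": dataset_id, "type": "sparse.mutable"},
    }
    # Only the optional fields documented in the table above are accepted.
    allowed = {"runOnCreation", "limit", "offset", "ignoreParsingErrors",
               "select", "where", "named"}
    unknown = set(overrides) - allowed
    if unknown:
        raise ValueError(f"unknown mongodb.import parameters: {sorted(unknown)}")
    params.update(overrides)
    return params


# Assumed values: a local MongoDB instance and made-up credentials.
uri = mongodb_uri("localhost", port=27017, database="zips",
                  username="reader", password="p@ss:word")
config = {
    "type": "mongodb.import",
    "params": import_params(uri, "zips", "mongodb_zips",
                            limit=100, offset=10, ignoreParsingErrors=True),
}
print(uri)
```

The resulting config dictionary has the same shape as the body passed to mldb.put in the template above. Fields that are left out keep the defaults listed in the table.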

Example

For this example, we will use a MongoDB database populated with data provided by the book MongoDB In Action. The zipped JSON file is available at http://mng.bz/dOpd.

Here we import the zips collection into an MLDB dataset called mongodb_zips.

mldb.post('/v1/procedures', {
    'type' : 'mongodb.import',
    'params' : {
        'uriConnectionScheme': 'mongodb://somehost.mldb.ai:11712/zips',
        'collection': 'zips',
        'outputDataset' : {
            'id' : 'mongodb_zips',
            'type' : 'sparse.mutable'
        }
    }
})

We can now query the imported data as we would any other MLDB Dataset.

mldb.query("SELECT * FROM mongodb_zips LIMIT 5")
_id city loc.x loc.y pop state zip
_rowName
57d2f5eb21af5ee9c4e27f08 57d2f5eb21af5ee9c4e27f08 BONDURANT 110.335287 43.223798 116 WY
57d2f5eb21af5ee9c4e27f07 57d2f5eb21af5ee9c4e27f07 KAYCEE 106.563230 43.723625 876 WY
57d2f5eb21af5ee9c4e27f05 57d2f5eb21af5ee9c4e27f05 CLEARMONT 106.458071 44.661010 350 WY
57d2f5eb21af5ee9c4e27f03 57d2f5eb21af5ee9c4e27f03 ARVADA 106.109191 44.689876 107 WY
57d2f5eb21af5ee9c4e27f01 57d2f5eb21af5ee9c4e27f01 COKEVILLE 110.916419 42.057983 905 WY

Because we did not provide the named parameter, its default value, oid(), was used. This is why _rowName and _id have the same values.

Another element to note is how the loc object was imported. The sub-object was flattened, and its fields were imported into MLDB as the columns loc.x and loc.y.
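The column naming described above can be illustrated with a short sketch. This is not MLDB's implementation, only a plain Python illustration of how a nested document maps to dot-separated column names. The flatten helper is hypothetical, and the sample document reuses values from the query output shown earlier.

```python
def flatten(document, prefix=""):
    """Flatten nested dicts into one level with dot-separated keys."""
    columns = {}
    for key, value in document.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-objects, extending the column name prefix.
            columns.update(flatten(value, name))
        else:
            columns[name] = value
    return columns


doc = {
    "city": "BONDURANT",
    "loc": {"x": 110.335287, "y": 43.223798},
    "pop": 116,
    "state": "WY",
}
row = flatten(doc)
print(sorted(row))  # ['city', 'loc.x', 'loc.y', 'pop', 'state']
```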