Word2Vec importer procedure

This procedure allows word and phrase embeddings from the Word2Vec tool to be loaded into MLDB.

Using these embeddings, each word or phrase in a language is convertible to a multi-dimensional set of coordinates (typically hundreds of coordinates are used). This allows for natural language to be represented in a form that is compatible with standard classification or clustering algorithms.
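As a toy illustration of why coordinate representations are useful (with made-up 3-dimensional vectors — real Word2Vec embeddings typically have 300 dimensions), cosine similarity between two word vectors gives a numeric measure of semantic closeness that any standard algorithm can consume:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", for illustration only
embeddings = {
    "france":  [0.9, 0.1, 0.3],
    "belgium": [0.8, 0.2, 0.35],
    "banana":  [0.1, 0.9, 0.7],
}

# Related words end up with a higher similarity than unrelated ones
print(cosine_similarity(embeddings["france"], embeddings["belgium"]))
print(cosine_similarity(embeddings["france"], embeddings["banana"]))
```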

Configuration

A new procedure of type import.word2vec named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "import.word2vec",
    "params": {
        "dataFileUrl": <Url>,
        "outputDataset": <OutputDatasetSpec>,
        "offset": <int>,
        "limit": <int>,
        "named": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| dataFileUrl | Url | | URL to load the word2vec binary data file from |
| outputDataset | OutputDatasetSpec | {"type":"embedding"} | Output dataset for result |
| offset | int | 0 | Start at word number (0 = start) |
| limit | int | -1 | Limit on number of rows to record (-1 = all) |
| named | string | "word" | Row name expression for output dataset. Note that each row must have a unique name. |
| runOnCreation | bool | true | If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run. |

The dataFileUrl parameter should point to a data file produced by the word2vec tool. A good default, containing embeddings for the 3 million most frequent English words and phrases, is available here (warning: it is a 1.5 GB download), or directly here.

The file should be copied to a local file system or a high-bandwidth service, and optionally decompressed, before being opened by MLDB. MLDB will require around 8 GB of memory to hold the entire file in an embedding dataset.
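For reference, the word2vec binary format consists of an ASCII header giving the vocabulary size and dimensionality, followed by each word (terminated by a space) and its vector as little-endian float32 values. The following is a minimal, self-contained sketch of a writer/reader pair for that layout (the real tool also emits a trailing newline after each vector, which readers typically skip); it also mimics the importer's offset and limit semantics:

```python
import io
import struct

def write_word2vec_bin(words_vectors):
    """Serialize {word: [floats]} in the word2vec binary layout:
    an ASCII header "vocab_size dims\n", then for each entry the
    word, a space, and dims little-endian float32 values."""
    buf = io.BytesIO()
    dims = len(next(iter(words_vectors.values())))
    buf.write(f"{len(words_vectors)} {dims}\n".encode("utf-8"))
    for word, vec in words_vectors.items():
        buf.write(word.encode("utf-8") + b" ")
        buf.write(struct.pack(f"<{dims}f", *vec))
    return buf.getvalue()

def read_word2vec_bin(data, offset=0, limit=-1):
    """Parse the layout back, skipping the first `offset` words and
    keeping at most `limit` (-1 = all), like the importer's parameters."""
    buf = io.BytesIO(data)
    header = buf.readline().split()
    vocab_size, dims = int(header[0]), int(header[1])
    result = {}
    for i in range(vocab_size):
        word_bytes = bytearray()
        while (c := buf.read(1)) != b" ":
            word_bytes += c
        vec = struct.unpack(f"<{dims}f", buf.read(4 * dims))
        if i >= offset and (limit < 0 or len(result) < limit):
            result[word_bytes.decode("utf-8")] = list(vec)
    return result

# Round-trip a tiny synthetic vocabulary (values chosen to be
# exactly representable as float32)
data = write_word2vec_bin({"france": [0.5, 0.25], "paris": [0.75, 1.0]})
print(read_word2vec_bin(data, offset=1))  # only "paris" survives the skip
```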

The limit parameter allows only the first n words of a file to be loaded. This is useful when only embeddings for the most frequent words are required.

The offset parameter allows a number of words to be skipped. This is useful when loading multiple datasets in parallel.
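For example, to split a 100,000-word file across several parallel import procedures, each procedure can be given a disjoint offset/limit range. A small sketch of the partitioning arithmetic (the shard counts here are illustrative, not part of the API):

```python
def partition(total_words, n_shards):
    """Compute contiguous (offset, limit) pairs that cover
    total_words exactly, one pair per parallel import procedure."""
    base, extra = divmod(total_words, n_shards)
    shards, offset = [], 0
    for i in range(n_shards):
        limit = base + (1 if i < extra else 0)  # spread any remainder
        shards.append({"offset": offset, "limit": limit})
        offset += limit
    return shards

# Four procedures, each importing 25,000 consecutive words
print(partition(100000, 4))
```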

Example

Sample code to load the word2vec data into an "embedding" dataset type and determine the closest words to "France":

mldb.put("/v1/procedures/w2vimport", {
    "type": "import.word2vec",
    "params": {
        "dataFileUrl": "file:///path/to/GoogleNews-vectors-negative300.bin",
        "output": "w2v",
        "limit": 100000
    }
})

mldb.put("/v1/functions/w2v_neighbors", {
    "type": "embedding.neighbors",
    "params": {
        "dataset": "w2v"
    }
})

mldb.query("SELECT w2v_neighbors({coords: 'France'})[neighbors] as *")

This gives the output:

| rowName | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| result | France | Belgium | French | Germany | Paris | Spain | Italy | Europe | Morocco | Switzerland |
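Conceptually, the embedding.neighbors function performs a nearest-neighbour search over the stored vectors (note that the query word itself comes back as neighbour 0, as in the output above). A brute-force sketch with hypothetical 2-D toy vectors — not MLDB's actual indexed implementation:

```python
import math

def nearest(query_word, embeddings, k=3):
    """Return the k words whose vectors are closest (Euclidean
    distance) to the query word's vector; a linear scan, whereas
    MLDB's embedding dataset uses an index."""
    q = embeddings[query_word]
    ranked = sorted(embeddings, key=lambda w: math.dist(q, embeddings[w]))
    return ranked[:k]

# Hypothetical 2-dimensional vectors, for illustration only
embeddings = {
    "France":  [1.0, 1.0],
    "Belgium": [1.1, 0.9],
    "Germany": [1.3, 1.2],
    "banana":  [9.0, 9.0],
}

print(nearest("France", embeddings, k=3))
# the query word itself ranks first, at distance zero
```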

See also