Word2Vec importer procedure

This procedure allows word and phrase embeddings from the Word2Vec tool to be loaded into MLDB.

Using these embeddings, each word or phrase in a language is convertible to a multi-dimensional set of coordinates (typically hundreds of coordinates are used). This allows for natural language to be represented in a form that is compatible with standard classification or clustering algorithms.
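As a toy illustration of why coordinate representations are useful (with made-up 3-dimensional vectors — real Word2Vec embeddings typically have 300 dimensions), cosine similarity between two word vectors gives a numeric measure of semantic closeness that any standard algorithm can consume:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", for illustration only
embeddings = {
    "france":  [0.9, 0.1, 0.3],
    "belgium": [0.8, 0.2, 0.35],
    "banana":  [0.1, 0.9, 0.7],
}

# Related words end up with a higher similarity than unrelated ones
print(cosine_similarity(embeddings["france"], embeddings["belgium"]))
print(cosine_similarity(embeddings["france"], embeddings["banana"]))
```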

Configuration

A new procedure of type import.word2vec named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "import.word2vec",
    "params": {
        "dataFileUrl": <Url>,
        "outputDataset": <OutputDatasetSpec>,
        "offset": <int>,
        "limit": <int>,
        "named": <string>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| dataFileUrl | Url | | URL to load the word2vec binary data file from |
| outputDataset | OutputDatasetSpec | {"type":"embedding"} | Output dataset for result |
| offset | int | 0 | Start at word number (0 = start) |
| limit | int | -1 | Limit on number of rows to record (-1 = all) |
| named | string | "word" | Row name expression for output dataset. Note that each row must have a unique name. |
| runOnCreation | bool | true | If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run. |

The dataFileUrl parameter should point to a data file produced by the word2vec tool. A good default, containing embeddings for the 3 million most frequent English words and phrases, is available here (warning: it is a 1.5 GB download), or directly here.

The file should be copied to a local file system or a high-bandwidth service, and optionally decompressed, before being opened by MLDB. MLDB will require around 8 GB of memory to hold the entire file in an embedding dataset.
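For reference, the word2vec binary format consists of an ASCII header giving the vocabulary size and dimensionality, followed by each word (terminated by a space) and its vector as little-endian float32 values. The following is a minimal, self-contained sketch of a writer/reader pair for that layout (the real tool also emits a trailing newline after each vector, which readers typically skip); it also mimics the importer's offset and limit semantics:

```python
import io
import struct

def write_word2vec_bin(words_vectors):
    """Serialize {word: [floats]} in the word2vec binary layout:
    an ASCII header "vocab_size dims\n", then for each entry the
    word, a space, and dims little-endian float32 values."""
    buf = io.BytesIO()
    dims = len(next(iter(words_vectors.values())))
    buf.write(f"{len(words_vectors)} {dims}\n".encode("utf-8"))
    for word, vec in words_vectors.items():
        buf.write(word.encode("utf-8") + b" ")
        buf.write(struct.pack(f"<{dims}f", *vec))
    return buf.getvalue()

def read_word2vec_bin(data, offset=0, limit=-1):
    """Parse the layout back, skipping the first `offset` words and
    keeping at most `limit` (-1 = all), like the importer's parameters."""
    buf = io.BytesIO(data)
    header = buf.readline().split()
    vocab_size, dims = int(header[0]), int(header[1])
    result = {}
    for i in range(vocab_size):
        word_bytes = bytearray()
        while (c := buf.read(1)) != b" ":
            word_bytes += c
        vec = struct.unpack(f"<{dims}f", buf.read(4 * dims))
        if i >= offset and (limit < 0 or len(result) < limit):
            result[word_bytes.decode("utf-8")] = list(vec)
    return result

# Round-trip a tiny synthetic vocabulary (values chosen to be
# exactly representable as float32)
data = write_word2vec_bin({"france": [0.5, 0.25], "paris": [0.75, 1.0]})
print(read_word2vec_bin(data, offset=1))  # only "paris" survives the skip
```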

The limit parameter allows only the first n words of a file to be loaded. This is useful when only embeddings for the most frequent words are required.

The offset parameter allows a number of words to be skipped. This is useful when loading multiple datasets in parallel.
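For example, to split a 100,000-word file across several parallel import procedures, each procedure can be given a disjoint offset/limit range. A small sketch of the partitioning arithmetic (the shard counts here are illustrative, not part of the API):

```python
def partition(total_words, n_shards):
    """Compute contiguous (offset, limit) pairs that cover
    total_words exactly, one pair per parallel import procedure."""
    base, extra = divmod(total_words, n_shards)
    shards, offset = [], 0
    for i in range(n_shards):
        limit = base + (1 if i < extra else 0)  # spread any remainder
        shards.append({"offset": offset, "limit": limit})
        offset += limit
    return shards

# Four procedures, each importing 25,000 consecutive words
print(partition(100000, 4))
```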

Example

Sample code to load the word2vec data into an "embedding" dataset type and determine the closest words to "France":

mldb.put("/v1/procedures/w2vimport", {
    "type": "import.word2vec",
    "params": {
        "dataFileUrl": "file:///path/to/GoogleNews-vectors-negative300.bin",
        "output": "w2v",
        "limit": 100000
    }
})

mldb.put("/v1/functions/w2v_neighbors", {
    "type": "embedding.neighbors",
    "params": {
        "dataset": "w2v"
    }
})

mldb.query("SELECT w2v_neighbors({coords: 'France'})[neighbors] as *")

This gives the output:

| rowName | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| result | France | Belgium | French | Germany | Paris | Spain | Italy | Europe | Morocco | Switzerland |
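Conceptually, the embedding.neighbors function performs a nearest-neighbour search over the stored vectors (note that the query word itself comes back as neighbour 0, as in the output above). A brute-force sketch with hypothetical 2-D toy vectors — not MLDB's actual indexed implementation:

```python
import math

def nearest(query_word, embeddings, k=3):
    """Return the k words whose vectors are closest (Euclidean
    distance) to the query word's vector; a linear scan, whereas
    MLDB's embedding dataset uses an index."""
    q = embeddings[query_word]
    ranked = sorted(embeddings, key=lambda w: math.dist(q, embeddings[w]))
    return ranked[:k]

# Hypothetical 2-dimensional vectors, for illustration only
embeddings = {
    "France":  [1.0, 1.0],
    "Belgium": [1.1, 0.9],
    "Germany": [1.3, 1.2],
    "banana":  [9.0, 9.0],
}

print(nearest("France", embeddings, k=3))
# the query word itself ranks first, at distance zero
```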

See also