This procedure allows word and phrase embeddings from the Word2Vec tool to be loaded into MLDB.
Using these embeddings, each word or phrase in a language can be converted to a point in a multi-dimensional coordinate space (typically with hundreds of dimensions). This allows natural language to be represented in a form that is compatible with standard classification or clustering algorithms.
A new procedure of type import.word2vec named <id> can be created as follows:
mldb.put("/v1/procedures/"+<id>, {
    "type": "import.word2vec",
    "params": {
        "dataFileUrl": <Url>,
        "outputDataset": <OutputDatasetSpec>,
        "offset": <int>,
        "limit": <int>,
        "named": <string>,
        "runOnCreation": <bool>
    }
})
with the following key-value definitions for params:
Field | Type | Description |
---|---|---|
dataFileUrl | Url | URL of the word2vec data file to load |
outputDataset | OutputDatasetSpec | Output dataset for result |
offset | int | Start at word number (0 = start) |
limit | int | Maximum number of rows to record (-1 = all) |
named | string | Row name expression for output dataset. Note that each row must have a unique name. |
runOnCreation | bool | If true, the procedure will be run immediately. The response will contain an extra field called |
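As a rough illustration of how the fields above fit together, the request body can be assembled in plain Python before it is passed to mldb.put. The helper name build_word2vec_import_config is hypothetical, and the default values it uses (offset 0, limit -1, runOnCreation true) are assumptions read off the field descriptions, not confirmed MLDB defaults.

```python
import json


def build_word2vec_import_config(data_file_url, output_dataset,
                                 offset=0, limit=-1, run_on_creation=True):
    """Hypothetical helper: build the JSON body for an import.word2vec procedure.

    The defaults here are assumptions based on the parameter descriptions
    (0 = start, -1 = all); they are not taken from the MLDB source.
    """
    if offset < 0:
        raise ValueError("offset must be >= 0")
    if limit < -1:
        raise ValueError("limit must be -1 (all) or a non-negative count")
    return {
        "type": "import.word2vec",
        "params": {
            "dataFileUrl": data_file_url,
            "outputDataset": output_dataset,
            "offset": offset,
            "limit": limit,
            "runOnCreation": run_on_creation,
        },
    }


config = build_word2vec_import_config(
    "file:///path/to/GoogleNews-vectors-negative300.bin", "w2v", limit=100000)
print(json.dumps(config, indent=4))
```

The resulting dictionary could then be sent with mldb.put("/v1/procedures/" + <id>, config), as in the template above.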
The dataFileUrl parameter should point to a data file produced by the word2vec tool. A good default, containing the 3 million most frequent words and phrases for English, is available here (warning: it is a 1.5 GB download), or directly here. The file should be copied to a local file system or a high-bandwidth service, and optionally decompressed, before being opened from MLDB. MLDB will require around 8 GB of memory to hold the entire file in an embedding dataset.
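To get a feel for the size of such a file, the sketch below reads the text header that the word2vec tool writes at the start of a binary file (vocabulary size and vector dimension) and estimates the space taken by the raw float32 coordinates alone. The header values used here (3000000 entries, 300 dimensions, the latter inferred from the example file name) are assumptions for illustration, and the estimate says nothing about the additional memory MLDB needs for its own data structures.

```python
import io


def read_word2vec_header(stream):
    """Read the 'vocab_size dim' text line at the start of a binary word2vec file."""
    header = stream.readline()
    vocab_size, dim = (int(token) for token in header.split())
    return vocab_size, dim


def raw_vector_bytes(vocab_size, dim, bytes_per_float=4):
    """Bytes taken by the coordinates alone, assuming 32-bit floats."""
    return vocab_size * dim * bytes_per_float


# A stand-in for the first line of a real file opened with open(path, "rb").
# These header values are assumed, not read from an actual download.
vocab_size, dim = read_word2vec_header(io.BytesIO(b"3000000 300\n"))
size = raw_vector_bytes(vocab_size, dim)
print(f"{vocab_size} vectors of dimension {dim}: {size / 1e9:.1f} GB of raw coordinates")
```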
The limit parameter allows only the first n words of a file to be loaded. This is useful when only embeddings for the most frequent words are required.
The offset parameter allows a number of words to be skipped. This is useful when loading multiple datasets in parallel.
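The effect of these two parameters can be sketched with a small standalone reader for the binary layout written by the word2vec tool (a text header, then for each entry the word, a space, and the coordinates as 32-bit floats). This is not MLDB's implementation. The function names are hypothetical, and the interpretation that offset skips entries first and limit then counts the entries kept is an assumption.

```python
import io
import struct


def iter_word2vec(stream, offset=0, limit=-1):
    """Yield (word, vector) pairs, skipping `offset` entries and keeping at most `limit`."""
    vocab_size, dim = (int(token) for token in stream.readline().split())
    record_bytes = 4 * dim  # dim little-endian float32 values per word
    emitted = 0
    for index in range(vocab_size):
        if limit != -1 and emitted >= limit:
            break
        word = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("file ended inside a word")
            if ch == b" ":
                break
            if ch != b"\n":  # ignore the newline that may follow the previous vector
                word.extend(ch)
        raw = stream.read(record_bytes)
        if len(raw) != record_bytes:
            raise EOFError("file ended inside a vector")
        if index < offset:
            continue  # skipped entries are still read so the stream stays aligned
        emitted += 1
        yield word.decode("utf-8", errors="replace"), struct.unpack(f"<{dim}f", raw)


def make_toy_file(entries):
    """Build a tiny in-memory file in the same layout, for demonstration only."""
    dim = len(entries[0][1])
    out = io.BytesIO()
    out.write(f"{len(entries)} {dim}\n".encode("utf-8"))
    for word, vector in entries:
        out.write(word.encode("utf-8") + b" ")
        out.write(struct.pack(f"<{dim}f", *vector))
        out.write(b"\n")
    out.seek(0)
    return out


entries = [("the", [1.0, 0.0]), ("of", [0.0, 1.0]),
           ("France", [0.5, 0.5]), ("Paris", [0.25, 0.75])]
words = [w for w, _ in iter_word2vec(make_toy_file(entries), offset=1, limit=2)]
print(words)
```

Under this reading, parallel loaders would each be given a different offset and the same limit, so that every slice of the file is loaded exactly once.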
The following example loads the word2vec dataset into an "embedding" dataset type and determines the closest words to "France":
# Import the first 100000 words into a dataset named "w2v"
mldb.put("/v1/procedures/w2vimport", {
    "type": "import.word2vec",
    "params": {
        "dataFileUrl": "file:///path/to/GoogleNews-vectors-negative300.bin",
        "outputDataset": "w2v",
        "limit": 100000
    }
})

# Create a function that returns the nearest neighbors in "w2v"
mldb.put("/v1/functions/w2v_neighbors", {
    "type": "embedding.neighbors",
    "params": {
        "dataset": "w2v"
    }
})

mldb.query("SELECT w2v_neighbors({coords: 'France'})[neighbors] as *")
This gives the output:
rowName | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
result | France | Belgium | French | Germany | Paris | Spain | Italy | Europe | Morocco | Switzerland |
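Conceptually, a nearest-neighbor lookup ranks every row of the embedding by its closeness to the query row. The toy sketch below does this with cosine similarity over four made-up three-dimensional vectors. The vectors, the function names, and the choice of cosine similarity are all assumptions for illustration; the metric and index MLDB actually uses are not stated here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(embeddings, query_word, k=3):
    """Return the k words whose vectors are most similar to the query word's vector."""
    query = embeddings[query_word]
    ranked = sorted(embeddings.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [word for word, _ in ranked[:k]]


# Made-up coordinates, not real word2vec output.
toy_embeddings = {
    "France": [0.90, 0.10, 0.00],
    "Belgium": [0.85, 0.15, 0.05],
    "Paris": [0.80, 0.30, 0.10],
    "banana": [0.00, 0.20, 0.90],
}
print(nearest_neighbors(toy_embeddings, "France", k=3))
```

As in the table above, the query word is its own closest neighbor, so it appears first in the result.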
- The embedding.neighbors function type is used to get the nearest neighbor rows in an existing embedding dataset.
- The pooling function type is used to embed a bag of words in a vector space like Word2Vec.
- The embedding dataset type is the perfect dataset to hold the output of the word2vec tool.