Procedures are named, reusable programs used to implement long-running batch operations with no return values. Procedures generally run over Datasets and can be configured via SQL expressions. The outputs of a Procedure can include Datasets and files. Typical uses include importing data, transforming datasets, and training machine-learning models; the full list of procedure types is given in the table below.
Creating a Procedure does not automatically cause it to run unless the `runOnCreation` flag is set. For most procedures, a first run can be performed on creation by setting `runOnCreation` to `true` in the parameters; refer to the specific procedure's documentation to see if it supports this. Procedures are run via a REST API call `POST /v1/procedures/<id>/runs {<parameters>}`, where `<parameters>` can override any of the parameters given to the procedure on creation.
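As a sketch of this REST call using only the Python standard library (the server address and procedure name are assumptions for illustration, not part of MLDB's documentation):

```python
import json
import urllib.request

MLDB = "http://localhost:1994"  # hypothetical server address

def run_endpoint(procedure_id):
    """Endpoint used to launch a new run of an existing procedure."""
    return "%s/v1/procedures/%s/runs" % (MLDB, procedure_id)

def launch_run(procedure_id, parameter_overrides=None):
    """POST a run; any parameters given here override those supplied
    when the procedure was created (requires a live server)."""
    body = json.dumps(parameter_overrides or {}).encode("utf-8")
    req = urllib.request.Request(run_endpoint(procedure_id), data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# launch_run("my_transform", {"params": {"outputDataset": "out"}})
```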
A procedure returns its results through the outputs of its runs. Each run of a procedure has a JSON output similar to this:
```
{
    "runStarted": "2015-10-21T19:33:00.091Z",
    "state": "finished",
    "runFinished": "2015-10-21T19:33:00.151Z",
    "id": "2015-10-21T19:33:00.090622Z-5bc7042b732cb41f"
}
```
When making a synchronous call (the default) to create a run, that output is returned in the body of the response. For asynchronous calls, the output is available by performing a `GET` on `/v1/procedures/<id>/runs/<id>`.
In addition, each run has a JSON `details` output, which can be queried by performing a `GET` on `/v1/procedures/<id>/runs/<id>/details`. This output depends on the procedure, but it may contain more detailed information about what was done, including elements such as logs of messages and errors.
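Following the paths above, a small helper for fetching a run's output or its `details` might look like this (a sketch using Python's standard library; the server address is an assumption):

```python
import json
import urllib.request

MLDB = "http://localhost:1994"  # hypothetical server address

def run_url(procedure_id, run_id, details=False):
    """URL of a run's JSON output, or of its more verbose details."""
    url = "%s/v1/procedures/%s/runs/%s" % (MLDB, procedure_id, run_id)
    return url + "/details" if details else url

def get_json(url):
    """GET a URL and decode the JSON body (requires a live server)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# state = get_json(run_url(pid, rid))["state"]
# log   = get_json(run_url(pid, rid, details=True))
```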
Once created, a procedure returns its progress via a `GET` at `/v1/procedures/<id>`. This URI can be obtained from the `Location` header that is part of the creation response. Here is an example of a progress response for the `bucketize` procedure:
"progress": {
"steps": [
{
"started": "2016-12-15T19:43:52.9386692Z",
"ended": "2016-12-15T19:43:52.9768956Z",
"type": "percentile",
"name": "iterating",
"value": 1.0
},
{
"started": "2016-12-15T19:43:52.9768965Z",
"type": "percentile",
"name": "bucketizing",
"value": 0.8191999793052673
}
]
},
"state": "executing",
"id": "2016-12-15T19:43:52.938291Z-463496b56263af05"
}
Other procedures will have similar responses. Note that progress reporting is currently implemented for procedures of type `transform`, `import.text` and `bucketize`.
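To track such a response programmatically, one might summarise the `progress.steps` list, treating a step without an `ended` timestamp as the one currently running. A sketch (the function name is mine; the example data mirrors the response above):

```python
def summarise_progress(progress):
    """Return (number_of_finished_steps, (name, value) of the current
    step or None) from a progress object like the one shown above."""
    steps = progress.get("steps", [])
    done = [s for s in steps if "ended" in s]
    current = next((s for s in steps if "ended" not in s), None)
    if current is None:
        return len(done), None
    return len(done), (current["name"], current["value"])

example = {"steps": [
    {"started": "2016-12-15T19:43:52.9386692Z",
     "ended": "2016-12-15T19:43:52.9768956Z",
     "type": "percentile", "name": "iterating", "value": 1.0},
    {"started": "2016-12-15T19:43:52.9768965Z",
     "type": "percentile", "name": "bucketizing",
     "value": 0.8191999793052673}]}
# summarise_progress(example) == (1, ("bucketizing", 0.8191999793052673))
```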
Procedures can take a long time to execute. It is possible to interrupt a running procedure by performing a `PUT` at `/v1/procedures/<idp>/runs/<idr>/state`, where `idp` is the procedure id and `idr` is the run id, with the following payload:

```
{
    "state": "cancelled"
}
```
Note that some processing is not cancellable. As a result, the procedure may continue running for
some time before it is finally interrupted.
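A cancellation helper following the `PUT` described above could be sketched as follows (server address and function names are assumptions; remember that the run may keep executing for a while after the call returns):

```python
import json
import urllib.request

MLDB = "http://localhost:1994"  # hypothetical server address

def cancel_payload():
    """Body sent with the PUT that requests cancellation."""
    return json.dumps({"state": "cancelled"})

def cancel_run(procedure_id, run_id):
    """PUT {"state": "cancelled"} on the run's state resource
    (requires a live server)."""
    url = "%s/v1/procedures/%s/runs/%s/state" % (MLDB, procedure_id, run_id)
    req = urllib.request.Request(url, data=cancel_payload().encode("utf-8"),
                                 headers={"Content-Type": "application/json"},
                                 method="PUT")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```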
Procedures are created via a REST API call with one of the following types:
Type | Description | Doc |
---|---|---|
classifier.experiment | Train and test a classifier | [doc] |
classifier.test | Calculate the accuracy of a classifier on held-out data | [doc] |
classifier.train | Train a supervised classifier | [doc] |
export.csv | Exports a dataset to a target location as a CSV | [doc] |
import.git | Import a Git repository's metadata into MLDB | [doc] |
import.json | Import a text file with one JSON per line into MLDB | [doc] |
import.sentiwordnet | Import a SentiWordNet file into MLDB | [doc] |
import.text | Import from a text file, line by line | [doc] |
import.word2vec | Import a word2vec file into MLDB | [doc] |
kmeans.train | Simple clustering algorithm based on cluster centroids in embedding space | [doc] |
melt | Performs a melt operation on a dataset | [doc] |
mongodb.import | Import a dataset from MongoDB | [doc] |
permuter.run | Run a child procedure with permutations of its configuration | [doc] |
probabilizer.train | Trains a model to calibrate a score into a probability | [doc] |
randomforest.binary.train | Train a supervised binary random forest | [doc] |
statsTable.bagOfWords.train | Create statistical tables of trials against outcomes for bag of words | [doc] |
statsTable.train | Create statistical tables of trials against outcomes | [doc] |
summary.statistics | Creates a dataset with summary statistics for each column of an input dataset | [doc] |
svd.train | Train a SVD to convert rows or columns to embedding coordinates | [doc] |
tfidf.train | Prepare data for a TF-IDF function | [doc] |
transform | Apply an SQL expression over a dataset to transform into another dataset | [doc] |
tsne.train | Project a high dimensional space into a low-dimensional space suitable for visualization | [doc] |
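As a sketch of creating a procedure of one of these types (the server address, procedure id, and SQL are hypothetical; the config helper simply wraps the type and parameters described above, including the `runOnCreation` flag):

```python
import json
import urllib.request

MLDB = "http://localhost:1994"  # hypothetical server address

def procedure_config(proc_type, params, run_on_creation=True):
    """Assemble a procedure configuration; runOnCreation in the
    parameters triggers a first run as soon as the procedure exists."""
    params = dict(params, runOnCreation=run_on_creation)
    return {"type": proc_type, "params": params}

def create_procedure(procedure_id, config):
    """PUT the configuration to create (and possibly run) the
    procedure (requires a live server)."""
    url = "%s/v1/procedures/%s" % (MLDB, procedure_id)
    req = urllib.request.Request(url, data=json.dumps(config).encode("utf-8"),
                                 headers={"Content-Type": "application/json"},
                                 method="PUT")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Hypothetical transform over an input dataset:
# create_procedure("my_transform", procedure_config("transform", {
#     "inputData": "SELECT * FROM my_input",
#     "outputDataset": "my_output"}))
```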