Intro to Procedures

Procedures are named, reusable programs used to implement long-running batch operations with no return values. Procedures generally run over Datasets and can be configured via SQL expressions. The outputs of a Procedure can include Datasets and files. Procedures are used to:

Running a Procedure

Creating a Procedure does not cause it to run unless the runOnCreation flag is set to true in its parameters. Most procedures support this flag, which performs a first run as soon as the procedure is created; refer to the documentation of the specific procedure type to check whether it is supported. Otherwise, Procedures are run via a REST API call POST /v1/procedures/<id>/runs {<parameters>}, where <parameters> can override any of the parameters given to the procedure on creation.
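As a rough illustration of the call described above, the following Python sketch builds the POST request using only the standard library. The host http://localhost:8080 and the procedure id my_transform are hypothetical, and sending the overriding parameters directly as the JSON body is an assumption; check the REST API reference for the exact body layout.

```python
import json
import urllib.request


def build_run_request(base_url, procedure_id, parameters=None):
    """Build (but do not send) a POST /v1/procedures/<id>/runs request."""
    url = f"{base_url.rstrip('/')}/v1/procedures/{procedure_id}/runs"
    # Assumption: the overriding parameters are sent as the JSON body.
    body = json.dumps(parameters or {}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and procedure id, for illustration only.
request = build_run_request("http://localhost:8080", "my_transform")
print(request.get_method(), request.full_url)

# To actually trigger the run against a live server:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything reaches the server.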

Obtaining results of a procedure

A procedure may return results as follows:

Getting the progress of a procedure

Once created, a procedure reports its progress via a GET at /v1/procedures/<id>. This URI can be obtained from the Location header of the creation response. Here is an example of a progress response for the bucketize procedure:

{
    "progress": {
        "steps": [
            {
                "started": "2016-12-15T19:43:52.9386692Z",
                "ended": "2016-12-15T19:43:52.9768956Z",
                "type": "percentile",
                "name": "iterating",
                "value": 1.0
            },
            {
                "started": "2016-12-15T19:43:52.9768965Z",
                "type": "percentile",
                "name": "bucketizing",
                "value": 0.8191999793052673
            }
        ]
    },
    "state": "executing",
    "id": "2016-12-15T19:43:52.938291Z-463496b56263af05"
}

Other procedures return similar responses. Note that progress reporting is currently implemented for procedures of type transform, import.text and bucketize.
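To show how such a response might be consumed, here is a minimal Python sketch that summarizes each step of a progress document. It works on a copy of the example response above rather than on a live server. Treating a step as finished when it carries an "ended" field is an assumption drawn from that single example, and the function name summarize_progress is made up for illustration.

```python
import json


def summarize_progress(response):
    """Return (state, [(step name, fraction done, finished?), ...])."""
    steps = []
    for step in response.get("progress", {}).get("steps", []):
        # Assumption: only completed steps carry an "ended" timestamp.
        finished = "ended" in step
        steps.append((step["name"], step["value"], finished))
    return response.get("state"), steps


# Copy of the example progress response for the bucketize procedure.
sample = json.loads("""
{
    "progress": {
        "steps": [
            {"started": "2016-12-15T19:43:52.9386692Z",
             "ended": "2016-12-15T19:43:52.9768956Z",
             "type": "percentile", "name": "iterating", "value": 1.0},
            {"started": "2016-12-15T19:43:52.9768965Z",
             "type": "percentile", "name": "bucketizing",
             "value": 0.8191999793052673}
        ]
    },
    "state": "executing",
    "id": "2016-12-15T19:43:52.938291Z-463496b56263af05"
}
""")

state, steps = summarize_progress(sample)
print("state:", state)
for name, value, finished in steps:
    print(f"{name}: {value:.1%} ({'done' if finished else 'running'})")
```

A polling loop could call a function like this on each GET response and stop once the state is no longer "executing"; the set of possible state values is not listed here, so that check is left out of the sketch.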

Cancelling a procedure

Procedures can take a long time to execute. A running procedure can be interrupted using a PUT at /v1/procedures/<idp>/runs/<idr>/state, where idp is the procedure id and idr is the run id, with the following payload: { "state": "cancelled" }. Note that some processing is not cancellable, so the procedure may continue running for some time before it is finally interrupted.
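The cancellation call can be sketched in Python as follows, again using only the standard library. The host and the two ids are hypothetical placeholders; the path and payload are the ones given above. Percent-encoding the ids is a defensive choice made for this sketch, not something the API is known to require.

```python
import json
import urllib.parse
import urllib.request


def build_cancel_request(base_url, procedure_id, run_id):
    """Build (but do not send) the PUT request that cancels a run."""
    # Percent-encode the ids so characters such as ':' are safe in the path.
    idp = urllib.parse.quote(procedure_id, safe="")
    idr = urllib.parse.quote(run_id, safe="")
    url = f"{base_url.rstrip('/')}/v1/procedures/{idp}/runs/{idr}/state"
    body = json.dumps({"state": "cancelled"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical host, procedure id and run id, for illustration only.
cancel = build_cancel_request("http://localhost:8080", "my_procedure", "my_run")
print(cancel.get_method(), cancel.full_url)

# To send it to a live server:
# with urllib.request.urlopen(cancel) as response:
#     print(response.status)
```

Because some processing is not cancellable, a caller would typically keep checking the procedure's progress after sending this request instead of assuming the run stopped immediately.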

Available Procedure Types

Procedures are created via a REST API call with one of the following types:

Type | Description | Doc
classifier.experiment | Train and test a classifier | [doc]
classifier.test | Calculate the accuracy of a classifier on held-out data | [doc]
classifier.train | Train a supervised classifier | [doc]
export.csv | Exports a dataset to a target location as a CSV | [doc]
import.git | Import a Git repository's metadata into MLDB | [doc]
import.json | Import a text file with one JSON per line into MLDB | [doc]
import.sentiwordnet | Import a SentiWordNet file into MLDB | [doc]
import.text | Import from a text file, line by line | [doc]
import.word2vec | Import a word2vec file into MLDB | [doc]
kmeans.train | Simple clustering algorithm based on cluster centroids in embedding space | [doc]
melt | Performs a melt operation on a dataset | [doc]
mongodb.import | Import a dataset from MongoDB | [doc]
permuter.run | Run a child procedure with permutations of its configuration | [doc]
probabilizer.train | Trains a model to calibrate a score into a probability | [doc]
randomforest.binary.train | Train a supervised binary random forest | [doc]
statsTable.bagOfWords.train | Create statistical tables of trials against outcomes for bag of words | [doc]
statsTable.train | Create statistical tables of trials against outcomes | [doc]
summary.statistics | Creates a dataset with summary statistics for each column of an input dataset | [doc]
svd.train | Train an SVD to convert rows or columns to embedding coordinates | [doc]
tfidf.train | Prepare data for a TF-IDF function | [doc]
transform | Apply an SQL expression over a dataset to transform into another dataset | [doc]
tsne.train | Project a high dimensional space into a low-dimensional space suitable for visualization | [doc]
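As a hedged sketch of how a type from the table might be combined with parameters and the runOnCreation flag, the Python snippet below assembles a creation payload. The top-level "type" and "params" keys are an assumption about the request body, the helper name make_procedure_config is invented for this example, and the inputData parameter and its query are illustrative placeholders; consult the documentation of the chosen procedure type for its real parameters.

```python
import json

# Type names copied from the table above.
PROCEDURE_TYPES = {
    "classifier.experiment", "classifier.test", "classifier.train",
    "export.csv", "import.git", "import.json", "import.sentiwordnet",
    "import.text", "import.word2vec", "kmeans.train", "melt",
    "mongodb.import", "permuter.run", "probabilizer.train",
    "randomforest.binary.train", "statsTable.bagOfWords.train",
    "statsTable.train", "summary.statistics", "svd.train", "tfidf.train",
    "transform", "tsne.train",
}


def make_procedure_config(procedure_type, params, run_on_creation=False):
    """Assemble a creation payload (assumed layout: "type" plus "params")."""
    if procedure_type not in PROCEDURE_TYPES:
        raise ValueError(f"unknown procedure type: {procedure_type!r}")
    config_params = dict(params)  # copy so the caller's dict is untouched
    if run_on_creation:
        # The flag lives in the parameters, as described under
        # "Running a Procedure".
        config_params["runOnCreation"] = True
    return {"type": procedure_type, "params": config_params}


# Illustrative parameters only; not taken from the transform documentation.
config = make_procedure_config(
    "transform",
    {"inputData": "SELECT * FROM my_dataset"},
    run_on_creation=True,
)
print(json.dumps(config, indent=4))
```

Validating the type name locally catches typos before the request is sent; the server remains the authority on which types are actually available.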

See also