Git importer procedure

This procedure imports commit metadata from a local Git repository into a dataset.

The imported data can then be used for exploration and prediction over Git repositories.

Configuration

A new procedure of type import.git named <id> can be created as follows:

mldb.put("/v1/procedures/"+<id>, {
    "type": "import.git",
    "params": {
        "repository": <Url>,
        "outputDataset": <OutputDatasetSpec>,
        "revisions": <ARRAY [ string ]>,
        "importStats": <bool>,
        "importTree": <bool>,
        "ignoreUnknownEncodings": <bool>,
        "runOnCreation": <bool>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

repository
Url

Git repository to load from. This is currently limited to file:// urls which point to an already cloned repository on local disk. Remote repositories will need to be checked out beforehand using the git command line tools.

outputDataset
OutputDatasetSpec
{"type":"sparse.mutable"}

Output dataset for result. One row will be produced per commit. See the documentation for the output format.

revisions
ARRAY [ string ]
["HEAD"]

Revisions to load from Git (e.g., HEAD, HEAD~20..HEAD, tags/*). See the gitrevisions(7) documentation. The default is all revisions reachable from HEAD.

importStats
bool
false

If true, then import the stats (number of files changed, lines added and lines deleted)

importTree
bool
false

If true, then import the tree (names of files changed)

ignoreUnknownEncodings
bool
false

If true, ignore commit messages with unknown encodings (supported are ISO-8859-1 and UTF-8) and replace them with a placeholder. If false, messages with unknown encodings will cause the commit to abort.

runOnCreation
bool
true

If true, the procedure will be run immediately. The response will contain an extra field called firstRun pointing to the URL of the run.

The repository parameter should point to a local directory that contains a .git subdirectory with the Git metadata. The local directory does not need to be checked out; in other words, a bare repository is acceptable.
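As a minimal sketch of preparing this parameter, the Python helper below turns a local path into a file:// URL and checks that the path looks like a Git repository. The name repo_url and the bare-repository heuristic (a top-level HEAD file and objects directory) are assumptions made for illustration; they are not part of MLDB.

```python
from pathlib import Path


def repo_url(path):
    """Return a file:// URL for a local Git repository (hypothetical helper)."""
    repo = Path(path).resolve()
    # A working copy has a .git entry; a bare repository keeps HEAD and
    # objects/ at its top level. This heuristic is an assumption.
    is_worktree = (repo / ".git").exists()
    is_bare = (repo / "HEAD").is_file() and (repo / "objects").is_dir()
    if not (is_worktree or is_bare):
        raise ValueError(f"{repo} does not look like a Git repository")
    return repo.as_uri()
```

The returned string can be used as the value of the repository field. Checking the path up front is only a convenience: it surfaces a wrong path before the procedure is created.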

Each entry of the revisions parameter is a Git revision specification. For example, it can be HEAD or /tags/* or /revs/*. See the gitrevisions documentation for more details on how a revspec is specified.
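A minimal Python sketch of assembling a configuration with an explicit revision range follows. The helper name git_import_config is hypothetical; the field names are those listed in the Configuration section, and the revision range HEAD~20..HEAD is one of the examples given there.

```python
def git_import_config(repository, output_dataset, revisions=("HEAD",),
                      import_stats=False, import_tree=False):
    """Build the payload for an import.git procedure (hypothetical helper)."""
    if not repository.startswith("file://"):
        # Only file:// URLs to an already cloned repository are supported.
        raise ValueError("repository must be a file:// URL")
    return {
        "type": "import.git",
        "params": {
            "repository": repository,
            "outputDataset": output_dataset,
            "revisions": list(revisions),
            "importStats": import_stats,
            "importTree": import_tree,
            "runOnCreation": True,
        },
    }


# Import only the last 20 commits reachable from HEAD, with stats.
config = git_import_config("file:///mldb_data/docker", "git",
                           revisions=["HEAD~20..HEAD"], import_stats=True)
# Inside MLDB this would be submitted with:
#   mldb.put("/v1/procedures/gitimport", config)
```

Building the payload in one place keeps the defaults visible and rejects non-file URLs before anything is sent to MLDB.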

Output format

Each row will contain the following fixed columns:

In addition, if the importStats option is true, then non-merge commits will have the following columns:

In addition, if the importTree option is true, then non-merge commits will have the following columns for each file that was modified by the commit:

Example

Assuming the Docker Git repository has been cloned from https://github.com/docker/docker.git into /mldb_data/docker, the following procedure can be posted and run to load it into a dataset:

mldb.put("/v1/procedures/gitimport", {
    "type": "import.git",
    "params": {
        "repository": "file:///mldb_data/docker",
        "outputDataset": "git",
        "importStats": true,
        "runOnCreation": true
    }
})

This can then be queried for the top 5 contributors in terms of commits:

SELECT count(*) AS cnt,
       author,
       min(when({*})) AS earliest,
       max(when({*})) AS latest,
       sum(filesChanged) AS changes,
       sum(insertions) AS insertions,
       sum(deletions) AS deletions
FROM git
GROUP BY author
ORDER BY cnt DESC
LIMIT 5

which yields

cnt author earliest latest changes insertions deletions
1844 "Michael Crosby" "2013-06-03T21:39:00Z" "2015-11-18T11:08:13Z" 3239 71376 103630
1483 "Victor Vieux" "2013-04-10T19:30:57Z" "2015-10-14T17:46:59Z" 2024 62727 72644
1248 "Solomon Hykes" "2013-01-19T16:07:19Z" "2015-11-05T15:22:37Z" 2144 48645 127953
744 "Tianon Gravi" "2013-04-21T19:19:38Z" "2015-11-17T09:44:36Z" 1925 144679 33214
632 "Jessie Frazelle" "2014-09-02T15:18:32Z" "2015-09-04T13:38:41Z" 11 197 186
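To make the aggregation concrete, here is a rough pure-Python equivalent of the GROUP BY in the query above, run on a few invented rows. The sample rows and author names are made up for illustration and are not taken from the Docker repository; only the column names author, filesChanged, insertions and deletions come from the query. The earliest and latest timestamps are left out, and treating missing stats as zero is an assumption of this sketch.

```python
from collections import defaultdict

# Invented sample rows for illustration; real rows come from the "git" dataset.
rows = [
    {"author": "alice", "filesChanged": 2, "insertions": 10, "deletions": 1},
    {"author": "bob", "filesChanged": 1, "insertions": 5, "deletions": 0},
    # A merge commit carries no stats columns.
    {"author": "alice"},
    {"author": "alice", "filesChanged": 3, "insertions": 7, "deletions": 4},
]


def top_contributors(rows, limit=5):
    """Count commits and sum stats per author, most commits first."""
    totals = defaultdict(lambda: {"cnt": 0, "changes": 0,
                                  "insertions": 0, "deletions": 0})
    for row in rows:
        entry = totals[row["author"]]
        entry["cnt"] += 1
        # Missing stats are treated as 0 in this sketch.
        entry["changes"] += row.get("filesChanged", 0)
        entry["insertions"] += row.get("insertions", 0)
        entry["deletions"] += row.get("deletions", 0)
    ranked = sorted(totals.items(), key=lambda item: item[1]["cnt"],
                    reverse=True)
    return [{"author": author, **entry} for author, entry in ranked[:limit]]


top = top_contributors(rows)
```

Each element of top corresponds to one output row of the query, with cnt counting all commits by that author, including merge commits.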