This procedure is used to import data from text files with each line in the file corresponding to a row in the output dataset. Delimited file formats such as Comma-Separated Values (CSV) or Tab-Separated Values (TSV) are supported by configuring delimiters and quote characters.
A new procedure of type `import.text` named `<id>` can be created as follows:
mldb.put("/v1/procedures/"+<id>, {
"type": "import.text",
"params": {
"dataFileUrl": <Url>,
"outputDataset": <OutputDatasetSpec>,
"headers": <ARRAY [ string ]>,
"quoteChar": <string>,
"delimiter": <string>,
"limit": <int>,
"offset": <int>,
"encoding": <string>,
"ignoreBadLines": <bool>,
"structuredColumnNames": <bool>,
"replaceInvalidCharactersWith": <string>,
"select": <SqlSelectExpression>,
"where": <string>,
"named": <string>,
"timestamp": <string>,
"allowMultiLines": <bool>,
"autoGenerateHeaders": <bool>,
"ignoreExtraColumns": <bool>,
"runOnCreation": <bool>
}
})
with the following key-value definitions for `params`:
| Field, Type | Description |
| --- | --- |
| `dataFileUrl` (Url) | URL of the text data to import |
| `outputDataset` (OutputDatasetSpec) | Dataset to record the data into |
| `headers` (ARRAY[string]) | List of headers to use when the first row doesn't contain headers |
| `quoteChar` (string) | Character used to enclose strings |
| `delimiter` (string) | Delimiter used to separate columns |
| `limit` (int) | Maximum number of lines to process. Bad lines, including empty lines, count towards the limit, so the dataset may contain fewer rows than the requested limit. |
| `offset` (int) | Skip the first n lines (excluding the header line, if present) |
| `encoding` (string) | Character encoding of the file: 'us-ascii', 'ascii', 'latin1', 'iso8859-1', 'utf8' or 'utf-8' |
| `ignoreBadLines` (bool) | If true, any line causing a parsing error will be skipped. Empty lines are considered bad lines. |
| `structuredColumnNames` (bool) | If true, column names that look like 'x.y' will import like `{x: {y: ...}}` instead of `{"x.y": ...}`; in other words, structure will be preserved |
| `replaceInvalidCharactersWith` (string) | If set, this should be a single Unicode character that will be used to replace badly-encoded characters in the input. The default is unset, which causes lines with badly-encoded characters to throw an error. |
| `select` (SqlSelectExpression) | Which columns to use |
| `where` (string) | Which lines to use to create rows |
| `named` (string) | Row name expression for the output dataset. Note that each row must have a unique name. |
| `timestamp` (string) | Expression for the row timestamp |
| `allowMultiLines` (bool) | Allows columns with multi-line quoted strings. This option disables many optimizations and makes the procedure run much slower; only use it if necessary. |
| `autoGenerateHeaders` (bool) | If true, the indexes of the columns will be used to name them. This cannot be set to true if `headers` is defined. |
| `ignoreExtraColumns` (bool) | Ignore extra columns that weren't in the header. This allows files that have optional trailing columns not listed in the header, or files with a partially fixed, partially variable column set, to be imported. |
| `runOnCreation` (bool) | If true, the procedure will be run immediately. The response will contain an extra field called `firstRun` describing that initial run. |
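As an illustration, here is a minimal sketch of a CSV import using the parameters above; the procedure id, the file URL and the output dataset name are hypothetical:

```python
mldb.put("/v1/procedures/import_example", {
    "type": "import.text",
    "params": {
        # hypothetical file URL; any URL scheme supported by MLDB can be used
        "dataFileUrl": "file:///path/to/example.csv",
        "outputDataset": "example_data",
        "delimiter": ",",
        "quoteChar": "\"",
        # skip unparseable lines instead of aborting the whole import
        "ignoreBadLines": True,
        # run the import as soon as the procedure is created
        "runOnCreation": True
    }
})
```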
The following functions are available in the `select`, `named`, `where` and `timestamp` expressions:
* `lineNumber()`: returns the line number in the file
* `rowHash()`: returns the internal hash value of the current row, useful for random sampling
* `fileTimestamp()`: returns the timestamp (last modified time) of the file
* `dataFileUrl()`: returns the URL of the file, from the configuration

The `named` clause must result in unique row names. If the row names are not unique, the dataset will fail to be created when being indexed. The default `named` expression, which is `lineNumber()`, will result in each line having a unique name.
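For example, these functions can be combined to import a rough 10% sample of a file, timestamped with the file's last modified time. This is a sketch with hypothetical names, assuming the `%` modulo operator is available in the `where` expression:

```python
mldb.put("/v1/procedures/import_sample", {
    "type": "import.text",
    "params": {
        "dataFileUrl": "file:///path/to/example.csv",   # hypothetical URL
        "outputDataset": "example_sample",
        # keep roughly one line in ten, chosen by the row's hash
        "where": "rowHash() % 10 = 0",
        # stamp every row with the file's last modified time
        "timestamp": "fileTimestamp()",
        # the default named expression, lineNumber(), already yields unique names
        "runOnCreation": True
    }
})
```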
When bad lines are skipped via `ignoreBadLines`, the number of skipped lines is recorded in the `numLineErrors` field of the dataset status.

Columns can be excluded from the import with the `excluding (colName)` syntax, and a prefix `xyz.` can be added to each field by using `* AS xyz.*` in the `select` parameter, as sketched below.

The `import.json` procedure type can be used to import a text file with one JSON object per line.
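To illustrate the `select` options mentioned above, here is a sketch that prefixes every imported column; the procedure id, dataset name and column name are hypothetical:

```python
mldb.put("/v1/procedures/import_prefixed", {
    "type": "import.text",
    "params": {
        "dataFileUrl": "file:///path/to/example.csv",   # hypothetical URL
        "outputDataset": "example_prefixed",
        # prefix every imported column with 'xyz.'
        # alternatively, "* EXCLUDING (comment)" would drop the 'comment' column
        "select": "* AS xyz.*",
        "runOnCreation": True
    }
})
```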