# Importing Text

This procedure imports data from text files, with each line in the file corresponding to a row in the output dataset. Delimited file formats such as Comma-Separated Values (CSV) or Tab-Separated Values (TSV) are supported by configuring the delimiter and quote characters.

## Configuration

A new procedure of type `import.text` named `<id>` can be created as follows:

```python
mldb.put("/v1/procedures/"+<id>, {
    "type": "import.text",
    "params": {
        "dataFileUrl": <Url>,
        "outputDataset": <OutputDatasetSpec>,
        "headers": <ARRAY [ string ]>,
        "quoteChar": <string>,
        "delimiter": <string>,
        "limit": <int>,
        "offset": <int>,
        "encoding": <string>,
        "ignoreBadLines": <bool>,
        "structuredColumnNames": <bool>,
        "replaceInvalidCharactersWith": <string>,
        "select": <SqlSelectExpression>,
        "where": <string>,
        "named": <string>,
        "timestamp": <string>,
        "allowMultiLines": <bool>,
        "autoGenerateHeaders": <bool>,
        "ignoreExtraColumns": <bool>,
        "runOnCreation": <bool>
    }
})
```

with the following key-value definitions for params:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `dataFileUrl` | Url | | URL of the text data to import |
| `outputDataset` | OutputDatasetSpec | `{"type":"tabular"}` | Dataset to record the data into. |
| `headers` | ARRAY [ string ] | `[]` | List of column names to use when the first row of the file does not contain headers. |
| `quoteChar` | string | `"\""` | Character to enclose strings |
| `delimiter` | string | `","` | Delimiter for column separation |
| `limit` | int | `0` | Maximum number of lines to process. Bad lines, including empty lines, contribute to the limit. As a result, the dataset may contain fewer rows than the requested limit. |
| `offset` | int | `0` | Skip the first n lines (excluding the header if present). |
| `encoding` | string | `"utf-8"` | Character encoding of the file: 'us-ascii', 'ascii', 'latin1', 'iso8859-1', 'utf8' or 'utf-8' |
| `ignoreBadLines` | bool | `false` | If true, any line causing a parsing error will be skipped. Empty lines are considered bad lines. |
| `structuredColumnNames` | bool | `false` | If true, column names that look like `x.y` will import like `{ x: { y: ... } }` instead of `{ "x.y": ... }`; in other words, structure will be preserved. The flip side is that quotes will need to be doubled and any name that includes a period will need to be quoted. The default is to not preserve structure. |
| `replaceInvalidCharactersWith` | string | | If set, this should be a single Unicode character that will be used to replace badly-encoded characters in the input. The default is nothing, which causes lines with badly-encoded characters to throw an error. |
| `select` | SqlSelectExpression | `"*"` | Which columns to use. |
| `where` | string | `"true"` | Which lines to use to create rows. |
| `named` | string | `"lineNumber()"` | Row name expression for the output dataset. Note that each row must have a unique name. |
| `timestamp` | string | `"fileTimestamp()"` | Expression for the row timestamp. |
| `allowMultiLines` | bool | `false` | Allows columns with multi-line quoted strings. This option disables many optimizations and makes the procedure run much slower; only use it if necessary. The offset parameter will not be reliable when this is activated. |
| `autoGenerateHeaders` | bool | `false` | If true, the indexes of the columns will be used to name them. This cannot be set to true if headers is defined. |
| `ignoreExtraColumns` | bool | `false` | Ignore extra columns that weren't in the header. This allows files that have optional trailing columns that aren't listed in the header, or files with a partially fixed, partially variable column set, to be imported. |
| `runOnCreation` | bool | `true` | If true, the procedure will be run immediately. The response will contain an extra field called `firstRun` pointing to the URL of the run. |
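As a concrete illustration, a minimal configuration for importing a CSV file might look as follows. This is a sketch only: the file URL, procedure name, and dataset id are assumptions for the example, not values from this documentation.

```python
# Sketch of an import.text configuration for a hypothetical CSV file.
# "file:///data/prices.csv" and the dataset id "prices" are illustrative only.
config = {
    "type": "import.text",
    "params": {
        "dataFileUrl": "file:///data/prices.csv",
        "outputDataset": {"id": "prices", "type": "tabular"},
        "delimiter": ",",
        "quoteChar": '"',
        "encoding": "utf-8",
        "named": "lineNumber()",   # the default; gives each line a unique row name
        "runOnCreation": True,
    },
}

# Against a running MLDB instance this would be registered with:
# mldb.put("/v1/procedures/import_prices", config)
```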

## Functions available when creating rows

The following functions are available in the select, named, where and timestamp expressions:

• lineNumber(): returns the line number in the file
• rowHash(): returns the internal hash value of the current row, useful for random sampling
• fileTimestamp(): returns the timestamp (last modified time) of the file
• dataFileUrl(): returns the URL of the file, from the configuration
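For instance, `rowHash()` can drive random sampling from the `where` expression. The sketch below keeps roughly one line in ten; the modulus idiom and the file URL are assumptions for illustration, not prescribed usage.

```python
# Hypothetical params block importing roughly 10% of the file's lines.
sample_params = {
    "dataFileUrl": "file:///data/prices.csv",  # illustrative URL
    "outputDataset": {"id": "prices_sample", "type": "tabular"},
    "where": "rowHash() % 10 = 0",   # keep roughly one row in ten
    "named": "lineNumber()",         # line numbers stay unique across the sample
}
```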

## Notes

• The named clause must result in unique row names. If the row names are not unique, the dataset will fail to be created when being indexed. The default named expression, which is lineNumber(), will result in each line having a unique name.
• The number of rows skipped (due to a parsing error) will be returned in the numLineErrors field of the dataset status.
• The column used for the row name will not be automatically removed from the dataset. If it needs to be removed, it should be done from the select using the excluding (colName) syntax.
• Columns can be renamed using the select statement. For example, to add the prefix xyz. to each field, use * AS xyz.* in the select parameter.
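The last two notes can be sketched as `select` expressions. The column name `id` and the prefix `xyz.` are placeholders for illustration:

```python
# Drop the column that was used to build the row name.
drop_row_name_column = "* EXCLUDING (id)"

# Prefix every imported column with "xyz.".
prefix_all_columns = "* AS xyz.*"
```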