# Importing Text

This procedure imports data from text files, with each line in the file corresponding to a row in the output dataset. Delimited file formats such as Comma-Separated Values (CSV) or Tab-Separated Values (TSV) are supported by configuring the delimiter and quote characters.

## Configuration

A new procedure of type `import.text` named `<id>` can be created as follows:

```python
mldb.put("/v1/procedures/"+<id>, {
    "type": "import.text",
    "params": {
        "dataFileUrl": <Url>,
        "outputDataset": <OutputDatasetSpec>,
        "headers": <ARRAY [ string ]>,
        "quoteChar": <string>,
        "delimiter": <string>,
        "limit": <int>,
        "offset": <int>,
        "encoding": <string>,
        "ignoreBadLines": <bool>,
        "structuredColumnNames": <bool>,
        "replaceInvalidCharactersWith": <string>,
        "select": <SqlSelectExpression>,
        "where": <string>,
        "named": <string>,
        "timestamp": <string>,
        "allowMultiLines": <bool>,
        "autoGenerateHeaders": <bool>,
        "ignoreExtraColumns": <bool>,
        "runOnCreation": <bool>
    }
})
```

with the following key-value definitions for params:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `dataFileUrl` | Url | | URL of the text data to import |
| `outputDataset` | OutputDatasetSpec | `{"type":"tabular"}` | Dataset to record the data into. |
| `headers` | ARRAY [ string ] | `[]` | List of column names to use when the first row of the file does not contain headers. |
| `quoteChar` | string | `"\""` | Character to enclose strings |
| `delimiter` | string | `","` | Delimiter for column separation |
| `limit` | int | `0` | Maximum number of lines to process. Bad lines, including empty lines, contribute to the limit. As a result, the dataset may contain fewer rows than the requested limit. |
| `offset` | int | `0` | Skip the first n lines (excluding the header if present). |
| `encoding` | string | `"utf-8"` | Character encoding of the file: 'us-ascii', 'ascii', 'latin1', 'iso8859-1', 'utf8' or 'utf-8' |
| `ignoreBadLines` | bool | `false` | If true, any line causing a parsing error will be skipped. Empty lines are considered bad lines. |
| `structuredColumnNames` | bool | `false` | If true, column names that look like `x.y` will import like `{ x: { y: ... } }` instead of `{ "x.y": ... }`; in other words, structure will be preserved. The flip side is that quotes will need to be doubled and any name that includes a period will need to be quoted. The default is to not preserve structure. |
| `replaceInvalidCharactersWith` | string | | If set, this should be a single Unicode character that will be used to replace badly-encoded characters in the input. The default is nothing, which causes lines with badly-encoded characters to throw an error. |
| `select` | SqlSelectExpression | `"*"` | Which columns to use. |
| `where` | string | `"true"` | Which lines to use to create rows. |
| `named` | string | `"lineNumber()"` | Row name expression for the output dataset. Note that each row must have a unique name. |
| `timestamp` | string | `"fileTimestamp()"` | Expression for the row timestamp. |
| `allowMultiLines` | bool | `false` | Allows columns with multi-line quoted strings. This option disables many optimizations and makes the procedure run much slower; only use it if necessary. The offset parameter will not be reliable when this is activated. |
| `autoGenerateHeaders` | bool | `false` | If true, the indexes of the columns will be used to name them. This cannot be set to true if headers is defined. |
| `ignoreExtraColumns` | bool | `false` | Ignore extra columns that weren't in the header. This allows files that have optional trailing columns that aren't listed in the header, or files with a partially fixed, partially variable column set, to be imported. |
| `runOnCreation` | bool | `true` | If true, the procedure will be run immediately. The response will contain an extra field called `firstRun` pointing to the URL of the run. |
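As a concrete illustration, a minimal configuration for importing a CSV file might look as follows. This is a sketch only: the file URL, procedure name, and dataset id are assumptions for the example, not values from this documentation.

```python
# Sketch of an import.text configuration for a hypothetical CSV file.
# "file:///data/prices.csv" and the dataset id "prices" are illustrative only.
config = {
    "type": "import.text",
    "params": {
        "dataFileUrl": "file:///data/prices.csv",
        "outputDataset": {"id": "prices", "type": "tabular"},
        "delimiter": ",",
        "quoteChar": '"',
        "encoding": "utf-8",
        "named": "lineNumber()",   # the default; gives each line a unique row name
        "runOnCreation": True,
    },
}

# Against a running MLDB instance this would be registered with:
# mldb.put("/v1/procedures/import_prices", config)
```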

## Functions available when creating rows

The following functions are available in the select, named, where and timestamp expressions:

• lineNumber(): returns the line number in the file
• rowHash(): returns the internal hash value of the current row, useful for random sampling
• fileTimestamp(): returns the timestamp (last modified time) of the file
• dataFileUrl(): returns the URL of the file, from the configuration
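For instance, `rowHash()` can drive random sampling from the `where` expression. The sketch below keeps roughly one line in ten; the modulus idiom and the file URL are assumptions for illustration, not prescribed usage.

```python
# Hypothetical params block importing roughly 10% of the file's lines.
sample_params = {
    "dataFileUrl": "file:///data/prices.csv",  # illustrative URL
    "outputDataset": {"id": "prices_sample", "type": "tabular"},
    "where": "rowHash() % 10 = 0",   # keep roughly one row in ten
    "named": "lineNumber()",         # line numbers stay unique across the sample
}
```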

## Notes

• The named clause must result in unique row names. If the row names are not unique, the dataset will fail to be created when being indexed. The default named expression, which is lineNumber(), will result in each line having a unique name.
• The number of rows skipped (due to a parsing error) will be returned in the numLineErrors field of the dataset status.
• The column used for the row name will not be automatically removed from the dataset. If it needs to be removed, it should be done from the select using the excluding (colName) syntax.
• Columns can be renamed using the select statement. For example, to add the prefix xyz. to each field, use * AS xyz.* in the select parameter.
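The last two notes can be sketched as `select` expressions. The column name `id` and the prefix `xyz.` are placeholders for illustration:

```python
# Drop the column that was used to build the row name.
drop_row_name_column = "* EXCLUDING (id)"

# Prefix every imported column with "xyz.".
prefix_all_columns = "* AS xyz.*"
```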