This procedure is used to import data from text files with each line in the file corresponding to a row in the output dataset. Delimited file formats such as Comma-Separated Values (CSV) or Tab-Separated Values (TSV) are supported by configuring delimiters and quote characters.
A new procedure of type `import.text` named `<id>` can be created as follows:
mldb.put("/v1/procedures/"+<id>, {
"type": "import.text",
"params": {
"dataFileUrl": <Url>,
"outputDataset": <OutputDatasetSpec>,
"headers": <ARRAY [ string ]>,
"quoteChar": <string>,
"delimiter": <string>,
"limit": <int>,
"offset": <int>,
"encoding": <string>,
"ignoreBadLines": <bool>,
"structuredColumnNames": <bool>,
"replaceInvalidCharactersWith": <string>,
"select": <SqlSelectExpression>,
"where": <string>,
"named": <string>,
"timestamp": <string>,
"allowMultiLines": <bool>,
"autoGenerateHeaders": <bool>,
"ignoreExtraColumns": <bool>,
"runOnCreation": <bool>
}
})
with the following key-value definitions for `params`:
| Field, Type | Description |
| --- | --- |
| `dataFileUrl` (Url) | URL of the text data to import |
| `outputDataset` (OutputDatasetSpec) | Dataset to record the data into |
| `headers` (ARRAY[string]) | List of headers to use when the first row doesn't contain headers |
| `quoteChar` (string) | Character used to enclose strings |
| `delimiter` (string) | Delimiter used to separate columns |
| `limit` (int) | Maximum number of lines to process. Bad lines, including empty lines, count towards the limit, so the dataset may contain fewer rows than the requested limit. |
| `offset` (int) | Skip the first n lines (excluding the header line, if present) |
| `encoding` (string) | Character encoding of the file: 'us-ascii', 'ascii', 'latin1', 'iso8859-1', 'utf8' or 'utf-8' |
| `ignoreBadLines` (bool) | If true, any line causing a parsing error will be skipped. Empty lines are considered bad lines. |
| `structuredColumnNames` (bool) | If true, column names that look like 'x.y' will import like `{x: {y: ...}}` instead of `{"x.y": ...}`; in other words, structure will be preserved |
| `replaceInvalidCharactersWith` (string) | If set, this should be a single Unicode character that will be used to replace badly-encoded characters in the input. The default is unset, which causes lines with badly-encoded characters to throw an error. |
| `select` (SqlSelectExpression) | Which columns to use |
| `where` (string) | Which lines to use to create rows |
| `named` (string) | Row name expression for the output dataset. Note that each row must have a unique name. |
| `timestamp` (string) | Expression for the row timestamp |
| `allowMultiLines` (bool) | Allows columns with multi-line quoted strings. This option disables many optimizations and makes the procedure run much slower; only use it if necessary. |
| `autoGenerateHeaders` (bool) | If true, the indexes of the columns will be used to name them. This cannot be set to true if `headers` is defined. |
| `ignoreExtraColumns` (bool) | Ignore extra columns that weren't in the header. This allows files that have optional trailing columns not listed in the header, or files with a partially fixed, partially variable column set, to be imported. |
| `runOnCreation` (bool) | If true, the procedure will be run immediately. The response will contain an extra field called `firstRun` describing that initial run. |
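As an illustration, here is a minimal sketch of a CSV import using the parameters above; the procedure id, the file URL and the output dataset name are hypothetical:

```python
mldb.put("/v1/procedures/import_example", {
    "type": "import.text",
    "params": {
        # hypothetical file URL; any URL scheme supported by MLDB can be used
        "dataFileUrl": "file:///path/to/example.csv",
        "outputDataset": "example_data",
        "delimiter": ",",
        "quoteChar": "\"",
        # skip unparseable lines instead of aborting the whole import
        "ignoreBadLines": True,
        # run the import as soon as the procedure is created
        "runOnCreation": True
    }
})
```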
The following functions are available in the `select`, `named`, `where` and `timestamp` expressions:
* `lineNumber()`: returns the line number in the file
* `rowHash()`: returns the internal hash value of the current row, useful for random sampling
* `fileTimestamp()`: returns the timestamp (last modified time) of the file
* `dataFileUrl()`: returns the URL of the file, from the configuration

The `named` clause must result in unique row names. If the row names are not unique, the dataset will fail to be created when being indexed. The default `named` expression, which is `lineNumber()`, will result in each line having a unique name.
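For example, these functions can be combined to import a rough 10% sample of a file, timestamped with the file's last modified time. This is a sketch with hypothetical names, assuming the `%` modulo operator is available in the `where` expression:

```python
mldb.put("/v1/procedures/import_sample", {
    "type": "import.text",
    "params": {
        "dataFileUrl": "file:///path/to/example.csv",   # hypothetical URL
        "outputDataset": "example_sample",
        # keep roughly one line in ten, chosen by the row's hash
        "where": "rowHash() % 10 = 0",
        # stamp every row with the file's last modified time
        "timestamp": "fileTimestamp()",
        # the default named expression, lineNumber(), already yields unique names
        "runOnCreation": True
    }
})
```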
When bad lines are skipped via `ignoreBadLines`, the number of skipped lines is recorded in the `numLineErrors` field of the dataset status.

Columns can be excluded from the import with the `excluding (colName)` syntax, and a prefix `xyz.` can be added to each field by using `* AS xyz.*` in the `select` parameter, as sketched below.

The `import.json` procedure type can be used to import a text file with one JSON object per line.
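To illustrate the `select` options mentioned above, here is a sketch that prefixes every imported column; the procedure id, dataset name and column name are hypothetical:

```python
mldb.put("/v1/procedures/import_prefixed", {
    "type": "import.text",
    "params": {
        "dataFileUrl": "file:///path/to/example.csv",   # hypothetical URL
        "outputDataset": "example_prefixed",
        # prefix every imported column with 'xyz.'
        # alternatively, "* EXCLUDING (comment)" would drop the 'comment' column
        "select": "* AS xyz.*",
        "runOnCreation": True
    }
})
```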