Data persistence varies from one dataset type to another. To begin with, not all dataset types are persistable (e.g. the sqliteSparse dataset type is, but the embedding dataset type is not), or even writable/mutable (e.g. the sparse.mutable dataset type is, but the embedding dataset type is not).
In general, however, dataset types that can be persisted take a dataFileUrl (or equivalent) parameter, which specifies the URL from which data is read or to which it is written.
The persistence characteristics of such a dataset type therefore depend on the underlying protocol: for example, data loaded from file:// can be memory-mapped directly, but data loaded from s3:// cannot. Note that file:// storage is volatile unless the MLDB docker container is booted with an mldb_data persistent directory mapped to a directory in the host filesystem! See Running MLDB for more details.
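As a minimal sketch of the point above, the following Python helpers build a dataset configuration carrying a dataFileUrl and report whether the URL's protocol allows memory-mapping. The function names, the file path and the bucket name are hypothetical, and the `{"type": ..., "params": ...}` layout is an assumption about the configuration format; only the file:// and s3:// behaviour is taken from the text, so any other scheme is reported as unknown.

```python
from urllib.parse import urlparse


def dataset_config(dataset_type, data_file_url):
    """Build a dataset configuration dict (assumed layout: type + params)."""
    return {"type": dataset_type, "params": {"dataFileUrl": data_file_url}}


def can_memory_map(data_file_url):
    """Return True for file://, False for s3://, None for any other scheme.

    Only file:// and s3:// are described in the text; other protocols are
    deliberately left undecided rather than guessed.
    """
    scheme = urlparse(data_file_url).scheme
    if scheme == "file":
        return True
    if scheme == "s3":
        return False
    return None


# Hypothetical URLs, for illustration only.
local_cfg = dataset_config("sqliteSparse", "file://mldb_data/example.sqlite")
print(local_cfg)
print(can_memory_map(local_cfg["params"]["dataFileUrl"]))
print(can_memory_map("s3://example-bucket/example.dataset"))
```

Keeping the scheme check separate from the configuration builder makes it easy to warn about a file:// URL before a dataset is created in a container that has no mldb_data directory mapped.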
By default, MLDB stores a copy of all entity configurations, including dataset configurations, in the
mldb_data directory of the docker container (see Running MLDB). Upon (re)boot, MLDB will attempt to reload all of the entity configurations it can find, including the loading of datasets from their URLs.
This means that so long as an mldb_data directory is mapped to a directory in the host filesystem, MLDB will generally be able to reload its (persistable/persisted) state safely. Reloading is more robust still if data is persisted to something like S3 instead of the local filesystem, since the data then does not depend on the host's disk. Persisting data to S3 also enables the use of multiple MLDB instances, which can all load data (datasets, model artifacts) from the same URLs on S3.
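The multi-instance setup described above can be sketched as follows: the same dataset configuration, pointing at one S3 URL, is prepared for several MLDB hosts. This is an illustration under stated assumptions rather than a reference client. The host names, dataset id, bucket and the "beh" dataset type are hypothetical placeholders, and the `PUT /v1/datasets/<id>` route is assumed to be how a dataset is created on each instance. The requests are only constructed here, not sent.

```python
import json
import urllib.request


def build_load_requests(hosts, dataset_id, dataset_type, data_file_url):
    """Prepare one PUT request per host, all sharing the same dataFileUrl."""
    body = json.dumps(
        {"type": dataset_type, "params": {"dataFileUrl": data_file_url}}
    ).encode("utf-8")
    return [
        urllib.request.Request(
            f"{host}/v1/datasets/{dataset_id}",  # assumed REST route
            data=body,
            method="PUT",
            headers={"Content-Type": "application/json"},
        )
        for host in hosts
    ]


# Hypothetical hosts and URL; substitute a persistable dataset type
# that your MLDB installation actually supports.
requests_to_send = build_load_requests(
    ["http://mldb-a:8080", "http://mldb-b:8080"],
    "shared_dataset",
    "beh",
    "s3://example-bucket/shared_dataset.beh",
)
for req in requests_to_send:
    print(req.get_method(), req.full_url)
```

Each prepared request could then be sent with `urllib.request.urlopen(req)`. Because every instance reads from the same URL, no instance depends on another instance's local mldb_data directory for the data itself.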