Sampled Dataset

The sampled dataset type allows sampling of another dataset. The sampling operation is virtual: no copy of the initial dataset is made.
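As a conceptual illustration only, the idea of a virtual sample can be sketched in plain Python as a view that stores sampled row positions and reads through to the original rows. This is not MLDB's implementation. The `SampledView` class and its argument names are hypothetical, chosen to mirror the configuration fields described in this section.

```python
import random


class SampledView:
    """Hypothetical read-only view: keeps sampled positions, not row data."""

    def __init__(self, base_rows, rows, with_replacement=False, seed=None):
        rng = random.Random(seed)  # a fixed seed gives a reproducible sample
        positions = range(len(base_rows))
        if with_replacement:
            # The same position may be drawn more than once.
            self._positions = rng.choices(positions, k=rows)
        else:
            # Each position is drawn at most once; raises ValueError
            # if rows exceeds the number of base rows.
            self._positions = rng.sample(positions, k=rows)
        self._base = base_rows

    def __len__(self):
        return len(self._positions)

    def __getitem__(self, i):
        # Rows are looked up in the base data on access, never copied.
        return self._base[self._positions[i]]


base = [{"id": n} for n in range(1000)]
view = SampledView(base, rows=10, seed=1)
same = SampledView(base, rows=10, seed=1)
```

Here `view` holds ten integers rather than ten row copies, and `same` selects the same rows as `view` because both use the same seed.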

Configuration

A new dataset of type sampled named <id> can be created as follows:

mldb.put("/v1/datasets/"+<id>, {
    "type": "sampled",
    "params": {
        "rows": <int>,
        "fraction": <float>,
        "withReplacement": <bool>,
        "dataset": <SqlFromExpression>,
        "seed": <int>
    }
})

with the following key-value definitions for params:

Field, Type, Default, Description

rows
int
0

Number of rows to sample from dataset. Cannot be used with fraction. Cannot be higher than the number of rows in dataset unless withReplacement is true. If rows is left at 0 and fraction is also 0, one row is sampled.

fraction
float
0.0

Fraction of rows to sample from dataset. Cannot be used when rows != 0. Value should be between 0 and 1.

withReplacement
bool
false

Sample with or without replacement. Sampling with replacement means that the same input row can appear in the output more than once.

dataset
SqlFromExpression

Dataset to sample

seed
int
0

Seed value for the random number generator, which makes random samples reproducible. This parameter is optional; if it is not set, a seed is selected randomly for each sample.
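The constraints in the table above can be checked before the request is sent. The following is a minimal sketch, not part of MLDB: `sampled_dataset_config` is a hypothetical helper, the dataset names `transactions` and `transactions_sample` are made up, and passing a plain dataset name as the SqlFromExpression is an assumption.

```python
def sampled_dataset_config(dataset, rows=0, fraction=0.0,
                           with_replacement=False, seed=None):
    """Hypothetical helper: build the body for a dataset of type "sampled"."""
    if rows and fraction:
        raise ValueError("rows and fraction cannot be used together")
    if rows < 0:
        raise ValueError("rows cannot be negative")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction should be between 0 and 1")

    params = {"dataset": dataset, "withReplacement": with_replacement}
    # Only send the fields that were explicitly set, so that the
    # documented defaults apply to everything else.
    if rows:
        params["rows"] = rows
    if fraction:
        params["fraction"] = fraction
    if seed is not None:
        params["seed"] = seed
    return {"type": "sampled", "params": params}


# Sample 10% of a hypothetical "transactions" dataset, reproducibly.
config = sampled_dataset_config("transactions", fraction=0.1, seed=42)

# With an MLDB connection available, the config would be submitted as:
# mldb.put("/v1/datasets/transactions_sample", config)
```

The helper cannot check that rows is no higher than the number of rows in the source dataset, since that size is only known to the server.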

See also