MLDB supports many algorithms to solve supervised learning tasks for both classification and regression. There are two procedures available to train a model:

* The classifier.experiment procedure type performs both training and testing of a classifier.
* The classifier.train procedure type trains a classifier. If testing is required, it needs to be done manually with the classifier.test procedure type.

Both of these procedures share the configuration keys algorithm, configuration, configurationFile and equalizationFactor. This document explains how to use them.
There are three ways of configuring which classifier will be trained:

1. Leave configuration and configurationFile empty, and choose a standard algorithm configuration by name with the algorithm parameter (see below for the contents of the default configurationFile).
2. Fill in the configuration parameter (JSON) and set algorithm to either empty (if the configuration is at the top level) or to the dot-separated path if it's not at the top level. See below for details on specifying your own configuration.
3. Fill in the configurationFile parameter, and set the algorithm as in number 2. See below for details on specifying your own configurationFile.

A configuration JSON object or the contents of a configurationFile looks like this (see below for the contents of the default, overrideable configurationFile):
```
{
    "algorithm_name": {
        "type": "classifier_type",
        "parameter": "value",
        ...
    },
    ...
}
```
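For instance, option 2 above could look like the following fragment of a procedure configuration. This is only a sketch: the key name my_dt is illustrative, and required parameters such as trainingData are omitted.

```
{
    "algorithm": "my_dt",
    "configuration": {
        "my_dt": {
            "_note": "illustrative key name; algorithm holds its dot-separated path",
            "type": "decision_tree",
            "max_depth": 8,
            "update_alg": "prob"
        }
    }
}
```

If the classifier configuration were instead placed directly at the top level of configuration, algorithm would simply be left empty.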
The classifier training procedure supports the following types of classifiers. These tend to be high-performance implementations of well-known algorithms that train and predict quickly, and they are often a good default choice when a generic classification step is required.

The parameters below configure the decision_tree classifier type.
Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true|false|1|0 | false | whether or not to profile |
validate | true|false|1|0 | false | perform expensive internal validation |
trace | 0- | 0 | trace execution of training in a very fine-grained fashion |
max_depth | 0- or -1 | -1 | give maximum tree depth. -1 means go until data separated |
update_alg | normal gentle prob | prob | select the type of output that the tree gives |
random_feature_propn | 0.0-1.0 | 1 | proportion of the features to enable (for random forests) |
The update_alg parameter can take three different values: prob, normal and gentle. Here is how they work, using an example with a leaf node that contains 8 positive and 2 negative labels:

* prob is the proportion of positive classes: \(\#pos/(\#pos + \#neg)\). For our example the output is \(8/10=0.8\).
* normal uses the margin between both probabilities. For our example, 80% positives and 20% negatives give scores of \(0.8 - 0.2 = 0.6\) and \(1 - 0.6 = 0.4\). These scores are fed to a function \(f\) of the exponential family (bounded between -infinity and +infinity), so the output is \(f(0.6) - f(0.4)\).
* gentle also uses the margin, but with a different function \(g\), bounded between -1 and 1. In an ensemble, such as boosting or random forest, it is recommended to use this value. For our example the output is \(g(0.6) - g(0.4)\).

For more details, please refer to Friedman, Hastie, Tibshirani, "Additive Logistic Regression: A Statistical View of Boosting", The Annals of Statistics 2000, Vol. 28, No. 2, 337–407.
The following parameters configure the glz (generalized linear model) classifier type.

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true|false|1|0 | false | whether or not to profile |
validate | true|false|1|0 | false | perform expensive internal validation |
add_bias | true|false|1|0 | true | add a constant bias term to the classifier? |
decode | true|false|1|0 | true | run the decoder (link function) after classification? |
link_function | logit probit comp_log_log linear log | logit | which link function to use for the output function |
regularization | none l1 l2 | l2 | type of regularization on the weights (L1 is slower due to an iterative algorithm) |
regularization_factor | -1 to infinite | 1.0000000000000001e-05 | regularization factor to use. auto-determined if negative (slower). the bigger this value is, the more regularization on the weights |
max_regularization_iteration | 1 to infinite | 1000 | maximum number of iterations for the L1 regularization |
regularization_epsilon | positive number | 0.0001 | smallest weight update before assuming convergence for the L1 iterative algorithm |
normalize | true|false|1|0 | true | normalize features to have zero mean and unit variance for greater numeric stability (slower training but recommended with L1 regularization) |
condition | true|false|1|0 | false | condition features to have no correlation for greater numeric stability (but much slower training) |
feature_proportion | 0 to 1 | 1 | use only a (random) portion of available features when training classifier |
The different options for the link_function
parameter are defined as follows:
Name | Link Function | Activation Function (inverse of the link function) |
---|---|---|
logit | \[g(x)=\ln \left( \frac{x}{1-x} \right) \] | \[g^{-1}(x) = \frac{1}{1 + e^{-x}}\] |
probit | \(g(x)=\Phi^{-1}(x)\) where \(\Phi\) is the normal distribution's CDF | \[g^{-1}(x) = \Phi (x)\] |
comp_log_log | \[g(x)=\ln \left( - \ln \left( 1-x \right) \right)\] | \[g^{-1}(x) = 1 - e^{-e^x}\] |
linear | \[g(x)=x\] | \[g^{-1}(x) = x\] |
log | \[g(x)=\ln x\] | \[g^{-1}(x) = e^x\] |
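As an illustration, a glz configuration suitable for 'regression' mode with a linear link function might be sketched as follows (compare the glz_linear entry in the default configurationFile shown further below); the key name my_glz is illustrative:

```
{
    "my_glz": {
        "_note": "illustrative sketch of a glz configuration",
        "type": "glz",
        "link_function": "linear",
        "regularization": "l2",
        "normalize": true
    }
}
```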
The bagging algorithm, also known as bootstrap aggregating, is used in conjunction with another algorithm, for
instance with a decision tree to create
bagged decision trees. There is an example of this in the default configuration file for the bdt
key.
Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 54 | verbosity of information from training |
profile | true|false|1|0 | false | whether or not to profile |
validate | true|false|1|0 | false | perform expensive internal validation |
num_bags | N>=1 | 10 | number of bags to divide classifier into |
validation_split | 0-1 | 0.349999994 | how much of training data to hold off as validation data |
weak_learner | perceptron, bagging, boosting, naive_bayes, stump, decision_tree, glz, boosted_stumps, null, onevsall, fasttext | | type of the weak learner to bag |
See also : Bagging on Wikipedia.
The following parameters configure the boosting classifier type.

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true|false|1|0 | false | whether or not to profile |
validate | true|false|1|0 | false | perform expensive internal validation |
validation_split | 0-1 | 0.300000012 | how much of training data to hold off as validation data |
min_iter | 1-max_iter | 10 | minimum number of training iterations to run |
max_iter | >=min_iter | 500 | maximum number of training iterations to run |
cost_function | exponential logistic | exponential | select cost function for boosting weight update |
short_circuit_window | 0- | 0 | short circuit (stop) training if no improvement for N iter (0 off) |
trace_training_acc | true|false|1|0 | false | trace the accuracy of the training set as well as validation |
weak_learner | perceptron, bagging, boosting, naive_bayes, stump, decision_tree, glz, boosted_stumps, null, onevsall, fasttext | | type of the weak learner to boost |
See also : Boosting on Wikipedia.
The following parameters configure the perceptron (neural network) classifier type.

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true|false|1|0 | false | whether or not to profile |
validate | true|false|1|0 | false | perform expensive internal validation |
validation_split | 0-1 | 0.300000012 | how much of training data to hold off as validation data |
min_iter | 1-max_iter | 10 | minimum number of training iterations to run |
max_iter | >=min_iter | 100 | maximum number of training iterations to run |
learning_rate | real | 0.00999999978 | positive: rate of learning relative to dataset size; negative for absolute |
arch | (see doc) | %i | hidden unit specification; %i=in vars, %o=out vars; eg 5_10 |
activation | logsig tanh tanhs identity softmax nonstandard | tanh | activation function for neurons |
output_activation | logsig tanh tanhs identity softmax nonstandard | tanh | activation function for output layer of neurons |
decorrelate | true|false|1|0 | true | decorrelate the features before training |
normalize | true|false|1|0 | true | normalize to zero mean and unit std before training |
batch_size | 0.0-1.0 or 1 - nvectors | 1024 | number of samples in each "mini batch" for stochastic updates |
target_value | 0.0-1.0 | 0.800000012 | the output for a 1 that we ask the network to provide |
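As a sketch, a perceptron configuration with an explicit hidden-layer architecture might look like the following; the key name my_nn is illustrative, and "%i_10" relies on the arch substitution described above (a first hidden layer as wide as the input, then a layer of 10 units):

```
{
    "my_nn": {
        "_note": "illustrative sketch of a perceptron configuration",
        "type": "perceptron",
        "arch": "%i_10",
        "activation": "tanh",
        "learning_rate": 0.01,
        "max_iter": 100,
        "batch_size": 256
    }
}
```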
The following parameters configure the naive_bayes classifier type.

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true|false|1|0 | false | whether or not to profile |
validate | true|false|1|0 | false | perform expensive internal validation |
trace | 0- | 0 | trace execution of training in a very fine-grained fashion |
feature_prop | 0-1 | 1 | which proportion of features do we look at |
Note that our version of the Naive Bayes Classifier only supports discrete
features. Numerical-valued columns (types NUMBER
and INTEGER
) are accepted,
but they will be discretized prior to training. To do so, we will simply split
all the values in two, using the threshold that provides the best separation
of classes. You can always do your own discretization, for instance using a
CASE
expression.
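For instance, assuming a classifier.train-style trainingData query that produces features and label columns, a numeric column could be bucketed by hand before training; the dataset and column names below are hypothetical:

```
{
    "trainingData": "SELECT {CASE WHEN age < 30 THEN 'young' WHEN age < 60 THEN 'middle' ELSE 'old' END AS age_bucket} AS features, label FROM my_dataset"
}
```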
The following parameters configure the fasttext classifier type.

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true|false|1|0 | false | whether or not to profile |
validate | true|false|1|0 | false | perform expensive internal validation |
epoch | 1+ | 5 | Number of iterations over the data |
dims | 1+ | 100 | Number of dimensions in the embedding |
verbosity | 0+ | 0 | Level of verbosity in standard output |
Note that our version of the Fast Text Classifier only supports feature counts, and currently does not support regression.
See also : fastText on arXiv.
The default, overrideable configurationFile contains the following predefined configurations, which can be accessed by name with the algorithm parameter:
```
{
"nn": {
"_note": "Neural Network",
"type": "perceptron",
"arch": 50,
"verbosity": 3,
"max_iter": 100,
"learning_rate": 0.01,
"batch_size": 10
},
"bbdt": {
"_note": "Bagged boosted decision trees",
"type": "bagging",
"verbosity": 3,
"weak_learner": {
"type": "boosting",
"verbosity": 3,
"weak_learner": {
"type": "decision_tree",
"max_depth": 3,
"verbosity": 0,
"update_alg": "gentle",
"random_feature_propn": 0.5
},
"min_iter": 5,
"max_iter": 30
},
"num_bags": 5
},
"bbdt2": {
"_note": "Bagged boosted decision trees",
"type": "bagging",
"verbosity": 1,
"weak_learner": {
"type": "boosting",
"verbosity": 3,
"weak_learner": {
"type": "decision_tree",
"max_depth": 5,
"verbosity": 0,
"update_alg": "gentle",
"random_feature_propn": 0.8
},
"min_iter": 5,
"max_iter": 10,
"verbosity": 0
},
"num_bags": 32
},
"bbdt_d2": {
"_note": "Bagged boosted decision trees",
"type": "bagging",
"verbosity": 3,
"weak_learner": {
"type": "boosting",
"verbosity": 3,
"weak_learner": {
"type": "decision_tree",
"max_depth": 2,
"verbosity": 0,
"update_alg": "gentle",
"random_feature_propn": 1
},
"min_iter": 5,
"max_iter": 30
},
"num_bags": 5
},
"bbdt_d5": {
"_note": "Bagged boosted decision trees",
"type": "bagging",
"verbosity": 3,
"weak_learner": {
"type": "boosting",
"verbosity": 3,
"weak_learner": {
"type": "decision_tree",
"max_depth": 5,
"verbosity": 0,
"update_alg": "gentle",
"random_feature_propn": 1
},
"min_iter": 5,
"max_iter": 30
},
"num_bags": 5
},
"bdt": {
"_note": "Bagged decision trees",
"type": "bagging",
"verbosity": 3,
"weak_learner": {
"type": "decision_tree",
"verbosity": 0,
"max_depth": 5
},
"num_bags": 20
},
"dt": {
"_note": "Plain decision tree",
"type": "decision_tree",
"max_depth": 8,
"verbosity": 3,
"update_alg": "prob"
},
"glz_linear": {
"_note": "Generalized Linear Model, linear link function, to be used for 'regression' mode",
"type": "glz",
"link_function": "linear",
"verbosity": 3,
"normalize ": "true",
"regularization" = "l2"
},
"glz": {
"_note": "Generalized Linear Model. Very smooth but needs very good features",
"type": "glz",
"verbosity": 3,
"normalize ": " true",
"regularization" = "l2"
},
"glz2": {
"_note": "Generalized Linear Model. Very smooth but needs very good features",
"type": "glz",
"verbosity": 3
},
"bglz": {
"_note": "Bagged random GLZ",
"type": "bagging",
"verbosity": 1,
"validation_split": 0.1,
"weak_learner": {
"type": "glz",
"feature_proportion": 1.0,
"verbosity": 0
},
"num_bags": 32
},
"bs": {
"_note": "Boosted stumps",
"type": "boosted_stumps",
"min_iter": 10,
"max_iter": 200,
"update_alg": "gentle",
"verbosity": 3
},
"bs2": {
"_note": "Boosted stumps",
"type": "boosting",
"verbosity": 3,
"weak_learner": {
"type": "decision_tree",
"max_depth": 1,
"verbosity": 0,
"update_alg": "gentle"
},
"min_iter": 5,
"max_iter": 300,
"trace_training_acc": "true"
},
"bbs2": {
"_note": "Bagged boosted stumps",
"type": "bagging",
"num_bags": 5,
"weak_learner": {
"type": "boosting",
"verbosity": 3,
"weak_learner": {
"type": "decision_tree",
"max_depth": 1,
"verbosity": 0,
"update_alg": "gentle"
},
"min_iter": 5,
"max_iter": 300,
"trace_training_acc": "true"
}
},
"naive_bayes": {
"_note": "Naive Bayes",
"type": "naive_bayes",
"feature_prop": "1",
"verbosity": 3
}
}
```
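So, for example, passing one of these names in the algorithm parameter while leaving configuration and configurationFile empty selects the corresponding predefined configuration. The fragment below, with other required parameters such as trainingData omitted, would train the bagged boosted decision trees defined above:

```
{
    "algorithm": "bbdt"
}
```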
This section describes how you can set different weights for each example in your training set, either based upon the label or based upon a calculation over the row, to enable finer control over which examples the classifier makes the most effort to classify.
The equalizationFactor
parameter can be used to adjust an unbalanced training
set to be more balanced for training, which frequently has the effect of
requiring the classifiers to focus more on separating the positive and negative
classes rather than getting really high scores for the dominant class.
The optional weight expression in the trainingData parameter of the configuration must evaluate to a positive number that indicates how many examples the row counts for. For example, a single row with a weight of 2 has the same effect as the same row appearing twice, each time with a weight of 1.
Note that only the relative weights matter. Before the classifier is trained, the weights will be normalized so that they sum to 1 to avoid numerical issues in the classifier training process.
If the two weighting methods are combined, then the weight
expression will be
used to set the relative weight per example within its label class, and the
equalizationFactor
will adjust the relative weight of each class.
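Putting the two together, the relevant parts of a training configuration might be sketched as follows; the dataset name, the importance column and the equalizationFactor value are hypothetical, and the expression selected as weight must evaluate to a positive number for each row:

```
{
    "trainingData": "SELECT {* EXCLUDING (label, importance)} AS features, label, importance AS weight FROM my_dataset",
    "equalizationFactor": 0.5
}
```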
See also:

* The classifier.train procedure type trains a classifier.
* The classifier.test procedure type allows the accuracy of a predictor to be tested against held-out data.
* The classifier.experiment procedure type performs both training and testing of a classifier.