MLDB supports many algorithms to solve supervised learning tasks for both classification and regression. There are two procedures available to train a model:

- the `classifier.experiment` procedure type performs both training and testing of a classifier.
- the `classifier.train` procedure type trains a classifier. If testing is required, it needs to be done manually with the `classifier.test` procedure type.

Both of those procedures share the configuration keys `algorithm`, `configuration`, `configurationFile` and `equalizationFactor`.
This document explains how to use them.

- Methods of configuring a classifier training
- Configuration file contents
- Algorithms
- Default configuration file
- Training Weighting

There are three ways of configuring which classifier will be trained:

- Leave the `configuration` and `configurationFile` parameters empty, and choose a standard algorithm configuration by name. (See below for the contents of the default `configurationFile`.)
- Put the configuration inline in the `configuration` parameter (JSON) and set `algorithm` to either empty (if the configuration is at the top level) or to the dot-separated path if it's not at the top level. See below for details on specifying your own `configuration`.
- Put the configuration in an external resource identified by the `configurationFile` parameter, and set `algorithm` as in the second option. See below for details on specifying your own `configurationFile`.

A `configuration` JSON object or the contents of a `configurationFile` looks like this
(see below for the contents of the default, overrideable `configurationFile`):

```
{
    "algorithm_name": {
        "type": "classifier_type",
        "parameter": "value",
        ...
    },
    ...
}
```
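To illustrate how a dot-separated `algorithm` path selects a nested entry inside such an object, here is a small self-contained Python sketch (not MLDB code; the `dt` and `ensemble.bdt` names are made up for the example):

```python
def resolve_algorithm(configuration, algorithm):
    """Resolve a dot-separated algorithm path inside a configuration dict.

    An empty path means the configuration itself is the classifier config.
    """
    node = configuration
    if algorithm:
        for key in algorithm.split("."):
            node = node[key]  # a KeyError here means the path is wrong
    return node

# Hypothetical configuration with one top-level and one nested entry.
configuration = {
    "dt": {"type": "decision_tree", "max_depth": 8},
    "ensemble": {
        "bdt": {"type": "bagging", "num_bags": 20},
    },
}

print(resolve_algorithm(configuration, "dt")["type"])            # decision_tree
print(resolve_algorithm(configuration, "ensemble.bdt")["type"])  # bagging
```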

The classifier training procedure includes support for the following types of classifiers.

These classifiers tend to be high-performance implementations of well-known classifiers which train and predict fast, and are often a good default choice when a generic classification step is required.

The `decision_tree` classifier type accepts the following parameters:

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true\|false\|1\|0 | false | whether or not to profile |
validate | true\|false\|1\|0 | false | perform expensive internal validation |
trace | 0- | 0 | trace execution of training in a very fine-grained fashion |
max_depth | 0- or -1 | -1 | maximum tree depth. -1 means go until data separated |
update_alg | normal, gentle, prob | prob | select the type of output that the tree gives |
random_feature_propn | 0.0-1.0 | 1 | proportion of the features to enable (for random forests) |

The `update_alg` parameter can take three different values: `prob`, `normal` and `gentle`.
Here is how they work, using an example with a leaf node that contains 8 positive
and 2 negative labels:

- `prob` is the proportion of positive classes: \(\#pos/(\#pos + \#neg)\). For our example the output is \(8/10 = 0.8\).
- `normal` uses the margin between both probabilities: 80% positives and 20% negatives give \(0.8 - 0.2 = 0.6\) and \(1 - 0.6 = 0.4\). These scores are fed to a function \(f\) of the exponential family (bounded between \(-\infty\) and \(+\infty\)). For our example the output is \(f(0.6) - f(0.4)\).
- `gentle` also uses the margin, but with a different function, \(g\), bounded between -1 and 1. In an ensemble, such as boosting or a random forest, it is recommended to use this value. For our example the output is \(g(0.6) - g(0.4)\).

For more details, please refer to Friedman, Hastie, Tibshirani, "Additive Logistic Regression: A Statistical View of Boosting", The Annals of Statistics 2000, Vol. 28, No. 2, 337–407.
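The arithmetic for the example leaf can be sketched as follows. The exact \(f\) and \(g\) are not spelled out above, so this sketch assumes the half log-odds transform \(f(x) = \tfrac{1}{2}\ln(x/(1-x))\) from the Friedman et al. paper and the simple bounded margin \(g(x) = 2x - 1\); both are illustrative assumptions, not MLDB's exact functions:

```python
import math

pos, neg = 8, 2  # leaf node contents from the example above

# prob: proportion of positive labels
prob = pos / (pos + neg)                       # 0.8

margin_pos = prob - (1 - prob)                 # 0.6
margin_neg = 1 - margin_pos                    # 0.4

# normal: margins through an exponential-family function f.
# Assumed here: f(x) = 0.5 * ln(x / (1 - x)) (half log-odds).
def f(x):
    return 0.5 * math.log(x / (1 - x))

normal = f(margin_pos) - f(margin_neg)         # ≈ 0.405 with the assumed f

# gentle: margins through a function bounded in [-1, 1].
# Assumed here: g(x) = 2x - 1.
def g(x):
    return 2 * x - 1

gentle = g(margin_pos) - g(margin_neg)         # 0.4 with the assumed g

print(prob, normal, gentle)
```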

The `glz` (generalized linear model) classifier type accepts the following parameters:

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true\|false\|1\|0 | false | whether or not to profile |
validate | true\|false\|1\|0 | false | perform expensive internal validation |
add_bias | true\|false\|1\|0 | true | add a constant bias term to the classifier? |
decode | true\|false\|1\|0 | true | run the decoder (link function) after classification? |
link_function | logit, probit, comp_log_log, linear, log | logit | which link function to use for the output function |
regularization | none, l1, l2 | l2 | type of regularization on the weights (L1 is slower due to an iterative algorithm) |
regularization_factor | -1 to infinity | 1e-05 | regularization factor to use. auto-determined if negative (slower). the bigger this value is, the more regularization on the weights |
max_regularization_iteration | 1 to infinity | 1000 | maximum number of iterations for the L1 regularization |
regularization_epsilon | positive number | 0.0001 | smallest weight update before assuming convergence for the L1 iterative algorithm |
normalize | true\|false\|1\|0 | true | normalize features to have zero mean and unit variance for greater numeric stability (slower training but recommended with L1 regularization) |
condition | true\|false\|1\|0 | false | condition features to have no correlation for greater numeric stability (but much slower training) |
feature_proportion | 0 to 1 | 1 | use only a (random) portion of available features when training the classifier |

The different options for the `link_function` parameter are defined as follows:

Name | Link Function | Activation Function (inverse of the link function) |
---|---|---|
logit | \(g(x)=\ln \left( \frac{x}{1-x} \right)\) | \(g^{-1}(x) = \frac{1}{1 + e^{-x}}\) |
probit | \(g(x)=\Phi^{-1}(x)\) where \(\Phi\) is the normal distribution's CDF | \(g^{-1}(x) = \Phi (x)\) |
comp_log_log | \(g(x)=\ln \left( - \ln \left( 1-x \right) \right)\) | \(g^{-1}(x) = 1 - e^{-e^x}\) |
linear | \(g(x)=x\) | \(g^{-1}(x) = x\) |
log | \(g(x)=\ln x\) | \(g^{-1}(x) = e^x\) |
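To make the table concrete, here is a small self-contained Python check (not MLDB code) that each activation function really is the inverse of its link function. Probit is omitted because its inverse requires the normal quantile function:

```python
import math

# Link functions g and their inverse activations from the table above.
links = {
    "logit":        (lambda x: math.log(x / (1 - x)),
                     lambda y: 1 / (1 + math.exp(-y))),
    "comp_log_log": (lambda x: math.log(-math.log(1 - x)),
                     lambda y: 1 - math.exp(-math.exp(y))),
    "linear":       (lambda x: x,
                     lambda y: y),
    "log":          (lambda x: math.log(x),
                     lambda y: math.exp(y)),
}

# Each activation undoes its link: g_inv(g(x)) == x for x in (0, 1).
for name, (g, g_inv) in links.items():
    for x in (0.1, 0.3, 0.9):
        assert abs(g_inv(g(x)) - x) < 1e-12, name
print("all round trips ok")
```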

The bagging algorithm, also known as bootstrap aggregating, is used in conjunction with another algorithm, for
instance with a decision tree to create *bagged decision trees*. There is an example of this in the default configuration file under the `bdt` key.

The `bagging` classifier type accepts the following parameters:

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true\|false\|1\|0 | false | whether or not to profile |
validate | true\|false\|1\|0 | false | perform expensive internal validation |
num_bags | N>=1 | 10 | number of bags to divide classifier into |
validation_split | 0-1 | 0.35 | how much of training data to hold off as validation data |
weak_learner | perceptron, bagging, boosting, naive_bayes, stump, decision_tree, glz, boosted_stumps, null, onevsall, fasttext | | configuration of the weak learner to train on each bag |

*See also*: Bagging on Wikipedia.
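To illustrate the idea, the following is a self-contained toy sketch of bootstrap aggregating (not MLDB's implementation): each bag is a sample drawn with replacement, a weak learner (here a one-dimensional decision stump) is trained per bag, and the bags vote at prediction time:

```python
import random

def train_stump(xs, ys):
    """A deliberately weak learner: the threshold stump with lowest error."""
    best = None
    for t in xs:
        for sign in (1, -1):
            err = sum(1 for x, y in zip(xs, ys)
                      if (sign if x >= t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x, t=t, sign=sign: (sign if x >= t else -sign)

def train_bagged(xs, ys, num_bags=10, rng=None):
    rng = rng or random.Random(0)
    n = len(xs)
    stumps = []
    for _ in range(num_bags):
        # Bootstrap sample: draw n examples with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    # Aggregate by majority vote over all bags.
    return lambda x: 1 if sum(s(x) for s in stumps) >= 0 else -1

# Toy data: negatives below 0.5, positives at or above 0.5.
xs = [i / 20 for i in range(20)]
ys = [-1] * 10 + [1] * 10
predict = train_bagged(xs, ys, num_bags=5)
print(predict(0.0), predict(0.95))  # -1 1
```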

The `boosting` classifier type accepts the following parameters:

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true\|false\|1\|0 | false | whether or not to profile |
validate | true\|false\|1\|0 | false | perform expensive internal validation |
validation_split | 0-1 | 0.3 | how much of training data to hold off as validation data |
min_iter | 1-max_iter | 10 | minimum number of training iterations to run |
max_iter | >=min_iter | 500 | maximum number of training iterations to run |
cost_function | exponential, logistic | exponential | select cost function for boosting weight update |
short_circuit_window | 0- | 0 | short circuit (stop) training if no improvement for N iterations (0 = off) |
trace_training_acc | true\|false\|1\|0 | false | trace the accuracy of the training set as well as validation |
weak_learner | perceptron, bagging, boosting, naive_bayes, stump, decision_tree, glz, boosted_stumps, null, onevsall, fasttext | | configuration of the weak learner to boost |

*See also*: Boosting on Wikipedia.
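The `exponential` cost function corresponds to AdaBoost-style weight updates. The following toy sketch (illustrative only, not MLDB's implementation) shows the loop over one-dimensional decision stumps: each iteration picks the stump with the lowest weighted error, then upweights the examples it misclassified:

```python
import math

def train_boosted(xs, ys, max_iter=10):
    n = len(xs)
    w = [1.0 / n] * n                  # example weights, start uniform
    ensemble = []                      # list of (alpha, threshold, sign)
    for _ in range(max_iter):
        # Pick the stump (threshold, sign) with the lowest weighted error.
        best = None
        for t in xs:
            for sign in (1, -1):
                pred = [sign if x >= t else -sign for x in xs]
                err = sum(wi for wi, p, y in zip(w, pred, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, t, sign, pred)
        err, t, sign, pred = best
        if err == 0:                   # perfect weak learner: keep it and stop
            ensemble.append((1.0, t, sign))
            break
        if err >= 0.5:                 # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, sign))
        # Exponential weight update: upweight misclassified examples.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, pred)]
        total = sum(w)
        w = [wi / total for wi in w]

    def predict(x):
        score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
        return 1 if score >= 0 else -1
    return predict

xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [-1, -1, -1, 1, 1, 1, 1, -1]      # not separable by any single stump
predict = train_boosted(xs, ys, max_iter=3)
print([predict(x) for x in xs] == ys)  # True
```

Note how no single stump can represent the positive interval in the middle of the data, but a weighted vote of three stumps can.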

The `perceptron` classifier type (a multi-layer neural network) accepts the following parameters:

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true\|false\|1\|0 | false | whether or not to profile |
validate | true\|false\|1\|0 | false | perform expensive internal validation |
validation_split | 0-1 | 0.3 | how much of training data to hold off as validation data |
min_iter | 1-max_iter | 10 | minimum number of training iterations to run |
max_iter | >=min_iter | 100 | maximum number of training iterations to run |
learning_rate | real | 0.01 | positive: rate of learning relative to dataset size; negative: absolute |
arch | (see doc) | %i | hidden unit specification; %i=in vars, %o=out vars; eg 5_10 |
activation | logsig tanh tanhs identity softmax nonstandard | tanh | activation function for neurons |
output_activation | logsig tanh tanhs identity softmax nonstandard | tanh | activation function for output layer of neurons |
decorrelate | true\|false\|1\|0 | true | decorrelate the features before training |
normalize | true\|false\|1\|0 | true | normalize to zero mean and unit std before training |
batch_size | 0.0-1.0 or 1 - nvectors | 1024 | number of samples in each "mini batch" for stochastic training |
target_value | 0.0-1.0 | 0.8 | the output for a 1 that we ask the network to provide |
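One plausible reading of the `arch` specification can be sketched in Python. This parser is a hypothetical illustration of the format described in the table (underscore-separated layer sizes, with `%i` standing for the number of input variables and `%o` for the number of output variables), not MLDB's actual code:

```python
def parse_arch(arch, n_inputs, n_outputs):
    """Expand an arch spec like "5_10" or "%i" into hidden layer sizes."""
    sizes = []
    for part in str(arch).split("_"):
        part = part.replace("%i", str(n_inputs)).replace("%o", str(n_outputs))
        sizes.append(int(part))
    return sizes

print(parse_arch("5_10", 20, 2))  # [5, 10]: two hidden layers
print(parse_arch("%i", 20, 2))    # [20]: one hidden layer as wide as the input
```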

The `naive_bayes` classifier type accepts the following parameters:

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true\|false\|1\|0 | false | whether or not to profile |
validate | true\|false\|1\|0 | false | perform expensive internal validation |
trace | 0- | 0 | trace execution of training in a very fine-grained fashion |
feature_prop | 0.0-1.0 | 1 | which proportion of features do we look at |

Note that our version of the Naive Bayes Classifier only supports discrete
features. Numerical-valued columns (types `NUMBER` and `INTEGER`) are accepted,
but they will be discretized prior to training. To do so, we will simply split
all the values in two, using the threshold that provides the best separation
of classes. You can always do your own discretization, for instance using a
`CASE` expression.
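The two-way split described above can be sketched as follows. This is illustrative Python with a made-up `best_split_threshold` helper; the criterion used here is lowest misclassification count, and MLDB's actual notion of "best separation" may differ:

```python
def best_split_threshold(values, labels):
    """Pick the threshold that best separates a boolean-labelled column."""
    best_t, best_err = None, None
    for t in sorted(set(values)):
        # Treat "value >= t" as predicting the positive class.
        err = sum(1 for v, y in zip(values, labels) if (v >= t) != y)
        err = min(err, len(values) - err)  # either side may be positive
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

# Hypothetical numeric column and labels.
ages = [18, 22, 25, 40, 45, 60]
labels = [False, False, False, True, True, True]
print(best_split_threshold(ages, labels))  # 40
```

With a threshold in hand, the equivalent manual discretization in SQL would be a `CASE` expression such as `CASE WHEN age >= 40 THEN 'older' ELSE 'younger' END` (column name hypothetical).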

The `fasttext` classifier type accepts the following parameters:

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true\|false\|1\|0 | false | whether or not to profile |
validate | true\|false\|1\|0 | false | perform expensive internal validation |
epoch | 1+ | 5 | Number of iterations over the data |
dims | 1+ | 100 | Number of dimensions in the embedding |

Note that our version of the Fast Text Classifier only supports feature counts, and currently does not support regression.

*See also*: fastText on arXiv.

The default, overrideable `configurationFile` contains the following predefined configurations, which can be accessed by name with the `algorithm` parameter:

```
{
    "nn": {
        "_note": "Neural Network",
        "type": "perceptron",
        "arch": 50,
        "verbosity": 3,
        "max_iter": 100,
        "learning_rate": 0.01,
        "batch_size": 10
    },
    "bbdt": {
        "_note": "Bagged boosted decision trees",
        "type": "bagging",
        "verbosity": 3,
        "weak_learner": {
            "type": "boosting",
            "verbosity": 3,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 3,
                "verbosity": 0,
                "update_alg": "gentle",
                "random_feature_propn": 0.5
            },
            "min_iter": 5,
            "max_iter": 30
        },
        "num_bags": 5
    },
    "bbdt2": {
        "_note": "Bagged boosted decision trees",
        "type": "bagging",
        "verbosity": 1,
        "weak_learner": {
            "type": "boosting",
            "verbosity": 3,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 5,
                "verbosity": 0,
                "update_alg": "gentle",
                "random_feature_propn": 0.8
            },
            "min_iter": 5,
            "max_iter": 10,
            "verbosity": 0
        },
        "num_bags": 32
    },
    "bbdt_d2": {
        "_note": "Bagged boosted decision trees",
        "type": "bagging",
        "verbosity": 3,
        "weak_learner": {
            "type": "boosting",
            "verbosity": 3,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 2,
                "verbosity": 0,
                "update_alg": "gentle",
                "random_feature_propn": 1
            },
            "min_iter": 5,
            "max_iter": 30
        },
        "num_bags": 5
    },
    "bbdt_d5": {
        "_note": "Bagged boosted decision trees",
        "type": "bagging",
        "verbosity": 3,
        "weak_learner": {
            "type": "boosting",
            "verbosity": 3,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 5,
                "verbosity": 0,
                "update_alg": "gentle",
                "random_feature_propn": 1
            },
            "min_iter": 5,
            "max_iter": 30
        },
        "num_bags": 5
    },
    "bdt": {
        "_note": "Bagged decision trees",
        "type": "bagging",
        "verbosity": 3,
        "weak_learner": {
            "type": "decision_tree",
            "verbosity": 0,
            "max_depth": 5
        },
        "num_bags": 20
    },
    "dt": {
        "_note": "Plain decision tree",
        "type": "decision_tree",
        "max_depth": 8,
        "verbosity": 3,
        "update_alg": "prob"
    },
    "glz_linear": {
        "_note": "Generalized Linear Model, linear link function, to be used for 'regression' mode",
        "type": "glz",
        "link_function": "linear",
        "verbosity": 3,
        "normalize": "true",
        "regularization": "l2"
    },
    "glz": {
        "_note": "Generalized Linear Model. Very smooth but needs very good features",
        "type": "glz",
        "verbosity": 3,
        "normalize": "true",
        "regularization": "l2"
    },
    "glz2": {
        "_note": "Generalized Linear Model. Very smooth but needs very good features",
        "type": "glz",
        "verbosity": 3
    },
    "bglz": {
        "_note": "Bagged random GLZ",
        "type": "bagging",
        "verbosity": 1,
        "validation_split": 0.1,
        "weak_learner": {
            "type": "glz",
            "feature_proportion": 1.0,
            "verbosity": 0
        },
        "num_bags": 32
    },
    "bs": {
        "_note": "Boosted stumps",
        "type": "boosted_stumps",
        "min_iter": 10,
        "max_iter": 200,
        "update_alg": "gentle",
        "verbosity": 3
    },
    "bs2": {
        "_note": "Boosted stumps",
        "type": "boosting",
        "verbosity": 3,
        "weak_learner": {
            "type": "decision_tree",
            "max_depth": 1,
            "verbosity": 0,
            "update_alg": "gentle"
        },
        "min_iter": 5,
        "max_iter": 300,
        "trace_training_acc": "true"
    },
    "bbs2": {
        "_note": "Bagged boosted stumps",
        "type": "bagging",
        "num_bags": 5,
        "weak_learner": {
            "type": "boosting",
            "verbosity": 3,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 1,
                "verbosity": 0,
                "update_alg": "gentle"
            },
            "min_iter": 5,
            "max_iter": 300,
            "trace_training_acc": "true"
        }
    },
    "naive_bayes": {
        "_note": "Naive Bayes",
        "type": "naive_bayes",
        "feature_prop": "1",
        "verbosity": 3
    }
}
```
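As a usage sketch, one of these named configurations can be selected via the `algorithm` parameter of a `classifier.train` procedure. The Python below only builds the JSON payload; the dataset name, column names, and model file URL are hypothetical:

```python
import json

# Hypothetical payload for creating a classifier.train procedure via
# MLDB's REST API; "my_training_data", the label column and the file URL
# are made up for the example.
procedure_config = {
    "type": "classifier.train",
    "params": {
        "trainingData": """
            SELECT {* EXCLUDING (label)} AS features, label
            FROM my_training_data
        """,
        "algorithm": "bbdt",  # name from the default configurationFile
        "modelFileUrl": "file://models/my_classifier.cls",
        "mode": "boolean",
    },
}

# The payload would then be PUT to /v1/procedures/<id> on the MLDB host.
print(json.dumps(procedure_config, indent=2)[:60])
```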

This section describes how you can set different weights for each example in your training set, either based upon the label or based upon a calculation over the row, to enable finer control over which examples the classifier makes the most effort to classify.

The `equalizationFactor` parameter can be used to adjust an unbalanced training
set to be more balanced for training, which frequently has the effect of
requiring the classifiers to focus more on separating the positive and negative
classes rather than getting really high scores for the dominant class.

- Setting this parameter to 0.0 weights the examples according to the `weight` expression in `trainingData`.
- Setting this parameter to 1.0 will adjust the weights such that each class has exactly identical total weight.
- Setting it to something else (0.5, the default, is a good value for most unbalanced training set use cases) will multiply the weights of each class according to \( w_{\textrm{class}} \rightarrow w_{\textrm{class}} \times \left( \sum {w_{\textrm{class}}} \right) ^{-\textrm{equalizationFactor}} \)

The optional `weight` expression in the `trainingData` parameter of the
configuration must evaluate to a positive number that indicates how many
examples the row counts for. For example, a single row with a weight of 2 and the
same row duplicated twice, each with a weight of 1, will have the same effect.

Note that only the relative weights matter. Before the classifier is trained, the weights will be normalized so that they sum to 1 to avoid numerical issues in the classifier training process.

If the two weighting methods are combined, then the `weight` expression will be
used to set the relative weight per example within its label class, and the
`equalizationFactor` will adjust the relative weight of each class.
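The combined weighting can be sketched numerically (plain Python, not MLDB code): apply the per-example `weight`, multiply by the class equalization factor, then normalize so the weights sum to 1:

```python
def final_weights(labels, weights, equalization_factor=0.5):
    # Total weight per class, used for the equalization multiplier.
    class_totals = {}
    for lbl, w in zip(labels, weights):
        class_totals[lbl] = class_totals.get(lbl, 0.0) + w
    # w_class -> w_class * (sum w_class) ** -equalizationFactor
    adjusted = [w * class_totals[lbl] ** -equalization_factor
                for lbl, w in zip(labels, weights)]
    # Normalize so the weights sum to 1, as done before training.
    total = sum(adjusted)
    return [w / total for w in adjusted]

labels  = [1, 1, 1, 1, 0]   # unbalanced: four positives, one negative
weights = [1.0] * 5         # the `weight` expression defaulted to 1

# With equalizationFactor = 1 both classes end up with equal total weight.
w = final_weights(labels, weights, 1.0)
pos_total = sum(wi for wi, lbl in zip(w, labels) if lbl == 1)
print(round(pos_total, 6))  # 0.5
```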

- The `classifier.train` procedure type trains a classifier.
- The `classifier.test` procedure type allows the accuracy of a predictor to be tested against held-out data.
- The `classifier.experiment` procedure type performs both training and testing of a classifier.