MLDB supports many algorithms to solve supervised learning tasks for both classification and regression. There are two procedures available to train a model:

* The classifier.experiment procedure type performs both training and testing of a classifier.
* The classifier.train procedure type trains a classifier. If testing is required, it needs to be done manually with the classifier.test procedure type.

Both of these procedures share the configuration keys algorithm, configuration, configurationFile and equalizationFactor. This document explains how to use them.
There are three ways of configuring which classifier will be trained:

1. Leave configuration and configurationFile empty, and choose a standard algorithm configuration by name with the algorithm parameter (see below for the contents of the default configurationFile).
2. Fill in the configuration parameter (JSON) and set algorithm to either empty (if the configuration is at the top level) or to the dot-separated path if it's not at the top level. See below for details on specifying your own configuration.
3. Fill in the configurationFile parameter, and set the algorithm as in number 2. See below for details on specifying your own configurationFile.

A configuration JSON object or the contents of a configurationFile looks like this (see below for the contents of the default, overrideable configurationFile):
```
{
    "algorithm_name": {
        "type": "classifier_type",
        "parameter": "value",
        ...
    },
    ...
}
```
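For instance, option 2 above could look like the following fragment of a procedure configuration. This is only a sketch: the key name my_dt is illustrative, and required parameters such as trainingData are omitted.

```
{
    "algorithm": "my_dt",
    "configuration": {
        "my_dt": {
            "_note": "illustrative key name; algorithm holds its dot-separated path",
            "type": "decision_tree",
            "max_depth": 8,
            "update_alg": "prob"
        }
    }
}
```

If the classifier configuration were instead placed directly at the top level of configuration, algorithm would simply be left empty.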
The classifier training procedure supports the following types of classifiers. These tend to be high-performance implementations of well-known algorithms that train and predict quickly, and they are often a good default choice when a generic classification step is required.

The parameters below configure the decision_tree classifier type.
Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true|false|1|0 | false | whether or not to profile |
validate | true|false|1|0 | false | perform expensive internal validation |
trace | 0- | 0 | trace execution of training in a very fine-grained fashion |
max_depth | 0- or -1 | -1 | give maximum tree depth. -1 means go until data separated |
update_alg | normal gentle prob | prob | select the type of output that the tree gives |
random_feature_propn | 0.0-1.0 | 1 | proportion of the features to enable (for random forests) |
The update_alg parameter can take three different values: prob, normal and gentle. Here is how they work, using an example with a leaf node that contains 8 positive and 2 negative labels:

* prob is the proportion of positive classes: \(\#pos/(\#pos + \#neg)\). For our example the output is \(8/10=0.8\).
* normal uses the margin between both probabilities. For our example, 80% positives and 20% negatives give scores of \(0.8 - 0.2 = 0.6\) and \(1 - 0.6 = 0.4\). These scores are fed to a function \(f\) of the exponential family (bounded between -infinity and +infinity), so the output is \(f(0.6) - f(0.4)\).
* gentle also uses the margin, but with a different function \(g\), bounded between -1 and 1. In an ensemble, such as boosting or random forest, it is recommended to use this value. For our example the output is \(g(0.6) - g(0.4)\).

For more details, please refer to Friedman, Hastie, Tibshirani, "Additive Logistic Regression: A Statistical View of Boosting", The Annals of Statistics 2000, Vol. 28, No. 2, 337–407.
The following parameters configure the glz (generalized linear model) classifier type.

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true|false|1|0 | false | whether or not to profile |
validate | true|false|1|0 | false | perform expensive internal validation |
add_bias | true|false|1|0 | true | add a constant bias term to the classifier? |
decode | true|false|1|0 | true | run the decoder (link function) after classification? |
link_function | logit probit comp_log_log linear log | logit | which link function to use for the output function |
regularization | none l1 l2 | l2 | type of regularization on the weights (L1 is slower due to an iterative algorithm) |
regularization_factor | -1 to infinite | 1.0000000000000001e-05 | regularization factor to use. auto-determined if negative (slower). the bigger this value is, the more regularization on the weights |
max_regularization_iteration | 1 to infinite | 1000 | maximum number of iterations for the L1 regularization |
regularization_epsilon | positive number | 0.0001 | smallest weight update before assuming convergence for the L1 iterative algorithm |
normalize | true|false|1|0 | true | normalize features to have zero mean and unit variance for greater numeric stability (slower training but recommended with L1 regularization) |
condition | true|false|1|0 | false | condition features to have no correlation for greater numeric stability (but much slower training) |
feature_proportion | 0 to 1 | 1 | use only a (random) portion of available features when training classifier |
The different options for the link_function
parameter are defined as follows:
Name | Link Function | Activation Function (inverse of the link function) |
---|---|---|
logit | \[g(x)=\ln \left( \frac{x}{1-x} \right) \] | \[g^{-1}(x) = \frac{1}{1 + e^{-x}}\] |
probit | \(g(x)=\Phi^{-1}(x)\) where \(\Phi\) is the normal distribution's CDF | \[g^{-1}(x) = \Phi (x)\] |
comp_log_log | \[g(x)=\ln \left( - \ln \left( 1-x \right) \right)\] | \[g^{-1}(x) = 1 - e^{-e^x}\] |
linear | \[g(x)=x\] | \[g^{-1}(x) = x\] |
log | \[g(x)=\ln x\] | \[g^{-1}(x) = e^x\] |
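As an illustration, a glz configuration suitable for 'regression' mode with a linear link function might be sketched as follows (compare the glz_linear entry in the default configurationFile shown further below); the key name my_glz is illustrative:

```
{
    "my_glz": {
        "_note": "illustrative sketch of a glz configuration",
        "type": "glz",
        "link_function": "linear",
        "regularization": "l2",
        "normalize": true
    }
}
```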
The bagging algorithm, also known as bootstrap aggregating, is used in conjunction with another algorithm, for
instance with a decision tree to create
bagged decision trees. There is an example of this in the default configuration file for the bdt
key.
Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 54 | verbosity of information from training |
profile | true|false|1|0 | false | whether or not to profile |
validate | true|false|1|0 | false | perform expensive internal validation |
num_bags | N>=1 | 10 | number of bags to divide classifier into |
validation_split | 0-1 | 0.349999994 | how much of training data to hold off as validation data |
weak_learner | perceptron, bagging, boosting, naive_bayes, stump, decision_tree, glz, boosted_stumps, null, onevsall, fasttext | | type of the weak learner to bag |
See also : Bagging on Wikipedia.
The following parameters configure the boosting classifier type.

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true|false|1|0 | false | whether or not to profile |
validate | true|false|1|0 | false | perform expensive internal validation |
validation_split | 0-1 | 0.300000012 | how much of training data to hold off as validation data |
min_iter | 1-max_iter | 10 | minimum number of training iterations to run |
max_iter | >=min_iter | 500 | maximum number of training iterations to run |
cost_function | exponential logistic | exponential | select cost function for boosting weight update |
short_circuit_window | 0- | 0 | short circuit (stop) training if no improvement for N iter (0 off) |
trace_training_acc | true|false|1|0 | false | trace the accuracy of the training set as well as validation |
weak_learner | perceptron, bagging, boosting, naive_bayes, stump, decision_tree, glz, boosted_stumps, null, onevsall, fasttext | | type of the weak learner to boost |
See also : Boosting on Wikipedia.
The following parameters configure the perceptron (neural network) classifier type.

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true|false|1|0 | false | whether or not to profile |
validate | true|false|1|0 | false | perform expensive internal validation |
validation_split | 0-1 | 0.300000012 | how much of training data to hold off as validation data |
min_iter | 1-max_iter | 10 | minimum number of training iterations to run |
max_iter | >=min_iter | 100 | maximum number of training iterations to run |
learning_rate | real | 0.00999999978 | positive: rate of learning relative to dataset size; negative for absolute |
arch | (see doc) | %i | hidden unit specification; %i=in vars, %o=out vars; eg 5_10 |
activation | logsig tanh tanhs identity softmax nonstandard | tanh | activation function for neurons |
output_activation | logsig tanh tanhs identity softmax nonstandard | tanh | activation function for output layer of neurons |
decorrelate | true|false|1|0 | true | decorrelate the features before training |
normalize | true|false|1|0 | true | normalize to zero mean and unit std before training |
batch_size | 0.0-1.0 or 1 - nvectors | 1024 | number of samples in each "mini batch" for stochastic updates |
target_value | 0.0-1.0 | 0.800000012 | the output for a 1 that we ask the network to provide |
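As a sketch, a perceptron configuration with an explicit hidden-layer architecture might look like the following; the key name my_nn is illustrative, and "%i_10" relies on the arch substitution described above (a first hidden layer as wide as the input, then a layer of 10 units):

```
{
    "my_nn": {
        "_note": "illustrative sketch of a perceptron configuration",
        "type": "perceptron",
        "arch": "%i_10",
        "activation": "tanh",
        "learning_rate": 0.01,
        "max_iter": 100,
        "batch_size": 256
    }
}
```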
The following parameters configure the naive_bayes classifier type.

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true|false|1|0 | false | whether or not to profile |
validate | true|false|1|0 | false | perform expensive internal validation |
trace | 0- | 0 | trace execution of training in a very fine-grained fashion |
feature_prop | 0-1 | 1 | which proportion of features do we look at |
Note that our version of the Naive Bayes Classifier only supports discrete
features. Numerical-valued columns (types NUMBER
and INTEGER
) are accepted,
but they will be discretized prior to training. To do so, we will simply split
all the values in two, using the threshold that provides the best separation
of classes. You can always do your own discretization, for instance using a
CASE
expression.
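For instance, assuming a classifier.train-style trainingData query that produces features and label columns, a numeric column could be bucketed by hand before training; the dataset and column names below are hypothetical:

```
{
    "trainingData": "SELECT {CASE WHEN age < 30 THEN 'young' WHEN age < 60 THEN 'middle' ELSE 'old' END AS age_bucket} AS features, label FROM my_dataset"
}
```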
The following parameters configure the fasttext classifier type.

Parameter | Range | Default | Description |
---|---|---|---|
verbosity | 0-5 | 2 | verbosity of information from training |
profile | true|false|1|0 | false | whether or not to profile |
validate | true|false|1|0 | false | perform expensive internal validation |
epoch | 1+ | 5 | Number of iterations over the data |
dims | 1+ | 100 | Number of dimensions in the embedding |
verbosity | 0+ | 0 | Level of verbosity in standard output |
Note that our version of the Fast Text Classifier only supports feature counts, and currently does not support regression.
See also : fastText on arXiv.
The default, overrideable configurationFile contains the following predefined configurations, which can be accessed by name with the algorithm parameter:
```
{
"nn": {
"_note": "Neural Network",
"type": "perceptron",
"arch": 50,
"verbosity": 3,
"max_iter": 100,
"learning_rate": 0.01,
"batch_size": 10
},
"bbdt": {
"_note": "Bagged boosted decision trees",
"type": "bagging",
"verbosity": 3,
"weak_learner": {
"type": "boosting",
"verbosity": 3,
"weak_learner": {
"type": "decision_tree",
"max_depth": 3,
"verbosity": 0,
"update_alg": "gentle",
"random_feature_propn": 0.5
},
"min_iter": 5,
"max_iter": 30
},
"num_bags": 5
},
"bbdt2": {
"_note": "Bagged boosted decision trees",
"type": "bagging",
"verbosity": 1,
"weak_learner": {
"type": "boosting",
"verbosity": 3,
"weak_learner": {
"type": "decision_tree",
"max_depth": 5,
"verbosity": 0,
"update_alg": "gentle",
"random_feature_propn": 0.8
},
"min_iter": 5,
"max_iter": 10,
"verbosity": 0
},
"num_bags": 32
},
"bbdt_d2": {
"_note": "Bagged boosted decision trees",
"type": "bagging",
"verbosity": 3,
"weak_learner": {
"type": "boosting",
"verbosity": 3,
"weak_learner": {
"type": "decision_tree",
"max_depth": 2,
"verbosity": 0,
"update_alg": "gentle",
"random_feature_propn": 1
},
"min_iter": 5,
"max_iter": 30
},
"num_bags": 5
},
"bbdt_d5": {
"_note": "Bagged boosted decision trees",
"type": "bagging",
"verbosity": 3,
"weak_learner": {
"type": "boosting",
"verbosity": 3,
"weak_learner": {
"type": "decision_tree",
"max_depth": 5,
"verbosity": 0,
"update_alg": "gentle",
"random_feature_propn": 1
},
"min_iter": 5,
"max_iter": 30
},
"num_bags": 5
},
"bdt": {
"_note": "Bagged decision trees",
"type": "bagging",
"verbosity": 3,
"weak_learner": {
"type": "decision_tree",
"verbosity": 0,
"max_depth": 5
},
"num_bags": 20
},
"dt": {
"_note": "Plain decision tree",
"type": "decision_tree",
"max_depth": 8,
"verbosity": 3,
"update_alg": "prob"
},
"glz_linear": {
"_note": "Generalized Linear Model, linear link function, to be used for 'regression' mode",
"type": "glz",
"link_function": "linear",
"verbosity": 3,
"normalize ": "true",
"regularization" = "l2"
},
"glz": {
"_note": "Generalized Linear Model. Very smooth but needs very good features",
"type": "glz",
"verbosity": 3,
"normalize ": " true",
"regularization" = "l2"
},
"glz2": {
"_note": "Generalized Linear Model. Very smooth but needs very good features",
"type": "glz",
"verbosity": 3
},
"bglz": {
"_note": "Bagged random GLZ",
"type": "bagging",
"verbosity": 1,
"validation_split": 0.1,
"weak_learner": {
"type": "glz",
"feature_proportion": 1.0,
"verbosity": 0
},
"num_bags": 32
},
"bs": {
"_note": "Boosted stumps",
"type": "boosted_stumps",
"min_iter": 10,
"max_iter": 200,
"update_alg": "gentle",
"verbosity": 3
},
"bs2": {
"_note": "Boosted stumps",
"type": "boosting",
"verbosity": 3,
"weak_learner": {
"type": "decision_tree",
"max_depth": 1,
"verbosity": 0,
"update_alg": "gentle"
},
"min_iter": 5,
"max_iter": 300,
"trace_training_acc": "true"
},
"bbs2": {
"_note": "Bagged boosted stumps",
"type": "bagging",
"num_bags": 5,
"weak_learner": {
"type": "boosting",
"verbosity": 3,
"weak_learner": {
"type": "decision_tree",
"max_depth": 1,
"verbosity": 0,
"update_alg": "gentle"
},
"min_iter": 5,
"max_iter": 300,
"trace_training_acc": "true"
}
},
"naive_bayes": {
"_note": "Naive Bayes",
"type": "naive_bayes",
"feature_prop": "1",
"verbosity": 3
}
}
```
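So, for example, passing one of these names in the algorithm parameter while leaving configuration and configurationFile empty selects the corresponding predefined configuration. The fragment below, with other required parameters such as trainingData omitted, would train the bagged boosted decision trees defined above:

```
{
    "algorithm": "bbdt"
}
```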
This section describes how you can set different weights for each example in your training set, either based upon the label or based upon a calculation over the row, to enable finer control over which examples the classifier makes the most effort to classify.
The equalizationFactor
parameter can be used to adjust an unbalanced training
set to be more balanced for training, which frequently has the effect of
requiring the classifiers to focus more on separating the positive and negative
classes rather than getting really high scores for the dominant class.
The optional weight expression in the trainingData parameter of the configuration must evaluate to a positive number that indicates how many examples the row counts for. For example, a single row with a weight of 2 has the same effect as the same row appearing twice, each time with a weight of 1.
Note that only the relative weights matter. Before the classifier is trained, the weights will be normalized so that they sum to 1 to avoid numerical issues in the classifier training process.
If the two weighting methods are combined, then the weight
expression will be
used to set the relative weight per example within its label class, and the
equalizationFactor
will adjust the relative weight of each class.
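Putting the two together, the relevant parts of a training configuration might be sketched as follows; the dataset name, the importance column and the equalizationFactor value are hypothetical, and the expression selected as weight must evaluate to a positive number for each row:

```
{
    "trainingData": "SELECT {* EXCLUDING (label, importance)} AS features, label, importance AS weight FROM my_dataset",
    "equalizationFactor": 0.5
}
```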
See also:

* The classifier.train procedure type trains a classifier.
* The classifier.test procedure type allows the accuracy of a predictor to be tested against held-out data.
* The classifier.experiment procedure type performs both training and testing of a classifier.