Hyperparameter tuning

Most models have hyperparameters.

Hyperparameters are parameters that are not estimated from the data during training but are set by the user of the model beforehand.

Examples of hyperparameters

Number of layers and number of neurons per layer in a neural network,
Regularization parameter in a penalized regression,
Maximum depth of a decision tree,
Splitting criterion of a decision tree,
Number of trees in a random forest,
Percentage of data in each decision tree of a random forest,
Maximum number of trees in a Gradient Boosting,
Learning rate in a Gradient Boosting,
Regularization parameter in an SVM,
Kernel of an SVM,
Number of clusters in K-means,
Number of dimensions in PCA,
Minimum distance to be considered neighbours in DBSCAN,
Minimum number of neighbours to be considered a core sample in DBSCAN,
Distance metric and splitting criterion in Hierarchical Clustering,
Maximum distance in Hierarchical Clustering,
…

How to choose the right hyperparameter value?

Choosing the right hyperparameter values is a difficult task, all the more so when there are many hyperparameters, since the number of possible models grows with each one.

A good understanding of the model, the input data and the output data helps to choose the hyperparameters wisely.

Another solution is to train the model with different combinations of hyperparameters, evaluate each of them on a validation set and keep the combination giving the best results. See Cross validation for more information on train, validation and test sets.

The number of combinations to test is chosen according to the time and computing power available, and the combinations themselves can be generated using two different methods.
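The selection procedure described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the toy data, the one-dimensional ridge regression and the list of candidate values are all assumptions made up for the example.

```python
# Toy data invented for illustration: y is roughly 2 * x.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
valid = [(1.5, 3.0), (2.5, 5.1), (3.5, 6.9)]

def fit_ridge_1d(data, lam):
    """Closed-form 1-D ridge regression without intercept:
    w = sum(x * y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

def mse(data, w):
    """Mean squared error of the prediction w * x on a data set."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)

# Candidate values of the hyperparameter (the regularization parameter).
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]

# Train on the training set, score on the validation set.
scores = {lam: mse(valid, fit_ridge_1d(train, lam)) for lam in candidates}

# Keep the value with the lowest validation error.
best_lam = min(scores, key=scores.get)
print(best_lam, scores[best_lam])
```

The model is fitted only on the training set; the validation set is used solely to compare the candidate values.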

Grid search

Grid search defines, for each hyperparameter, a set of values to try and then tests all possible combinations of these hyperparameter values.

For example for a Gradient Boosting algorithm, if I want to try:

Maximum depths of each decision tree: \([3, 9, 15]\),
Maximum number of trees: \([100, 1000, 2000, 10000]\),
Learning rate: \([10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10]\).

Then grid search will test all the possible combinations of these hyperparameters (i.e. \(3 \times 4 \times 6 = 72\) combinations).
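As a quick check of this count, the grid can be enumerated with the standard library. The dictionary keys `max_depth`, `n_trees` and `learning_rate` are illustrative names, not the parameter names of any particular library.

```python
from itertools import product

max_depths = [3, 9, 15]
n_trees = [100, 1000, 2000, 10000]
learning_rates = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]

# Cartesian product of the three value sets: one dict per combination.
grid = [
    {"max_depth": d, "n_trees": n, "learning_rate": lr}
    for d, n, lr in product(max_depths, n_trees, learning_rates)
]

print(len(grid))  # 3 * 4 * 6 = 72
```

Each element of `grid` is one model to train and evaluate, so the cost of grid search grows multiplicatively with every hyperparameter added.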

Random search

Random search tests a given number of combinations, randomly drawn from specified probability distributions.

For example for a Gradient Boosting algorithm, if I want to try:

Maximum depths of each decision tree: \(\mathcal{U}(3, 15)\),
Maximum number of trees: \(\mathcal{R}(10^2, 10^4)\),
Learning rate: \(\mathcal{R}(10^{-4}, 10)\).

Where:

\(\mathcal{U}\) is the discrete uniform distribution,
\(\mathcal{R}\) is the reciprocal (log-uniform) distribution.

Then random search will test \(n\) (\(n\) being chosen by the user) randomly drawn combinations of these hyperparameters.
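The draws described above can be sketched with the standard library. This is only an illustration under a few assumptions: the key names, the seed, the value \(n = 20\) and the rounding of the number of trees to an integer are choices made for the example. The log-uniform draw is obtained by sampling the exponent uniformly.

```python
import math
import random

rng = random.Random(0)  # fixed seed so that the draws are reproducible

def log_uniform(low, high):
    """Draw from the reciprocal (log-uniform) distribution on [low, high]
    by sampling the base-10 exponent uniformly."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

def draw_combination():
    return {
        "max_depth": rng.randint(3, 15),          # discrete uniform, both ends included
        "n_trees": round(log_uniform(1e2, 1e4)),  # rounded to a whole number of trees
        "learning_rate": log_uniform(1e-4, 10),
    }

n = 20  # number of combinations to test, chosen by the user
combinations = [draw_combination() for _ in range(n)]
```

Sampling on a log scale spreads the trials evenly across orders of magnitude, which suits hyperparameters such as the learning rate whose plausible values span several decades.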

Comparison of Grid search and Random search

In general, random search is preferable: for the same number of trials, it tests more distinct values of each hyperparameter.

Here is a visual illustration of this:

Core illustration from Random Search for Hyper-Parameter Optimization by Bergstra and Bengio. It is very often the case that some of the hyperparameters matter much more than others (e.g. top hyperparam vs. left one in this figure). Performing random search rather than grid search allows you to much more precisely discover good values for the important ones. (text from CS231n course).
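The same point can be checked numerically on a toy case. The sketch below assumes two hyperparameters taking values in \([0, 1]\) and a budget of 9 trials; the grid values and the seed are arbitrary choices for the example.

```python
import random

rng = random.Random(0)  # fixed seed for reproducibility
budget = 9              # same number of trials for both methods

# Grid search: 3 values per hyperparameter, 3 * 3 = 9 trials.
grid_values = [0.1, 0.5, 0.9]
grid_trials = [(a, b) for a in grid_values for b in grid_values]

# Random search: 9 trials drawn uniformly in the unit square.
random_trials = [(rng.random(), rng.random()) for _ in range(budget)]

# Number of distinct values tried for the first hyperparameter.
grid_distinct = len({a for a, _ in grid_trials})
random_distinct = len({a for a, _ in random_trials})
print(grid_distinct, random_distinct)
```

With the same budget, the grid only ever tries 3 distinct values of the first hyperparameter, whereas every random trial brings a new one. If that hyperparameter is the one that matters, random search explores it far more finely.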

Resources

See: