A decision tree is a classification or regression model based on a tree structure with input X and output Y. In a classification decision tree, Y is a categorical variable and each leaf corresponds to a given class. In a regression decision tree, Y is a continuous variable and the goal is to obtain leaves with minimum intra-leaf variance.
At each step, the algorithm chooses the explanatory variable that best splits the data (we will see later how to measure that).
Different types of decision trees exist depending on the split criterion.
A categorical output for a classification problem is the natural application of a decision tree. For a given leaf, the associated value is the most frequent class in the leaf (the most frequent class during training).
For a continuous output, the value associated with a given leaf is the mean of the values in the leaf (the mean of the training values).
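As a minimal sketch (not tied to any particular library’s implementation), the leaf values could be computed like this:

```python
import numpy as np

def leaf_value_classification(y_leaf):
    """Most frequent class among the training samples falling in the leaf."""
    classes, counts = np.unique(y_leaf, return_counts=True)
    return classes[np.argmax(counts)]

def leaf_value_regression(y_leaf):
    """Mean of the training target values falling in the leaf."""
    return float(np.mean(y_leaf))

print(leaf_value_classification(np.array(["A", "A", "B"])))  # -> 'A'
print(leaf_value_regression(np.array([1.0, 2.0, 4.0])))      # -> 2.333...
```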
A multi-way split can always be reproduced by a series of binary splits.
The majority of implementations use binary splits, but some implementations (in R, for example, or in LightGBM) use multi-way splits (see Resource 1).
An argument against multi-way splits is that the data is fragmented too quickly, leaving insufficient data at the next level (see Resource 2).
See:
If no stopping criterion is specified, the algorithm will split the data until it obtains a perfect tree (each sample is correctly labeled for a categorical output, or each sample is mapped to its true value for a continuous output). Also note that a single explanatory variable may be reused multiple times (thus, even with a single explanatory variable it is possible to perfectly match the true labels).
Different stopping criteria exist:
The method of setting an early stopping criterion is known as pre-pruning. Post-pruning is another technique that reduces the size of the tree after it has been built (see the part on Pruning at the end).
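For example, in Scikit-Learn these pre-pruning criteria are exposed as hyperparameters of the tree estimators (the values below are arbitrary, only to illustrate):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: the tree stops growing as soon as one of these criteria is met.
tree = DecisionTreeClassifier(
    max_depth=5,                # maximum depth of the tree
    min_samples_split=20,       # minimum number of samples needed to split a node
    min_samples_leaf=10,        # minimum number of samples required in each leaf
    min_impurity_decrease=1e-3, # minimum impurity decrease required to accept a split
)
```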
Depending on the feature’s type, the splitting may be more or less complicated. Indeed, the algorithm chooses the explanatory variable that best splits the data, but for continuous or categorical (with more than 2 categories) variables, several possible splits exist.
For example, if the variable is the age, should the algorithm split into older than 50 and younger than 50, or older than 30 and younger than 30? For a categorical variable with unordered categories 1, 2, 3, 4, should the algorithm split 1 vs 2, 3, 4, or 1, 2 vs 3, 4, or 1, 3 vs 2, 4, etc.?
For a binary explanatory variable, the split decision is easy: the algorithm splits the data into two groups based on the selected feature’s value.
For continuous explanatory variables, the split may be done anywhere in the space of this variable. Theoretically, the number of possible splits is thus infinite. Two different methods exist to deal with this.
This number of possible splits can be reduced to the number of distinct values, since splitting anywhere between two consecutive values gives the same result. For example, if the values are 1.2, 2.3, 2.6 and 3.6, splitting at 3 or at 3.1 leads to the same result.
Hence, a pre-sort-based algorithm sorts the continuous variable and then chooses the best split among the \(n_{value}-1\) possibilities.
Sorting a feature costs \(O(n\log(n))\) and then each of the \(n-1\) splits is tested.
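A minimal sketch of the pre-sort idea (only the enumeration of candidate thresholds is shown; scoring each candidate with an impurity measure is covered later):

```python
import numpy as np

def candidate_thresholds(x):
    """Midpoints between consecutive distinct values of a continuous feature."""
    values = np.unique(x)                  # sorted distinct values, O(n log n)
    return (values[:-1] + values[1:]) / 2  # n_value - 1 candidate thresholds

print(candidate_thresholds(np.array([1.2, 2.3, 2.6, 3.6])))  # -> [1.75 2.45 3.1]
```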
A histogram-based algorithm buckets the continuous variable into bins and treats them as sorted categorical values. Bucketing a feature costs \(O(n)\) and then \(n_{bins}\) splits are tested.
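A sketch of the bucketing step using NumPy (the number of bins is an arbitrary choice here):

```python
import numpy as np

def bucketize(x, n_bins=255):
    """Map a continuous feature to integer bin indices; candidate splits are the bin edges."""
    edges = np.histogram_bin_edges(x, bins=n_bins)  # one O(n) pass over the data
    return np.digitize(x, edges[1:-1]), edges       # bin index (0 .. n_bins-1) per sample

bin_ids, edges = bucketize(np.random.rand(1000), n_bins=10)
```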
Unordered categorical variables are a little more complex to deal with, as \(c\) categories can be binary split in \(2^{c-1}-1\) different ways. The solution is to order the categories and apply the same method as for continuous variables.
The ordering is done using the output variable (see below; a small sketch is given after the list).
If the output variable is a binary category, the ordering is done using the proportion of output class 1 (or 0) in each of the feature’s categories.
If the output variable is continuous, the ordering is done using the mean output value in each of the feature’s categories.
If the output variable is categorical with more than two classes, different methods exist, listed on this Matlab page.
One-vs-all may be used: each output class is tried as the ‘one’ and the feature’s categories are then sorted using the proportion of this class (as for a binary output),
Pull left by purity: 1) start with all the feature’s categories in the right branch; 2) inspect the \(K\) categories that have the largest class probabilities for each output class; 3) move the category with the maximum value of the split criterion (we will see the split criteria later) to the left branch; 4) continue moving categories from right to left, recording the split criterion at each move, until the right child has only one category remaining; 5) choose the split that gave the best criterion value during the sequence.
Principal Component-Based Partitioning: see Matlab page
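A minimal sketch of the ordering trick for the two simpler cases (binary output encoded as 0/1, or continuous output); the feature categories and data below are made up:

```python
import numpy as np
import pandas as pd

def order_categories(x_cat, y):
    """Order a feature's categories by the mean of the output:
    the proportion of class 1 for a 0/1 output, the mean target for a continuous output."""
    means = pd.Series(y).groupby(pd.Series(x_cat)).mean()
    return list(means.sort_values().index)  # categories, from lowest to highest mean

x = np.array(["red", "blue", "red", "green", "blue", "green"])
y = np.array([1, 0, 0, 1, 0, 1])   # binary output
print(order_categories(x, y))      # -> ['blue', 'red', 'green']
```

The ordered categories can then be split exactly like a sorted continuous variable.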
See:
Depending on the output data type (continuous or categorical), different split measures exist.
For categorical outputs, different split criteria exist that give similar but not identical results. See this graph for an overview:
Let’s now look in more detail at these three metrics and their associated algorithms.
CART (for Classification And Regression Tree) is an algorithm based on a split criterion called Gini impurity.
Assume that, at a given step, we have \(G\) different groups. Let \(g\) be one of the groups in \(G\) and let \(n_{class}\) be the total number of classes (some classes may be present in \(g\) and others not). The Gini impurity measure for the group \(g\) is then:
\[I^{Gini}(g)=\sum_{i=1}^{n_{class}} p(i \vert g)(1-p(i \vert g))=\sum_{i=1}^{n_{class}}(p(i \vert g)-p(i \vert g)^2)=\sum_{i=1}^{n_{class}} p(i \vert g)-\sum_{i=1}^{n_{class}} p(i \vert g)^2=1-\sum_{i=1}^{n_{class}} p(i \vert g)^2\]Where:
To compare different splittings (using different features) we compare the weighted sum of the Gini impurity measure over all the obtained groups.
Assume a sample of data where \(Y \in \{A, B\}\) with 10 A and 10 B. X are the explanatory variables.
Let’s compare the total Gini impurity measure obtained splitting via 2 different features:
The total Gini impurity measure splitting using feature 1 is :
\[\frac{10}{20}\left[1-\left(\frac{9}{10}\right)^2-\left(\frac{1}{10}\right)^2\right]+\frac{10}{20}\left[1-\left(\frac{1}{10}\right)^2-\left(\frac{9}{10}\right)^2\right]=0.09+0.09=0.18\]The total Gini impurity measure splitting using feature 2 is :
\[\frac{12}{20}\left[1-\left(\frac{10}{12}\right)^2-\left(\frac{2}{12}\right)^2\right]+\frac{8}{20}\left[1-\left(\frac{0}{8}\right)^2-\left(\frac{8}{8}\right)^2\right]=0.17+0=0.17\]So given the Gini impurity measure, the split using feature 2 is slightly better (lower is better): it is rewarded for producing a pure group.
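These numbers can be checked with a short sketch (groups are written as lists of class counts):

```python
import numpy as np

def gini(counts):
    """Gini impurity of a group, given its class counts."""
    p = np.asarray(counts) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def weighted_gini(groups):
    """Weighted Gini impurity of a split: one list of class counts per group."""
    n_total = sum(sum(g) for g in groups)
    return sum(sum(g) / n_total * gini(g) for g in groups)

print(weighted_gini([[9, 1], [1, 9]]))   # feature 1 -> 0.18
print(weighted_gini([[10, 2], [0, 8]]))  # feature 2 -> 0.1666...
```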
Entropy impurity is a measure similar to the Gini impurity measure, used by the ID3 algorithm and its successor C4.5.
Once again, assume that, at a given step, we have \(G\) different groups. Let \(g\) be one of the groups in \(G\) and let \(n_{class}\) be the total number of classes (some classes may be present in \(g\) and others not). The entropy impurity measure for the group \(g\) is then:
\[I^{Entropy}(g)=-\sum_{i=1}^{n_{class}}p(i \vert g) \log_2 (p(i \vert g))\]Where:
To compare different splittings (using different features) we compare the weighted sum of the Entropy impurity measure over all the obtained groups.
Information gain is the difference between the entropy impurity of the parent group and the weighted sum of the entropy impurities of the child groups: choosing the split that maximises the information gain is equivalent to choosing the split that minimises the weighted entropy of the children.
Assume a sample of data where \(Y \in \{A, B\}\) with 10 A and 10 B. X are the explanatory variables.
Let’s compare the total Entropy impurity measure obtained splitting via 2 different features:
The total Entropy impurity measure splitting using feature 1 is :
\[-\frac{10}{20} \left[\frac{9}{10}\log_2\left(\frac{9}{10}\right) + \frac{1}{10}\log_2\left(\frac{1}{10}\right)\right] -\frac{10}{20} \left[\frac{1}{10}\log_2\left(\frac{1}{10}\right)+\frac{9}{10}\log_2\left(\frac{9}{10}\right)\right]=0.2345+0.2345=0.469\]The total Entropy impurity measure splitting using feature 2 is :
\[-\frac{12}{20} \left[\frac{10}{12}\log_2\left(\frac{10}{12}\right) + \frac{2}{12}\log_2\left(\frac{2}{12}\right)\right] -\frac{8}{20} \left[\frac{0}{8}\log_2\left(\frac{0}{8}\right)+\frac{8}{8}\log_2\left(\frac{8}{8}\right)\right]=0.39+0=0.39\](using the convention \(0 \log_2 0 = 0\)). So given the Entropy impurity measure, the split using feature 2 is better (lower is better).
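The same check for the entropy criterion, also computing the information gain relative to the parent group (10 A, 10 B):

```python
import numpy as np

def entropy(counts):
    """Entropy impurity (base 2) of a group, given its class counts."""
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]                      # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

def weighted_entropy(groups):
    n_total = sum(sum(g) for g in groups)
    return sum(sum(g) / n_total * entropy(g) for g in groups)

print(weighted_entropy([[9, 1], [1, 9]]))   # feature 1 -> 0.469
print(weighted_entropy([[10, 2], [0, 8]]))  # feature 2 -> 0.390

# Information gain = entropy of the parent group - weighted entropy of the children
parent = entropy([10, 10])                           # -> 1.0
print(parent - weighted_entropy([[9, 1], [1, 9]]))   # feature 1 -> 0.531
print(parent - weighted_entropy([[10, 2], [0, 8]]))  # feature 2 -> 0.610
```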
Another possible measure is the classification error: all the elements of a group are assigned to the majority class of the group.
Once again, assume that, at a given step, we have \(G\) different groups. Let \(g\) be one of the groups in \(G\) and let \(n_{class}\) be the total number of classes (some classes may be present in \(g\) and others not). The classification error measure for the group \(g\) is then:
\[I^{ClassErr}(g) = 1 - \max_i(p(i \vert g))\]Where:
To compare different splittings (using different features) we compare the weighted sum of the Classification error measure over all the obtained groups.
Assume a sample of data where \(Y \in \{A, B\}\) with 10 A and 10 B. X are the explanatory variables.
Let’s compare the total Classification Error measure obtained splitting via 2 different features:
The total Classification Error measure splitting using feature 1 is :
\[\frac{10}{20}\left[1-\max\left(\frac{9}{10}, \frac{1}{10}\right)\right]+\frac{10}{20}\left[1-\max\left(\frac{1}{10}, \frac{9}{10}\right)\right]=0.05+0.05=0.1\]The total Classification Error measure splitting using feature 2 is :
\[\frac{12}{20}\left[1-\max\left(\frac{10}{12}, \frac{2}{12}\right)\right]+\frac{8}{20}\left[1-\max\left(\frac{0}{8}, \frac{8}{8}\right)\right]=0.1+0=0.1\]So given the Classification Error measure, the two splits are equivalent (both give 0.1): unlike Gini and entropy, this measure does not reward feature 2 for producing a pure group.
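And the same check for the classification error:

```python
import numpy as np

def class_error(counts):
    """Classification error of a group: 1 minus the proportion of the majority class."""
    counts = np.asarray(counts)
    return 1.0 - np.max(counts) / np.sum(counts)

def weighted_class_error(groups):
    n_total = sum(sum(g) for g in groups)
    return sum(sum(g) / n_total * class_error(g) for g in groups)

print(weighted_class_error([[9, 1], [1, 9]]))   # feature 1 -> 0.10
print(weighted_class_error([[10, 2], [0, 8]]))  # feature 2 -> 0.10
```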
MARS (Multivariate Adaptive Regression Splines) is another algorithm, closely related to decision trees, designed for regression problems.
See:
Pruning is a set of methods to limit or reduce the size of a decision tree.
See the part Stopping criteria: Pre-pruning.
Top-down pruning starts at the root of the tree and decides whether to cut one of the two subtrees depending on a criterion.
See:
Bottom-up pruning starts at the leaves and decides whether or not to cut them depending on a criterion.
Reduced error pruning requires a validation set and works as follows:
Information gain pruning cuts the leaf with the smallest information gain. Leaves are cut until the desired number of leaves is reached (this number of leaves is then a hyperparameter).
Cost complexity pruning is the method implemented in Scikit-Learn. It recursively cuts subtrees of the tree based on a criterion. The goal of cost complexity pruning is to find the subtrees to cut that best trade off the size of the tree against the error rate.
The algorithm generates a series of trees \(T_0\), \(T_1\), \(T_2\), …, \(T_m\) where \(T_0\) is the initial tree and \(T_m\) is the root alone. At each step \(i\), the subtree \(t\) that minimises the following criterion is chosen to be cut:
\[\min_{t \in Subtree(T_{i-1})} \text{ } \frac{Error(prune(T_{i-1}, t))-Error(T_{i-1})}{\alpha(leaves(T_{i-1})-leaves(prune(T_{i-1}, t)))}\]Where:
At the end, the output decision tree is the tree from the series \(T_0\), \(T_1\), …, \(T_m\) that minimises the penalised loss (the error rate plus \(\alpha\) times the number of leaves), evaluated on the training set or, better, on a validation set. A high \(\alpha\) will lead to a very shallow decision tree.
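In Scikit-Learn, for example, the pruning strength is controlled by the ccp_alpha parameter, and the candidate \(\alpha\) values can be extracted from the training data; the dataset and the use of a simple hold-out validation set below are only illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Effective alphas of the pruning sequence T_0, T_1, ..., T_m
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Fit one pruned tree per alpha and keep the one with the best validation accuracy
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
         for a in path.ccp_alphas]
best = max(trees, key=lambda t: t.score(X_val, y_val))
print(best.get_n_leaves(), best.score(X_val, y_val))
```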
See:
See:
See: