Revision

Back to Boosting


Informal introduction

Gradient Boosting is an ensemble model. Its goal is to combine weak learners (denoted \(h\)) into one strong learner (denoted \(F\)).

For this introduction let’s define:

\[L(F_m(X), Y)=\frac{1}{2} \Vert Y - F_m(X) \Vert_2^2 = \frac{1}{2} \sum_{j=1}^{n_{individuals}}(Y^{(j)} - F_m(X^{(j)}))^2\]

\(F_m\) is the strong learner, the sum of the results of \(m\) weak learners \((h_i)_{i \leq m}\).
Now we want to update \(F_m\) to \(F_{m+1}\) by adding a new weak learner \(h_{m+1}\) that makes the loss decrease.


A natural specification would be to find \(F_{m+1}\) such that:

\[F_{m+1}(X)=Y\]

As \(F_{m+1}(X)=F_m(X)+h_{m+1}(X)\), the problem is equivalent to finding \(h_{m+1}\) such that:

\[F_m(X)+h_{m+1}(X)=Y\]

Or:

\[h_{m+1}(X)=Y-F_m(X)\]

So in this configuration the algorithm fits the new weak learner to the residuals.
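As an illustration, here is a minimal sketch of one residual-fitting step, assuming scikit-learn's DecisionTreeRegressor as the weak learner (the data and variable names are made up for the example):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# F_0: start from the constant mean prediction
F = np.full_like(y, y.mean())

# One boosting step: fit h_{m+1} to the residuals Y - F_m(X)
residuals = y - F
h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
F_new = F + h.predict(X)

mse_before = np.mean((y - F) ** 2)
mse_after = np.mean((y - F_new) ** 2)
```

Adding the residual-fitted tree lowers the training MSE compared to the constant model.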


Coming back to the loss function, something more interesting appears. Let’s apply the loss function to \(Y\) and \(F_m(X)\):

\[L(F_m(X), Y)=\frac{1}{2}\Vert Y - F_m(X) \Vert_2^2\]

And now taking the negative gradient with respect to \(F_m(X)\):

\[\begin{eqnarray} -\frac{d}{dF_m(X)}L(F_m(X), Y) &=& -\frac{1}{2}\frac{d}{dF_m(X)}\Vert Y - F_m(X) \Vert_2^2 \\ &=& -\frac{1}{2}\left(-2(Y - F_m(X))\right) \\ &=& Y - F_m(X) \end{eqnarray}\]


So with this specific loss, fitting a new learner \(h_{m+1}\) on the residuals is equivalent to fitting this learner to the negative gradient of the current loss. This is a gradient-descent approach to the optimisation problem.
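This equivalence is easy to check numerically; the sketch below (plain NumPy, with illustrative values) compares the negative gradient of the \(\frac{1}{2}\) MSE loss with the residuals:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])
F = np.array([0.5, 2.5, 2.0])   # current predictions F_m(X)

# d/dF [1/2 * (y - F)^2] = F - y, so the negative gradient is:
neg_grad = -(F - y)

# ...which is exactly the vector of residuals Y - F_m(X)
residuals = y - F
```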


Algorithm

Fitting learners to the residuals is not necessarily what is wanted.

We now have the capacity to fit learners to the negative gradients of a given loss, which allows us to use any differentiable loss function for Gradient Boosting.

Let’s define the process more formally.


Variables definition

The goal is to find a function (a model) \(\hat{T_G}\) that best approximates the observed variables \(Y\) given the explanatory variables \(X\). The quality of the approximation is measured using a loss function \(L\).

So:

\[\hat{T_G}=\arg\min_{T_G}L(Y, T_G(X))\]

In the case of Gradient Boosting, \(\hat{T_G}\) is a weighted sum of weak learners where the weak learners are CART Decision Trees:


\[\hat{T_G}=\sum_{k=1}^{n_{trees}}w_k t_k\]

Where:

- \(t_k\) is the \(k\)-th weak learner, a CART Decision Tree
- \(w_k\) is the weight associated with \(t_k\)


Initialization

Now let’s define \(T_0=t_0\), the first strong learner, ensemble of 1 weak learner:

\[T_0=\arg\min_{t}L(Y, t(X))\]

Where:

- \(t\) ranges over the candidate weak learners (single CART Decision Trees)

Now the recursion is the following:


Recursion

The recursion from \(T_{k-1}\) to \(T_k\) is as follows:

\[T_k=T_{k-1}+t_k\]

Where:

- \(t_k\) is a new weak learner chosen to decrease the loss \(L(Y, T_k(X))\)


Trick using Taylor expansion

An approximation using a first-order Taylor expansion accelerates the estimation of \(L(Y, T_{k-1}(X)+t(X))\) (treating \(t(X)\) as a small step):


\[L(Y, T_{k-1}(X)+t(X))=L(Y, T_{k-1}(X))+t(X)\frac{d}{dT_{k-1}(X)}L(Y, T_{k-1}(X)) + o(t(X))\]


Finding the best \(t(X)\) can be done by finding the weak learner that minimizes the Taylor expansion of \(L(Y, T_{k-1}(X)+t(X))\). This reformulation makes it easy to compute the gradient with respect to \(t(X)\).


Finding the argmin of the Taylor expansion of \(L(Y, T_{k-1}(X)+t(X))\) can be done using gradient descent with respect to \(t(X)\).

The derivative wrt \(t(X)\) is:

\[\frac{d}{dt(X)}\left[L(Y, T_{k-1}(X))+t(X)\frac{d}{dT_{k-1}(X)}L(Y, T_{k-1}(X))\right]=\frac{d}{dT_{k-1}(X)}L(Y, T_{k-1}(X))\]

Thus, the learner \(t(X)\) that produces the best step is the one that best fits the negative gradient \(-\frac{d}{dT_{k-1}(X)}L(Y, T_{k-1}(X))\).

Going back to the \(\frac{1}{2}\) MSE loss, the learner \(t\) that optimizes the step is the one that best fits the residuals \(Y - T_{k-1}(X)\).


Update of the model

Using a hyperparameter \(\alpha\) for the learning rate (and recalling that \(t_k\) approximates the negative gradient), we obtain:

\[T_k=T_{k-1}-\alpha\frac{d}{dT_{k-1}(X)}L(Y, T_{k-1}(X))\]
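Putting the pieces together, the whole algorithm can be sketched as below, again assuming scikit-learn's DecisionTreeRegressor as the weak learner; the `neg_gradient` argument is a hypothetical callable introduced here so any differentiable loss can plug in:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, neg_gradient, n_trees=50, alpha=0.1, max_depth=2):
    """Generic gradient boosting: each tree t_k is fitted to the negative
    gradient of the loss at the current prediction T_{k-1}(X)."""
    prediction = np.full(len(y), y.mean())   # T_0: constant initial model
    trees = []
    for _ in range(n_trees):
        t_k = DecisionTreeRegressor(max_depth=max_depth)
        t_k.fit(X, neg_gradient(y, prediction))
        prediction += alpha * t_k.predict(X)  # T_k = T_{k-1} + alpha * t_k
        trees.append(t_k)
    return prediction, trees

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# With the 1/2 MSE loss the negative gradient is the residual y - T(X)
pred, _ = gradient_boost(X, y, lambda y, p: y - p)
final_mse = np.mean((y - pred) ** 2)
```

Swapping the lambda for the negative gradient of another loss changes the targets each tree is fitted to, without touching the rest of the loop.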


Different losses used in Gradient Boosting

See the LightGBM page listing the supported losses.


LogitBoost

LogitBoost is gradient boosting applied to the log loss (sometimes called deviance loss or logistic loss):

\[L(h(X), Y)=-\left[Y \log(h(X)) + (1-Y) \log(1-h(X))\right]\]

which is the negative logarithm of the likelihood:

\[h(X)^Y (1-h(X))^{(1-Y)}\]
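If the model outputs a raw score \(f\) turned into a probability through a sigmoid, the negative gradient of the log loss with respect to \(f\) reduces to \(Y - \sigma(f)\). A small NumPy check with illustrative values:

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def log_loss(f, y):
    p = sigmoid(f)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([0.0, 1.0, 1.0])
f = np.array([0.3, -0.2, 1.5])    # raw scores, probability p = sigmoid(f)

# Closed form: the negative gradient w.r.t. the raw score f is y - sigmoid(f)
neg_grad = y - sigmoid(f)

# Compare against a central-difference numerical derivative of the loss
eps = 1e-6
num_grad = (log_loss(f + eps, y) - log_loss(f - eps, y)) / (2 * eps)
```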

See:


\(L_2\) boosting

\(L_2\) boosting is gradient boosting applied to the \(L_2\) loss, also known as the mean squared error (MSE):

\[L(h(X), Y)=\frac{1}{2}(Y-h(X))^2\]

And:

\[\frac{d}{dh(X)}L(h(X), Y)=h(X)-Y\]

Hence, each learner \(t_k\) is trained on the residuals \(Y - T_{k-1}(X)\), where \(T_{k-1}\) is the strong learner, ensemble of the first \(k-1\) trees.


Resources

See: