Road to ML Engineer #10 - Gradient Boost

Last Edited: 8/13/2024

This blog post introduces how gradient boost works in machine learning.


While there are many mechanisms, such as pruning, to prevent trees from overfitting, decision trees still tend to struggle with high-dimensional and complex data. This happens because the trees inevitably need to become more complex in order to model such data accurately, without underfitting.

Bagging & Boosting

How do we maintain the simplicity of the model while ensuring it can fit well on high-dimensional and complex data? One solution to this problem is bagging, where we build many simple trees and combine their results to make a prediction. The most intuitive way to do this is by taking random samples and random features to train many trees, and then making predictions based on a majority vote. This is called Random Forest, which is a surprisingly effective technique (I won’t cover it here).

(Figure: Bagging vs. Boosting)

Another method of combining small trees is to create a chain of small trees, where each tree tries to correct the errors made by the previous ones. This technique is called boosting, and it turns out to be quite effective in learning highly complex data. In this article, I would like to dive deep into boosting by looking at an example known as Gradient Boost.

Gradient Boost

The algorithm for gradient boost can be defined as follows:

Input: Data $\{(x_i, y_i)\}_{i=1}^{n}$ and a differentiable loss function $L(y, F(x))$.

Step 1: Initialize the model with a constant value: $F_0(x) = \argmin_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$

Step 2: for $m = 1$ to $M$:

  • Compute $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$ for $i = 1, \dots, n$.
  • Fit a tree to the $r_{im}$ values and create terminal regions $R_{jm}$ for $j = 1, \dots, J_m$.
  • For $j = 1, \dots, J_m$, compute $\gamma_{jm} = \argmin_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$.
  • Update $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.

In this context, $x_i$ and $y_i$ represent the explanatory variables and the target variable for each data point $i$. We start with a constant initial prediction $F_0(x)$ and generate $M$ trees, each fit to the residuals of the previous prediction, $r_{im}$. (The mechanism for fitting a tree is covered in a previous article on decision trees.)

The trees are chained together by adding the output values of each tree, scaled by the learning rate $\nu$, to the previous prediction. Typically, $M$ is set to around 100, $\nu$ is set to a value between 0 and 1, and the trees are constructed with 8 to 32 leaf nodes. It has been found empirically that gradient boost works best when many small trees are combined in this manner.

You might notice that the algorithm is similar to gradient descent, in that we gradually update the model by taking the gradient (the residual between the true and predicted values) and multiplying it by a learning rate. It turns out that this mechanism also works well with trees.
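
To make the parallel concrete, here is a toy sketch (made-up numbers, half-squared-error loss) that places a gradient descent update on a single parameter next to the boosting update on predictions. In real gradient boosting, a tree approximates the residual instead of using it directly.

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0])          # made-up targets

# Gradient descent on a parameter: theta <- theta - lr * dL/dtheta
theta, lr = 0.0, 0.1
for _ in range(100):
    grad = np.mean(theta - y)          # dL/dtheta for L = 1/2 * mean((y - theta)^2)
    theta = theta - lr * grad          # step against the gradient

# Boosting on predictions: F <- F + nu * (negative gradient of the loss w.r.t. F)
F, nu = np.full_like(y, y.mean()), 0.1 # F_0(x): constant initial prediction
for _ in range(100):
    residual = y - F                   # negative gradient; a tree would approximate this
    F = F + nu * residual              # F_m = F_{m-1} + nu * correction

print(theta, F)                        # theta -> mean(y); F -> y, one small step at a time
```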

Regression

Let’s see how gradient boost works in regression. While the algorithm may seem complex, it is surprisingly simple to apply to regression problems. First, let’s set the loss function to be half of the squared residuals.

$$
L(y, F(x)) = \frac{1}{2} (y - F(x))^2
$$

We use half of the squared residuals because the constant 2 cancels out when taking the derivative to find the $\gamma$ that minimizes the loss. Now, let’s move to step 1, where we set the initial prediction $F_0(x)$ such that it satisfies the following:

$$
F_0(x) = \argmin_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)
$$

We can take the derivative of the above with respect to $\gamma$, set it equal to 0, and solve for the $\gamma$ that minimizes the sum of losses.

$$
\frac{\partial}{\partial \gamma} L(y, \gamma) = -(y - \gamma) \\
\frac{\partial}{\partial \gamma} \sum_{i=1}^{n} L(y_i, \gamma) = \sum_{i=1}^{n} (\gamma - y_i) = n\gamma - \sum_{i=1}^{n} y_i \\
n\gamma - \sum_{i=1}^{n} y_i = 0 \\
\gamma = \frac{\sum_{i=1}^{n} y_i}{n}
$$

Hence, the initial prediction $F_0(x)$ is just the average of the target variable:

$$
F_0(x) = \frac{\sum_{i=1}^{n} y_i}{n} \; (= \bar{y})
$$
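
As a quick sanity check, a brute-force search over candidate constants (a toy sketch with made-up numbers) lands on the same answer as the closed-form mean:

```python
import numpy as np

y = np.array([2.0, 4.0, 9.0])                      # made-up targets
candidates = np.linspace(0, 10, 1001)              # candidate constant predictions gamma
losses = [0.5 * np.sum((y - g) ** 2) for g in candidates]
best = candidates[int(np.argmin(losses))]
print(best, y.mean())                              # both are 5.0
```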

Now, we are ready to move to step 2. First, we need to compute the residuals $r_{im}$:

$$
r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} = -(F_{m-1}(x_i) - y_i) = y_i - F_{m-1}(x_i)
$$

We can fit a new tree $m$ to the residuals computed above. Then, for each leaf node, we compute $\gamma$ using the following:

$$
\gamma_{jm} = \argmin_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)
$$

We can take the derivative of the sum of losses with respect to $\gamma$ to find the $\gamma$ that minimizes the sum of losses:

$$
L(y, F_{m-1}(x) + \gamma) = \frac{1}{2} \big(y - (F_{m-1}(x) + \gamma)\big)^2 \\
\frac{\partial}{\partial \gamma} L = (F_{m-1}(x) + \gamma) - y \\
\frac{\partial}{\partial \gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma) = \sum_{x_i \in R_{jm}} \big((F_{m-1}(x_i) + \gamma) - y_i\big) \\
= \gamma \sum_{i=1}^{n} I(x_i \in R_{jm}) + \sum_{x_i \in R_{jm}} \big(F_{m-1}(x_i) - y_i\big) \\
\gamma_{jm} \sum_{i=1}^{n} I(x_i \in R_{jm}) + \sum_{x_i \in R_{jm}} \big(F_{m-1}(x_i) - y_i\big) = 0 \\
\gamma_{jm} = \frac{\sum_{x_i \in R_{jm}} \big(y_i - F_{m-1}(x_i)\big)}{\sum_{i=1}^{n} I(x_i \in R_{jm})}
$$

From the above, we can see that $\gamma_{jm}$ is simply the average residual in $R_{jm}$. We use these $\gamma$s as the output values of the leaf nodes in each tree. Then, the output of the new tree is multiplied by the learning rate $\nu$ and added to the previous prediction to create the new $F_m(x)$.

Thus, the steps can be rewritten as follows:

Input: Data $\{(x_i, y_i)\}_{i=1}^{n}$ and a differentiable loss function: half of the squared residuals.

Step 1: Initialize the model with a constant value, $F_0(x)$, as the average value of $y$.

Step 2: for $m = 1$ to $M$:

  • Compute the pseudo-residuals $r_{im} = y_i - F_{m-1}(x_i)$ for $i = 1, \dots, n$.
  • Fit a tree to the $r_{im}$ values and create terminal regions $R_{jm}$ for $j = 1, \dots, J_m$.
  • For $j = 1, \dots, J_m$, compute $\gamma_{jm}$ as the average residual in $R_{jm}$.
  • Update $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.
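
To tie the rewritten steps together, here is a minimal from-scratch sketch for regression. The function names are my own, and a `DecisionTreeRegressor` trained with squared error already stores the mean residual in each leaf, so its raw predictions coincide with the $\gamma_{jm}$ above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boost_regressor(X, y, M=100, nu=0.1, max_leaf_nodes=16):
    """Sketch of the regression steps above: F_0 = mean(y), then M small
    trees fit to the residuals, each scaled by the learning rate nu."""
    F0 = y.mean()                                   # Step 1: constant prediction
    F = np.full(len(y), F0)
    trees = []
    for m in range(M):                              # Step 2
        r = y - F                                   # pseudo-residuals r_im
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, r)                              # leaf value = mean residual = gamma_jm
        F = F + nu * tree.predict(X)                # F_m = F_{m-1} + nu * gamma_jm
        trees.append(tree)
    return F0, trees

def predict_gradient_boost_regressor(X, F0, trees, nu=0.1):
    # nu must match the value used during fitting
    return F0 + nu * sum(tree.predict(X) for tree in trees)

# Toy usage with made-up data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
F0, trees = fit_gradient_boost_regressor(X, y)
print(predict_gradient_boost_regressor(X[:5], F0, trees))
```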

Binary Classification

The strategy for gradient boost in binary classification is similar to that in regression, but there are a few important details to note. Just like in logistic regression, the model predicts the log-odds and is trained by minimizing the negative log-likelihood:

$$
p = \frac{e^{F(x)}}{1 + e^{F(x)}} \\
F(x) = \log\left(\frac{p}{1-p}\right) \\
L(y_i, F(x)) = -\big(y_i \log(p) + (1 - y_i)\log(1-p)\big)
$$

Let’s express the negative log-likelihood in terms of $F(x)$, the predicted log-odds, instead of $p$:

$$
L(y_i, F(x)) = -\big(y_i \log(p) + (1 - y_i)\log(1-p)\big) \\
= -\big(y_i (\log(p) - \log(1-p)) + \log(1-p)\big) \\
= -\big(y_i F(x) + \log(1-p)\big) \\
= -\left(y_i F(x) + \log\left(1 - \frac{e^{F(x)}}{1 + e^{F(x)}}\right)\right) \\
= -\left(y_i F(x) + \log\left(\frac{1}{1 + e^{F(x)}}\right)\right) \\
= -\big(y_i F(x) - \log(1 + e^{F(x)})\big) \\
= -y_i F(x) + \log(1 + e^{F(x)})
$$

For step 1, we need to take the derivative of the loss function $L(y_i, \gamma)$ with respect to $\gamma$:

$$
L(y_i, \gamma) = -y_i \gamma + \log(1 + e^{\gamma}) \\
\frac{\partial}{\partial \gamma} L(y_i, \gamma) = -y_i + \frac{e^{\gamma}}{1 + e^{\gamma}} = -y_i + p_{\gamma}
$$

$p_\gamma$ is the probability associated with the log-odds $\gamma$. Then, we want to minimize the sum of losses as follows:

$$
\frac{\partial}{\partial \gamma} \sum_{i=1}^{n} L(y_i, \gamma) = \sum_{i=1}^{n} (-y_i + p_{\gamma}) = n p_{\gamma} - \sum_{i=1}^{n} y_i \\
n p_{\gamma} - \sum_{i=1}^{n} y_i = 0 \\
p_{\gamma} = \frac{\sum_{i=1}^{n} y_i}{n}
$$

We can see that the probability that minimizes the sum of losses is the proportion of samples belonging to class 1. Let’s compute $\gamma$ from $p_{\gamma}$ using the logit function:

$$
\gamma = \log\left(\frac{p_{\gamma}}{1 - p_{\gamma}}\right)
= \log\left(\frac{\frac{\sum_{i=1}^{n} y_i}{n}}{1 - \frac{\sum_{i=1}^{n} y_i}{n}}\right)
= \log\left(\frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} (1 - y_i)}\right)
$$

The above reveals that the log-odds of the class being 1 is the initial prediction of the log-odds, $F_0(x)$ or $\gamma$, that minimizes the negative log-likelihood loss. For step 2, we need to calculate $r_{im}$:

$$
r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}
= -\left(-y_i + \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}}\right)
= y_i - p_{F_{m-1}(x_i)}
$$

We can see that we need to fit the tree to the pseudo-residuals between the class label and the probability predicted at the previous step. Then, we need to compute the output values, the $\gamma$s, to update our prediction. Each output value needs to be the value that minimizes the sum of losses:

$$
\gamma_{jm} = \argmin_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)
= \argmin_{\gamma} \sum_{x_i \in R_{jm}} \left(-y_i (F_{m-1}(x_i) + \gamma) + \log(1 + e^{F_{m-1}(x_i) + \gamma})\right)
$$

However, taking the derivative of the above and solving for $\gamma$ directly is extremely hard, so we approximate the loss with its second-order Taylor polynomial. (The reason why the second-order Taylor polynomial is a good approximation is beyond the scope of this article. You just need to know that the approximation works by matching the value, slope, and concavity (the zeroth, first, and second derivatives) at the expansion point. If you want to learn more about it, I recommend checking out Taylor series | Chapter 11, Essence of calculus from 3Blue1Brown.)

$$
L(y_i, F_{m-1}(x_i) + \gamma) \approx L(y_i, F_{m-1}(x_i)) + \frac{d}{dF} L(y_i, F_{m-1}(x_i))\,\gamma + \frac{1}{2} \frac{d^2}{dF^2} L(y_i, F_{m-1}(x_i))\,\gamma^2 \\
\frac{d}{d\gamma} L(y_i, F_{m-1}(x_i) + \gamma) \approx \frac{d}{dF} L(y_i, F_{m-1}(x_i)) + \frac{d^2}{dF^2} L(y_i, F_{m-1}(x_i))\,\gamma
$$

Setting the above equal to zero, we can solve for $\gamma$:

$$
\frac{d}{dF} L(y_i, F_{m-1}(x_i)) + \frac{d^2}{dF^2} L(y_i, F_{m-1}(x_i))\,\gamma = 0 \\
\gamma = \frac{-\frac{d}{dF} L(y_i, F_{m-1}(x_i))}{\frac{d^2}{dF^2} L(y_i, F_{m-1}(x_i))}
= \frac{y_i - p_{F_{m-1}(x_i)}}{\frac{d}{dF}\left(-y_i + \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}}\right)}
$$

We notice that the numerator is the residual we fitted with the tree. We just need to divide it by the derivative of $-y_i + p_{F_{m-1}(x_i)}$ with respect to $F$. To do so, we can express the last term as a product and use the product rule as follows:

$$
\frac{d}{dF}\left(-y_i + \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}}\right)
= \frac{d}{dF}\left(-y_i + e^{F_{m-1}(x_i)}(1 + e^{F_{m-1}(x_i)})^{-1}\right) \\
= -(1 + e^{F_{m-1}(x_i)})^{-2} e^{2F_{m-1}(x_i)} + (1 + e^{F_{m-1}(x_i)})^{-1} e^{F_{m-1}(x_i)} \\
= -\frac{e^{2F_{m-1}(x_i)}}{(1 + e^{F_{m-1}(x_i)})^2} + \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}} \\
= -\frac{e^{2F_{m-1}(x_i)}}{(1 + e^{F_{m-1}(x_i)})^2} + \frac{e^{F_{m-1}(x_i)} + e^{2F_{m-1}(x_i)}}{(1 + e^{F_{m-1}(x_i)})^2} \\
= \frac{e^{F_{m-1}(x_i)}}{(1 + e^{F_{m-1}(x_i)})^2} \\
= \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}} \cdot \frac{1}{1 + e^{F_{m-1}(x_i)}} \\
= \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}} \left(1 - \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}}\right) \\
= p_{F_{m-1}(x_i)} \big(1 - p_{F_{m-1}(x_i)}\big)
$$

The denominator turns out to be the product of the previously predicted probability $p_{F_{m-1}(x_i)}$ and $1 - p_{F_{m-1}(x_i)}$. If you carry out the same math for the sum of the approximated $L(y_i, F_{m-1}(x_i) + \gamma)$ over $R_{jm}$, you find that the derivative leads to the sum of residuals divided by the sum of the products $p_{F_{m-1}(x_i)}(1 - p_{F_{m-1}(x_i)})$. Thus, the value of $\gamma_{jm}$ that minimizes the sum of losses for the region $R_{jm}$ is:

$$
\gamma_{jm} = \frac{\sum_{x_i \in R_{jm}} \big(y_i - p_{F_{m-1}(x_i)}\big)}{\sum_{x_i \in R_{jm}} p_{F_{m-1}(x_i)} \big(1 - p_{F_{m-1}(x_i)}\big)}
$$

We can compute the above output value for each region and update $F_m(x)$ using the learning rate $\nu$.

From the above, the steps can be rewritten as follows:

Input: Data $\{(x_i, y_i)\}_{i=1}^{n}$ and a differentiable loss function: the negative log-likelihood.

Step 1: Initialize the model with a constant value, $F_0(x)$, as the log-odds of $y$ being 1.

Step 2: for $m = 1$ to $M$:

  • Compute the pseudo-residuals $r_{im} = y_i - p_{F_{m-1}(x_i)}$ for $i = 1, \dots, n$.
  • Fit a tree to the $r_{im}$ values and create terminal regions $R_{jm}$ for $j = 1, \dots, J_m$.
  • For $j = 1, \dots, J_m$, compute $\gamma_{jm}$ as the sum of residuals divided by the sum of $p_{F_{m-1}(x_i)}(1 - p_{F_{m-1}(x_i)})$ in $R_{jm}$.
  • Update $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.
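
Here is a minimal from-scratch sketch of these classification steps. The function names are my own; overwriting `tree.tree_.value` to set the leaf outputs is a convenient hack that relies on scikit-learn internals, and the sketch has no safeguard against degenerate leaves:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boost_classifier(X, y, M=100, nu=0.1, max_leaf_nodes=16):
    """Sketch of the binary-classification steps above (labels y in {0, 1})."""
    F0 = np.log(y.sum() / (len(y) - y.sum()))      # Step 1: initial log-odds
    F = np.full(len(y), F0)
    trees = []
    for m in range(M):                             # Step 2
        p = 1.0 / (1.0 + np.exp(-F))               # probability from current log-odds
        r = y - p                                  # pseudo-residuals r_im
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, r)
        # Overwrite each leaf with gamma_jm = sum(residuals) / sum(p * (1 - p))
        leaf_of = tree.apply(X)                    # leaf index of every training sample
        for leaf in np.unique(leaf_of):
            in_leaf = leaf_of == leaf
            gamma = r[in_leaf].sum() / (p[in_leaf] * (1 - p[in_leaf])).sum()
            tree.tree_.value[leaf, 0, 0] = gamma   # relies on sklearn internals
        F = F + nu * tree.predict(X)               # update the log-odds
        trees.append(tree)
    return F0, trees

def predict_proba(X, F0, trees, nu=0.1):
    F = F0 + nu * sum(tree.predict(X) for tree in trees)
    return 1.0 / (1.0 + np.exp(-F))                # convert log-odds back to probability
```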

The logic behind gradient boosting for multi-class classification is similar to what we saw above. We use the softmax function instead of the logistic function to map the predicted log-odds (one per class) to probabilities and fit the trees using the same methods. As the article is getting quite lengthy, I will leave the multi-class classification and the code implementation to you. You can use GradientBoostingRegressor and GradientBoostingClassifier from the sklearn library to apply the above techniques.
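
For reference, here is a minimal usage sketch with scikit-learn, with hyperparameters chosen to mirror the typical values mentioned earlier (the datasets are synthetic placeholders):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Regression: ~100 small trees, learning rate between 0 and 1, 8-32 leaf nodes
X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)
reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_leaf_nodes=16)
reg.fit(X_reg, y_reg)

# Binary classification with the negative log-likelihood loss
X_clf, y_clf = make_classification(n_samples=500, n_features=10, random_state=0)
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_leaf_nodes=16)
clf.fit(X_clf, y_clf)

print(reg.predict(X_reg[:3]), clf.predict_proba(X_clf[:3]))
```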

Pseudo-Residual for Binary Classification

When we computed the derivative of the approximation of the loss function with respect to $\gamma$, we arrived at the residuals divided by the product of the previously predicted probability and 1 minus that probability. It makes sense mathematically, but how does dividing by this term affect the output values? To understand the impact, let’s plot the function $p(1-p)$.
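
A quick matplotlib sketch (not part of the original post) to reproduce the curve:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0, 1, 200)
plt.plot(p, p * (1 - p))          # 0 at p=0 and p=1, maximum of 0.25 at p=0.5
plt.xlabel("p")
plt.ylabel("p(1 - p)")
plt.title("Denominator of the leaf output value")
plt.show()
```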

As you can see from the above, it has a nice curve that equals 0 when $p = 0$ or $1$ and reaches its maximum at $p = 0.5$. Since it sits in the denominator, the output values will be larger when the previous prediction was close to an extreme value like 0 or 1, and relatively smaller when the previous prediction was around 0.5. This aligns with how we want the model to learn from the data: we want it to adjust quickly when the prediction is extreme and on the wrong side, while moving cautiously toward extreme predictions when the current prediction is around 0.5, in order to avoid overfitting. Thus, dividing the residual by $p(1-p)$ makes sense conceptually as well.

Resources