Road to ML Engineer #10 - Gradient Boost

Last Edited: 8/13/2024

This blog post introduces how gradient boost works in machine learning.


While there are many mechanisms, such as pruning, to prevent trees from overfitting, decision trees still tend to struggle with high-dimensional and complex data. This happens because the trees inevitably need to become more complex in order to model such data accurately, without underfitting.

Bagging & Boosting

How do we maintain the simplicity of the model while ensuring it can fit well on high-dimensional and complex data? One solution to this problem is bagging, where we build many simple trees and combine their results to make a prediction. The most intuitive way to do this is by taking random samples and random features to train many trees, and then making predictions based on a majority vote. This is called Random Forest, which is a surprisingly effective technique (I won’t cover it here).

(Figure: Bagging vs. Boosting)

Another method of combining small trees is to create a chain of small trees, where each tree tries to correct the errors made by the previous ones. This technique is called boosting, and it turns out to be quite effective in learning highly complex data. In this article, I would like to dive deep into boosting by looking at an example known as Gradient Boost.

Gradient Boost

The algorithm for gradient boost can be defined as follows:

Input: Data $\{(x_i, y_i)\}_{i=1}^{n}$ and a differentiable loss function $L(y, F(x))$.

Step 1: Initialize the model with a constant value: $F_0(x) = \argmin_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$

Step 2: for $m = 1$ to $M$:

  • Compute $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$ for $i = 1, \dots, n$.
  • Fit a tree to the $r_{im}$ values and create terminal regions $R_{jm}$ for $j = 1, \dots, J_m$.
  • For $j = 1, \dots, J_m$, compute $\gamma_{jm} = \argmin_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$.
  • Update $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.

In this context, $x_i$ and $y_i$ represent the explanatory variables and the target variable for each data point $i$. We start with a constant initial prediction $F_0(x)$ and generate $M$ trees, each fit to the residuals of the previous prediction, $r_{im}$. (The mechanism for fitting a tree is covered in a previous article on decision trees.)

The trees are chained together by adding the output values of each tree, scaled by the learning rate $\nu$, to the previous prediction. Typically, $M$ is set to around 100, $\nu$ is set to a value between 0 and 1, and the trees are constructed with 8 to 32 leaf nodes. It has been found empirically that gradient boost works best when many small trees are combined in this manner.

You might notice that the algorithm is similar to gradient descent, in that we gradually update the model by taking the gradient (the residual between the true and predicted values) and multiplying it by a learning rate. It turns out that this mechanism also works well with trees.
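
To make the parallel concrete, here is a toy sketch (made-up numbers, half-squared-error loss) that places a gradient descent update on a single parameter next to the boosting update on predictions. In real gradient boosting, a tree approximates the residual instead of using it directly.

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0])          # made-up targets

# Gradient descent on a parameter: theta <- theta - lr * dL/dtheta
theta, lr = 0.0, 0.1
for _ in range(100):
    grad = np.mean(theta - y)          # dL/dtheta for L = 1/2 * mean((y - theta)^2)
    theta = theta - lr * grad          # step against the gradient

# Boosting on predictions: F <- F + nu * (negative gradient of the loss w.r.t. F)
F, nu = np.full_like(y, y.mean()), 0.1 # F_0(x): constant initial prediction
for _ in range(100):
    residual = y - F                   # negative gradient; a tree would approximate this
    F = F + nu * residual              # F_m = F_{m-1} + nu * correction

print(theta, F)                        # theta -> mean(y); F -> y, one small step at a time
```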

Regression

Let’s see how gradient boost works in regression. While the algorithm may seem complex, it is surprisingly simple to apply to regression problems. First, let’s set the loss function to be half of the squared residuals.

$$
L(y, F(x)) = \frac{1}{2} (y - F(x))^2
$$

We use half of the squared residuals because the constant 2 cancels out when taking the derivative to find the $\gamma$ that minimizes the loss. Now, let’s move to step 1, where we set the initial prediction $F_0(x)$ such that it satisfies the following:

$$
F_0(x) = \argmin_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)
$$

We can take the derivative of the above with respect to $\gamma$, set it equal to 0, and solve for the $\gamma$ that minimizes the sum of losses.

$$
\frac{\partial}{\partial \gamma} L(y, \gamma) = -(y - \gamma) \\
\frac{\partial}{\partial \gamma} \sum_{i=1}^{n} L(y_i, \gamma) = \sum_{i=1}^{n} (\gamma - y_i) = n\gamma - \sum_{i=1}^{n} y_i \\
n\gamma - \sum_{i=1}^{n} y_i = 0 \\
\gamma = \frac{\sum_{i=1}^{n} y_i}{n}
$$

Hence, the initial prediction $F_0(x)$ is just the average of the target variable:

$$
F_0(x) = \frac{\sum_{i=1}^{n} y_i}{n} \; (= \bar{y})
$$
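
As a quick sanity check, a brute-force search over candidate constants (a toy sketch with made-up numbers) lands on the same answer as the closed-form mean:

```python
import numpy as np

y = np.array([2.0, 4.0, 9.0])                      # made-up targets
candidates = np.linspace(0, 10, 1001)              # candidate constant predictions gamma
losses = [0.5 * np.sum((y - g) ** 2) for g in candidates]
best = candidates[int(np.argmin(losses))]
print(best, y.mean())                              # both are 5.0
```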

Now, we are ready to move to step 2. First, we need to compute the residuals $r_{im}$:

$$
r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} = -(F_{m-1}(x_i) - y_i) = y_i - F_{m-1}(x_i)
$$

We can fit a new tree $m$ to the residuals computed above. Then, for each leaf node, we compute $\gamma$ using the following:

$$
\gamma_{jm} = \argmin_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)
$$

We can take the derivative of the sum of losses with respect to $\gamma$ to find the $\gamma$ that minimizes the sum of losses:

$$
L(y, F_{m-1}(x) + \gamma) = \frac{1}{2} \big(y - (F_{m-1}(x) + \gamma)\big)^2 \\
\frac{\partial}{\partial \gamma} L = (F_{m-1}(x) + \gamma) - y \\
\frac{\partial}{\partial \gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma) = \sum_{x_i \in R_{jm}} \big((F_{m-1}(x_i) + \gamma) - y_i\big) \\
= \gamma \sum_{i=1}^{n} I(x_i \in R_{jm}) + \sum_{x_i \in R_{jm}} \big(F_{m-1}(x_i) - y_i\big) \\
\gamma_{jm} \sum_{i=1}^{n} I(x_i \in R_{jm}) + \sum_{x_i \in R_{jm}} \big(F_{m-1}(x_i) - y_i\big) = 0 \\
\gamma_{jm} = \frac{\sum_{x_i \in R_{jm}} \big(y_i - F_{m-1}(x_i)\big)}{\sum_{i=1}^{n} I(x_i \in R_{jm})}
$$

From the above, we can see that $\gamma_{jm}$ is simply the average residual in $R_{jm}$. We use these $\gamma$s as the output values of the leaf nodes in each tree. Then, the output of the new tree is multiplied by the learning rate $\nu$ and added to the previous prediction to create the new $F_m(x)$.

Thus, the steps can be rewritten as follows:

Input: Data $\{(x_i, y_i)\}_{i=1}^{n}$ and a differentiable loss function: half of the squared residuals.

Step 1: Initialize the model with a constant value, $F_0(x)$, as the average value of $y$.

Step 2: for $m = 1$ to $M$:

  • Compute the pseudo-residuals $r_{im} = y_i - F_{m-1}(x_i)$ for $i = 1, \dots, n$.
  • Fit a tree to the $r_{im}$ values and create terminal regions $R_{jm}$ for $j = 1, \dots, J_m$.
  • For $j = 1, \dots, J_m$, compute $\gamma_{jm}$ as the average residual in $R_{jm}$.
  • Update $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.
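
To tie the rewritten steps together, here is a minimal from-scratch sketch for regression. The function names are my own, and a `DecisionTreeRegressor` trained with squared error already stores the mean residual in each leaf, so its raw predictions coincide with the $\gamma_{jm}$ above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boost_regressor(X, y, M=100, nu=0.1, max_leaf_nodes=16):
    """Sketch of the regression steps above: F_0 = mean(y), then M small
    trees fit to the residuals, each scaled by the learning rate nu."""
    F0 = y.mean()                                   # Step 1: constant prediction
    F = np.full(len(y), F0)
    trees = []
    for m in range(M):                              # Step 2
        r = y - F                                   # pseudo-residuals r_im
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, r)                              # leaf value = mean residual = gamma_jm
        F = F + nu * tree.predict(X)                # F_m = F_{m-1} + nu * gamma_jm
        trees.append(tree)
    return F0, trees

def predict_gradient_boost_regressor(X, F0, trees, nu=0.1):
    # nu must match the value used during fitting
    return F0 + nu * sum(tree.predict(X) for tree in trees)

# Toy usage with made-up data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
F0, trees = fit_gradient_boost_regressor(X, y)
print(predict_gradient_boost_regressor(X[:5], F0, trees))
```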

Binary Classification

The strategy for gradient boost in binary classification is similar to that in regression, but there are a few important details to note. Just like in logistic regression, the model predicts the log-odds and is trained by minimizing the negative log-likelihood:

$$
p = \frac{e^{F(x)}}{1 + e^{F(x)}} \\
F(x) = \log\left(\frac{p}{1-p}\right) \\
L(y_i, F(x)) = -\big(y_i \log(p) + (1 - y_i)\log(1-p)\big)
$$

Let’s express the negative log-likelihood in terms of $F(x)$, the predicted log-odds, instead of $p$:

$$
L(y_i, F(x)) = -\big(y_i \log(p) + (1 - y_i)\log(1-p)\big) \\
= -\big(y_i (\log(p) - \log(1-p)) + \log(1-p)\big) \\
= -\big(y_i F(x) + \log(1-p)\big) \\
= -\left(y_i F(x) + \log\left(1 - \frac{e^{F(x)}}{1 + e^{F(x)}}\right)\right) \\
= -\left(y_i F(x) + \log\left(\frac{1}{1 + e^{F(x)}}\right)\right) \\
= -\big(y_i F(x) - \log(1 + e^{F(x)})\big) \\
= -y_i F(x) + \log(1 + e^{F(x)})
$$

For step 1, we need to take the derivative of the loss function $L(y_i, \gamma)$ with respect to $\gamma$:

$$
L(y_i, \gamma) = -y_i \gamma + \log(1 + e^{\gamma}) \\
\frac{\partial}{\partial \gamma} L(y_i, \gamma) = -y_i + \frac{e^{\gamma}}{1 + e^{\gamma}} = -y_i + p_{\gamma}
$$

$p_\gamma$ is the probability associated with the log-odds $\gamma$. Then, we want to minimize the sum of losses as follows:

$$
\frac{\partial}{\partial \gamma} \sum_{i=1}^{n} L(y_i, \gamma) = \sum_{i=1}^{n} (-y_i + p_{\gamma}) = n p_{\gamma} - \sum_{i=1}^{n} y_i \\
n p_{\gamma} - \sum_{i=1}^{n} y_i = 0 \\
p_{\gamma} = \frac{\sum_{i=1}^{n} y_i}{n}
$$

We can see that the probability that minimizes the sum of losses is the proportion of samples belonging to class 1. Let’s compute $\gamma$ from $p_{\gamma}$ using the logit function:

$$
\gamma = \log\left(\frac{p_{\gamma}}{1 - p_{\gamma}}\right)
= \log\left(\frac{\frac{\sum_{i=1}^{n} y_i}{n}}{1 - \frac{\sum_{i=1}^{n} y_i}{n}}\right)
= \log\left(\frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} (1 - y_i)}\right)
$$

The above reveals that the log-odds of the class being 1 is the initial prediction of the log-odds, $F_0(x)$ or $\gamma$, that minimizes the negative log-likelihood loss. For step 2, we need to calculate $r_{im}$:

$$
r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}
= -\left(-y_i + \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}}\right)
= y_i - p_{F_{m-1}(x_i)}
$$

We can see that we need to fit the tree to the pseudo-residuals between the class label and the probability predicted at the previous step. Then, we need to compute the output values, the $\gamma$s, to update our prediction. Each output value needs to be the value that minimizes the sum of losses:

$$
\gamma_{jm} = \argmin_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)
= \argmin_{\gamma} \sum_{x_i \in R_{jm}} \left(-y_i (F_{m-1}(x_i) + \gamma) + \log(1 + e^{F_{m-1}(x_i) + \gamma})\right)
$$

However, taking the derivative of the above and solving for $\gamma$ directly is extremely hard, so we approximate the loss with its second-order Taylor polynomial. (The reason why the second-order Taylor polynomial is a good approximation is beyond the scope of this article. You just need to know that the approximation works by matching the value, slope, and concavity (the zeroth, first, and second derivatives) at the expansion point. If you want to learn more about it, I recommend checking out Taylor series | Chapter 11, Essence of calculus from 3Blue1Brown.)

$$
L(y_i, F_{m-1}(x_i) + \gamma) \approx L(y_i, F_{m-1}(x_i)) + \frac{d}{dF} L(y_i, F_{m-1}(x_i))\,\gamma + \frac{1}{2} \frac{d^2}{dF^2} L(y_i, F_{m-1}(x_i))\,\gamma^2 \\
\frac{d}{d\gamma} L(y_i, F_{m-1}(x_i) + \gamma) \approx \frac{d}{dF} L(y_i, F_{m-1}(x_i)) + \frac{d^2}{dF^2} L(y_i, F_{m-1}(x_i))\,\gamma
$$

Setting the above equal to zero, we can solve for $\gamma$:

$$
\frac{d}{dF} L(y_i, F_{m-1}(x_i)) + \frac{d^2}{dF^2} L(y_i, F_{m-1}(x_i))\,\gamma = 0 \\
\gamma = \frac{-\frac{d}{dF} L(y_i, F_{m-1}(x_i))}{\frac{d^2}{dF^2} L(y_i, F_{m-1}(x_i))}
= \frac{y_i - p_{F_{m-1}(x_i)}}{\frac{d}{dF}\left(-y_i + \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}}\right)}
$$

We notice that the numerator is the residual we fitted with the tree. We just need to divide it by the derivative of $-y_i + p_{F_{m-1}(x_i)}$ with respect to $F$. To do so, we can express the last term as a product and use the product rule as follows:

$$
\frac{d}{dF}\left(-y_i + \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}}\right)
= \frac{d}{dF}\left(-y_i + e^{F_{m-1}(x_i)}(1 + e^{F_{m-1}(x_i)})^{-1}\right) \\
= -(1 + e^{F_{m-1}(x_i)})^{-2} e^{2F_{m-1}(x_i)} + (1 + e^{F_{m-1}(x_i)})^{-1} e^{F_{m-1}(x_i)} \\
= -\frac{e^{2F_{m-1}(x_i)}}{(1 + e^{F_{m-1}(x_i)})^2} + \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}} \\
= -\frac{e^{2F_{m-1}(x_i)}}{(1 + e^{F_{m-1}(x_i)})^2} + \frac{e^{F_{m-1}(x_i)} + e^{2F_{m-1}(x_i)}}{(1 + e^{F_{m-1}(x_i)})^2} \\
= \frac{e^{F_{m-1}(x_i)}}{(1 + e^{F_{m-1}(x_i)})^2} \\
= \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}} \cdot \frac{1}{1 + e^{F_{m-1}(x_i)}} \\
= \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}} \left(1 - \frac{e^{F_{m-1}(x_i)}}{1 + e^{F_{m-1}(x_i)}}\right) \\
= p_{F_{m-1}(x_i)} \big(1 - p_{F_{m-1}(x_i)}\big)
$$

The denominator turns out to be the product of the previously predicted probability $p_{F_{m-1}(x_i)}$ and $1 - p_{F_{m-1}(x_i)}$. If you carry out the same math for the sum of the approximated $L(y_i, F_{m-1}(x_i) + \gamma)$ over $R_{jm}$, you find that the derivative leads to the sum of residuals divided by the sum of the products $p_{F_{m-1}(x_i)}(1 - p_{F_{m-1}(x_i)})$. Thus, the value of $\gamma_{jm}$ that minimizes the sum of losses for the region $R_{jm}$ is:

$$
\gamma_{jm} = \frac{\sum_{x_i \in R_{jm}} \big(y_i - p_{F_{m-1}(x_i)}\big)}{\sum_{x_i \in R_{jm}} p_{F_{m-1}(x_i)} \big(1 - p_{F_{m-1}(x_i)}\big)}
$$

We can compute the above output value for each region and update $F_m(x)$ using the learning rate $\nu$.

From the above, the steps can be rewritten as follows:

Input: Data $\{(x_i, y_i)\}_{i=1}^{n}$ and a differentiable loss function: the negative log-likelihood.

Step 1: Initialize the model with a constant value, $F_0(x)$, as the log-odds of $y$ being 1.

Step 2: for $m = 1$ to $M$:

  • Compute the pseudo-residuals $r_{im} = y_i - p_{F_{m-1}(x_i)}$ for $i = 1, \dots, n$.
  • Fit a tree to the $r_{im}$ values and create terminal regions $R_{jm}$ for $j = 1, \dots, J_m$.
  • For $j = 1, \dots, J_m$, compute $\gamma_{jm}$ as the sum of residuals divided by the sum of $p_{F_{m-1}(x_i)}(1 - p_{F_{m-1}(x_i)})$ in $R_{jm}$.
  • Update $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.
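
Here is a minimal from-scratch sketch of these classification steps. The function names are my own; overwriting `tree.tree_.value` to set the leaf outputs is a convenient hack that relies on scikit-learn internals, and the sketch has no safeguard against degenerate leaves:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boost_classifier(X, y, M=100, nu=0.1, max_leaf_nodes=16):
    """Sketch of the binary-classification steps above (labels y in {0, 1})."""
    F0 = np.log(y.sum() / (len(y) - y.sum()))      # Step 1: initial log-odds
    F = np.full(len(y), F0)
    trees = []
    for m in range(M):                             # Step 2
        p = 1.0 / (1.0 + np.exp(-F))               # probability from current log-odds
        r = y - p                                  # pseudo-residuals r_im
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, r)
        # Overwrite each leaf with gamma_jm = sum(residuals) / sum(p * (1 - p))
        leaf_of = tree.apply(X)                    # leaf index of every training sample
        for leaf in np.unique(leaf_of):
            in_leaf = leaf_of == leaf
            gamma = r[in_leaf].sum() / (p[in_leaf] * (1 - p[in_leaf])).sum()
            tree.tree_.value[leaf, 0, 0] = gamma   # relies on sklearn internals
        F = F + nu * tree.predict(X)               # update the log-odds
        trees.append(tree)
    return F0, trees

def predict_proba(X, F0, trees, nu=0.1):
    F = F0 + nu * sum(tree.predict(X) for tree in trees)
    return 1.0 / (1.0 + np.exp(-F))                # convert log-odds back to probability
```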

The logic behind gradient boosting for multi-class classification is similar to what we saw above. We use the softmax function instead of the logistic function to map the predicted log-odds (one per class) to probabilities and fit the trees using the same methods. As the article is getting quite lengthy, I will leave the multi-class classification and the code implementation to you. You can use GradientBoostingRegressor and GradientBoostingClassifier from the sklearn library to apply the above techniques.
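
For reference, here is a minimal usage sketch with scikit-learn, with hyperparameters chosen to mirror the typical values mentioned earlier (the datasets are synthetic placeholders):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Regression: ~100 small trees, learning rate between 0 and 1, 8-32 leaf nodes
X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)
reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_leaf_nodes=16)
reg.fit(X_reg, y_reg)

# Binary classification with the negative log-likelihood loss
X_clf, y_clf = make_classification(n_samples=500, n_features=10, random_state=0)
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_leaf_nodes=16)
clf.fit(X_clf, y_clf)

print(reg.predict(X_reg[:3]), clf.predict_proba(X_clf[:3]))
```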

Pseudo-Residual for Binary Classification

When we computed the derivative of the approximation of the loss function with respect to $\gamma$, we arrived at the residuals divided by the product of the previously predicted probability and 1 minus that probability. It makes sense mathematically, but how does dividing by this term affect the output values? To understand the impact, let’s plot the function $p(1-p)$.
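
A quick matplotlib sketch (not part of the original post) to reproduce the curve:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0, 1, 200)
plt.plot(p, p * (1 - p))          # 0 at p=0 and p=1, maximum of 0.25 at p=0.5
plt.xlabel("p")
plt.ylabel("p(1 - p)")
plt.title("Denominator of the leaf output value")
plt.show()
```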

As you can see from the above, it has a nice curve that equals 0 when $p = 0$ or $1$ and reaches its maximum at $p = 0.5$. Since it sits in the denominator, the output values will be larger when the previous prediction was close to an extreme value like 0 or 1, and relatively smaller when the previous prediction was around 0.5. This aligns with how we want the model to learn from the data: we want it to adjust quickly when the prediction is extreme and on the wrong side, while moving cautiously toward extreme predictions when the current prediction is around 0.5, in order to avoid overfitting. Thus, dividing the residual by $p(1-p)$ makes sense conceptually as well.

Resources