Road to ML Engineer #7 - Regularization

Last Edited: 8/6/2024

This blog post introduces an important concept in machine learning: regularization.


Ridge Regression

In an extreme case where there are only two training data points, we can always draw a straight line with zero residuals and thus achieve zero bias.

However, as the above shows, having zero bias often leads to high variance and overfitting due to the bias-variance tradeoff. (If you are not sure about the bias-variance tradeoff, check out the last article on cross-validation.) How can we avoid overfitting? One way is to use Ridge Regression, also known as L2 Regularization, which adds a penalty term to the cost function.

$$ J(\phi) = MSE + \lambda \sum_s \phi_s^2 $$

where $\phi_s$ are the slopes of the linear function. By adding the squares of the slopes, we penalize steep slopes (even when a slope is negative), effectively making the prediction less sensitive to the training data. $\lambda$ is the regularization rate, which controls how much penalty is added. It is one of the hyperparameters that we can tune with cross-validation.
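To see the penalty at work, here is a minimal sketch (the data and slope values are made up for illustration and are not from the original post) that computes the ridge cost for a steep fit and a gentler fit on the same points:

import numpy as np

def ridge_cost(w, b, X, y, lam):
  # Mean squared error plus the L2 penalty: lam * sum(w^2)
  pred = X @ w + b
  return np.mean((pred - y) ** 2) + lam * np.sum(w ** 2)

# Tiny made-up dataset
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.2, 1.9, 3.2, 3.9])

print(ridge_cost(np.array([1.5]), 0.0, X, y, lam=1.0))  # steep slope: pays an L2 penalty of 2.25
print(ridge_cost(np.array([0.9]), 0.3, X, y, lam=1.0))  # gentler slope: pays only 0.81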

Lasso Regression

Instead of squaring the slopes, we can add their absolute values, which also penalizes steep slopes.

$$ J(\phi) = MSE + \lambda \sum_s |\phi_s| $$

This is called Lasso Regression or L1 Regularization.

L1 vs L2 Regularization

You might think it does not matter which regularization technique to use, but there is a reason I introduced both here. When training the model, we compute the gradient of the cost function and subtract it from the parameters. Hence, let's compute the gradient with respect to the slopes for Ridge and Lasso regression. For Ridge regression (L2), the gradient looks like this:

$$ \frac{\partial}{\partial \phi} J = \frac{\partial MSE}{\partial \phi} + 2 \lambda \phi $$

For Lasso regression (L1), it looks like the following:

$$ \frac{\partial}{\partial \phi} J = \frac{\partial MSE}{\partial \phi} + \lambda \, \mathrm{sign}(\phi) $$

They are very similar, but the penalty term differs between L1 and L2 regularization. In L1 regularization, we subtract a constant $\lambda$ (in the direction of the slope's sign) regardless of how small the slope is, so the penalty keeps pushing the slope until it reaches exactly zero. In L2 regularization, however, we subtract $2 \lambda \phi$, so the penalty gets smaller as the slope gets smaller. Thus, while L2 regularization encourages slopes to be close to zero, it does not make them exactly zero.
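To see this difference in action, here is a minimal sketch (made-up learning rate, $\lambda$, and starting slope) that applies only the penalty part of the update and ignores the MSE gradient:

import numpy as np

lr, lam = 0.1, 1.0
w_l1, w_l2 = 0.3, 0.3

for _ in range(100):
  # L1: constant-magnitude pull toward zero (clipped so it does not overshoot past zero)
  step = lr * lam * np.sign(w_l1)
  w_l1 = 0.0 if abs(step) >= abs(w_l1) else w_l1 - step
  # L2: pull proportional to the current slope
  w_l2 -= lr * 2 * lam * w_l2

print(w_l1)  # exactly 0.0
print(w_l2)  # tiny (0.3 * 0.8**100) but never exactly zero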

Therefore, when we suspect that some of the explanatory variables are useless and want to automatically exclude them from the model, we can use Lasso regression (L1 regularization), as it can drive the slopes of those meaningless variables to zero. When we know that all of the variables are meaningful, we can use Ridge regression (L2 regularization) to keep all of the slopes non-zero while still preventing overfitting.
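If you want to verify this behavior with an existing library before we build our own class below, scikit-learn's Ridge and Lasso (not part of the original post) make it easy to see on synthetic data where only the first two features carry signal:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features matter; the other three are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

print(Lasso(alpha=0.1).fit(X, y).coef_)  # noise features are driven to (almost exactly) zero
print(Ridge(alpha=0.1).fit(X, y).coef_)  # noise features shrink but stay non-zero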

Elastic-Net Regression

While the distinction between L1 and L2 regularization can sometimes help us choose between them, there are cases where, due to the sheer number of features, we are not even sure whether any of the explanatory variables are redundant. In that case, we can use Elastic-Net Regression, which combines the two penalties as below.

$$ J(\phi) = MSE + \lambda_1 \sum_s |\phi_s| + \lambda_2 \sum_s \phi_s^2 $$

By combining both L1 and L2, we can eliminate unnecessary explanatory variables while encouraging smaller slopes for the remaining useful ones. The downside of this approach is that we have more hyperparameters to tune and more computation to perform.
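As a quick illustration, scikit-learn's ElasticNet (again, not used in the original post) expresses the same idea through an overall strength alpha and a mixing ratio l1_ratio:

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# scikit-learn writes the combined penalty as
# alpha * (l1_ratio * sum|w| + 0.5 * (1 - l1_ratio) * sum(w^2)),
# which plays the role of lambda_1 and lambda_2 above.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)  # noise features near zero, useful slopes shrunk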

Coding Implementation

Let's incorporate a regularization feature into LinearRegressionGD.

import numpy as np
from sklearn.metrics import mean_squared_error

class LinearRegressionGD():
  def __init__(self, lr=0.01, regularization="l1", alpha=0.01, beta=0.01):
    self.W = None # Weights (initialized in fit once the number of features is known)
    self.b = 0
    self.lr = lr # Learning rate
    self.history = [] # History of loss
    self.regularization = regularization # "l1", "l2", or "elastic-net"
    self.alpha = alpha # lambda (lambda_1 for elastic-net)
    self.beta = beta # lambda_2 for elastic-net

  def predict(self, X):
    return np.sum(self.W*X, axis=1) + self.b

  def gradient(self, X, y, pred):
    n = len(y)
    diff = pred - y
    grad_W = np.sum((1/n)*diff[:, np.newaxis]*X, axis=0)
    grad_b = np.sum((1/n)*diff)

    # Regularization (penalty term of the gradient)
    if (self.regularization == "l1"):
      grad_W += self.alpha * np.sign(self.W)
    elif (self.regularization == "l2"):
      grad_W += 2 * self.alpha * self.W
    elif (self.regularization == "elastic-net"):
      grad_W += self.alpha * np.sign(self.W) + 2 * self.beta * self.W

    return grad_W, grad_b

  def fit(self, X, y, epochs=100):
    self.W = np.zeros(X.shape[1])
    for i in range(epochs):
      pred = self.predict(X)

      self.history.append(mean_squared_error(y, pred))

      grad_W, grad_b = self.gradient(X, y, pred)

      self.W -= self.lr * grad_W
      self.b -= self.lr * grad_b
    return self.history

Using the above, you can run Ridge, Lasso, and Elastic-Net regression as shown below.

l1 = LinearRegressionGD(regularization="l1", alpha=0.01)
l2 = LinearRegressionGD(regularization="l2", alpha=0.01)
en = LinearRegressionGD(regularization="elastic-net", alpha=0.01, beta=0.01)

Then, you can use cross-validation to choose the best model with the lowest variance. I recommend trying to incorporate regularization into LogisticRegressionGD and SoftmaxRegressionGD as a coding exercise.
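As a minimal sketch of that tuning step (assuming X and y are NumPy arrays holding your training data, which the post does not define here), you could wrap LinearRegressionGD in a k-fold loop and pick the alpha with the lowest average validation MSE:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def cv_mse(alpha, X, y, k=5):
  # Average validation MSE of Ridge-style (L2) models across k folds
  scores = []
  for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    model = LinearRegressionGD(regularization="l2", alpha=alpha)
    model.fit(X[train_idx], y[train_idx], epochs=100)
    scores.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
  return np.mean(scores)

alphas = [0.001, 0.01, 0.1, 1.0]
best_alpha = min(alphas, key=lambda a: cv_mse(a, X, y))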

[Fun Fact] Bayesian Interpretation

I was fortunate to discover a video from ritvikmath (2021) titled Bayesian Linear Regression : Data Science Concepts on YouTube, which presents an interesting interpretation of Ridge and Lasso regression in terms of Bayesian statistics. In short, the video explains that MSE corresponds to the negative log-likelihood of the data given the parameters, and that the Ridge and Lasso penalty terms correspond to a Gaussian prior and a Laplacian prior, respectively. It is a brilliant video, and I highly recommend checking it out.
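As a rough sketch of that connection (my own summary, assuming Gaussian noise with variance $\sigma^2$ and an independent Gaussian prior with variance $\tau^2$ on each slope), the maximum a posteriori estimate becomes:

$$ \hat{\phi} = \arg\max_\phi \, p(\phi \mid D) = \arg\min_\phi \left[ -\log p(D \mid \phi) - \log p(\phi) \right] = \arg\min_\phi \left[ \frac{1}{2\sigma^2} \sum_i (y_i - \hat{y}_i)^2 + \frac{1}{2\tau^2} \sum_s \phi_s^2 \right] $$

Up to a constant scaling, the second term is exactly the Ridge penalty with $\lambda$ proportional to $\sigma^2 / \tau^2$; swapping the Gaussian prior for a Laplacian prior turns it into the Lasso penalty.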

Resources