Road to ML Engineer #3 - Logistic Regression

Last Edited: 7/23/2024

This blog post introduces logistic regression.


Logistic regression builds on the prerequisites covered in the previous article, "Road to ML Engineer #2 - Logistic Regression Prerequisites." If you haven't read that article, make sure you check it out before proceeding with this one.

Step 1. Data Exploration

Here, we will use the same dataset we used for linear regression: the Iris dataset. This dataset contains 150 rows of data, 50 for each of the three species in the Iris genus, namely Iris Setosa, Iris Versicolor, and Iris Virginica. To load the dataset, use the following code:

# Import pandas and the dataset loader from sklearn
import pandas as pd
from sklearn.datasets import load_iris
 
# Load Iris Data
iris = load_iris()
 
# Creating pd DataFrames
# The below is converting species category from numerical values to characters
iris_df = pd.DataFrame(data= iris.data, columns= iris.feature_names)
target_df = pd.DataFrame(data= iris.target, columns= ['species'])
def converter(specie):
    if specie == 0:
        return 'setosa'
    elif specie == 1:
        return 'versicolor'
    else:
        return 'virginica'
target_df['species'] = target_df['species'].apply(converter)
 
# Concatenate the DataFrames
iris_df = pd.concat([iris_df, target_df], axis= 1)
 
# Display the data
iris_df

If you run the code above in Google Colab, you should see something like the following:

Iris data

Let's say our task is to classify Setosa and other species. As we did last time, let's visualize the data to gain some insights. To do so, you can use the following code:

# Import seaborn for visualization
import seaborn as sns

sns.pairplot(iris_df, hue= 'species')
Iris distribution

You can observe that the distributions of all the features differ across species. We can use this information to choose which variables to use for building a model.

Step 2. Data Preprocessing

Let's move on to data preprocessing based on the information we got from the previous step. We saw that the species could be classified better if we use all of the variables. Hence, we do not have to drop any columns from the data. However, we still have some work to do.

First, we need to convert the species variable back to a numerical variable. In doing so, we need to treat Versicolor and Virginica the same. Also, we need to isolate the species column.

import numpy as np

# Keep only the features and rebuild the numerical species column
iris_df.drop('species', axis= 1, inplace= True)
target_df = pd.DataFrame(columns= ['species'], data= iris.target)
 
# Set Setosa = 1, Others = 0
def converter(specie):
    if specie == 0:
        return 1
    else:
        return 0
 
X = np.array(iris_df)
y = np.array(target_df['species'].apply(converter))

We tend to use X for explanatory variables and y for the outcome variable we want to classify. Additionally, we need to split the dataset into training data and testing data.

from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.33, random_state= 101)

In the code above, we are splitting the dataset randomly so that the training data and testing data will have 67% and 33% of the original dataset, respectively.
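If you want to confirm the split, you can check the shapes of the resulting arrays. Since the Iris dataset has 150 samples, you should see roughly 100 training samples and 50 testing samples:

# Quick check of the split sizes (150 samples split roughly 67% / 33%)
print(X_train.shape, X_test.shape) # expected: (100, 4) (50, 4)
print(y_train.shape, y_test.shape) # expected: (100,) (50,)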

Step 3. Model

Problem Definition

Before building the model, let's plot Setosa vs. Sepal Width to see what we are dealing with here.

Setosa vs Sepal Width

We can see how we can fit the logistic function to the data, as shown below.

Setosa vs Sepal Width with Logistic Function

If we can fit the logistic function well to the data, we can classify if the species is Setosa by plugging in the value of sepal width and comparing the estimated probability to the threshold (typically set to 0.5). How do we then fit the logistic function to the data?

Model Definition

Remember the equation of the logistic function:

$$ f(x) = \frac{L}{1+e^{-k(x-x_0)}} $$

In the case of only using sepal width, $x$ is the sepal width. As we want to predict the probability of the species being Setosa or not, we can set $L$ to be 1. When we rearrange the equation, we can rewrite it as follows.

$$ p(x) = \frac{1}{1+e^{-\beta_1 x-\beta_0}} $$

If you recall the discussion of the logistic function, the term $\beta_1 x+\beta_0$ corresponds to the log-odds. Thus, to fit the logistic function well to the data, we can simply fit

$$ \log\left(\frac{p(x)}{1-p(x)}\right) = \beta_1 x+\beta_0 $$

by adjusting the $\beta$s. This means we can simply take the linear regression of sepal width against the log-odds of the species being Setosa to fit the logistic function to the data! When we use multiple explanatory variables, we can simply fit the following, where $sl$, $sw$, $pl$, and $pw$ stand for sepal length, sepal width, petal length, and petal width:

$$ \log\left(\frac{p(x)}{1-p(x)}\right) = \beta_1 sl+\beta_2 sw+\beta_3 pl+\beta_4 pw+\beta_0 $$
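To make the classification step concrete, here is a minimal sketch of how a fitted model of this form would be used. The $\beta$ values below are made up purely for illustration; they are not fitted coefficients.

# Minimal sketch: plug the features into the linear log-odds equation,
# squash with the logistic function, and compare against a 0.5 threshold.
# The beta values are hypothetical, chosen only for illustration.
import numpy as np

betas = np.array([0.5, 2.0, -1.5, -1.0]) # hypothetical slopes for sl, sw, pl, pw
beta_0 = 1.0                             # hypothetical intercept

x = np.array([5.1, 3.5, 1.4, 0.2])       # one flower's measurements (cm)
log_odds = betas @ x + beta_0
p_setosa = 1 / (1 + np.exp(-log_odds))
is_setosa = p_setosa >= 0.5              # classify as Setosa if p >= 0.5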

Cost Function

When we covered linear regression, we used MSE as the cost function to optimize. However, we cannot use MSE here because the labels are 0 and 1, whose log-odds are $-\infty$ and $\infty$, respectively. Any set of $\beta$s will result in an MSE of $\infty$. Instead of MSE, we can use likelihood. If you remember from the last article, we can perform maximum likelihood estimation (MLE) to get the set of parameters that best approximates the model by which the data is generated. Let's determine the likelihood function for this situation.
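As a quick aside, you can see the problem with MSE directly in code: the log-odds of the 0/1 labels themselves are already infinite.

# Why regressing directly on the log-odds of 0/1 labels fails:
# the log-odds of the labels themselves are -inf and +inf.
import numpy as np

labels = np.array([0.0, 1.0])
with np.errstate(divide='ignore'):
    print(np.log(labels / (1 - labels))) # [-inf  inf]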

The more samples the model classifies correctly, the higher the likelihood of the model given the data should be. This means that for samples labeled as 1 ($y_i = 1$), the predicted probability of the label being 1 ($p(x_i)$) should be high, and for samples labeled as 0 ($y_i = 0$), the predicted probability of the label being 0 ($1-p(x_i)$) should be high. Hence, we can maximize the following:

$$ L(\beta) = \prod_{y_i=1}p(x_i) \cdot \prod_{y_i=0}(1-p(x_i)) $$

We can simplify the equation as follows:

$$ L(\beta) = \prod_{i=1}^{n}p(x_i)^{y_i}(1-p(x_i))^{1-y_i} $$

This is because when the label of a sample is 1, $1-y_i$ becomes 0, and the whole term just becomes $p(x_i)$; when the label of a sample is 0, $y_i$ becomes 0 while $1-y_i$ becomes 1, and the whole term just becomes $1-p(x_i)$. Hence, we can maximize the above likelihood function as a cost function to evaluate the model.

However, you will need to take the derivative of the cost function with respect to the parameters $\beta$ to find the best set of parameters, and doing that for the above likelihood function is hard. Thus, we normally take the log of the likelihood to further simplify the cost function. (This is allowed because the log is monotonic.)

$$ \log(L(\beta)) = \sum_{i=1}^{n}y_i\log(p(x_i))+(1-y_i)\log(1-p(x_i)) $$

Also, we usually want to minimize the cost function instead of maximizing it. As maximizing the log-likelihood is the same as minimizing the negative log-likelihood, we use the negative log-likelihood as a cost function.

$$ J(\beta) = \sum_{i=1}^{n}-y_i\log(p(x_i))-(1-y_i)\log(1-p(x_i)) $$

Haven’t you seen the above somewhere else? Yes, it is very similar to the cross-entropy function. In fact, the first term of the negative log-likelihood is the same as the cross-entropy between the sample labels and predicted probabilities!

$$ H(y, p) = -\sum_{i=1}^{n}y_i\log(p(x_i)) $$

The second term can be thought of as the cross-entropy for the other class (label 0), so the negative log-likelihood and binary cross-entropy are essentially the same thing.
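As a small numerical check of this equivalence, we can compare the negative log-likelihood computed from its definition with scikit-learn's log_loss, which implements binary cross-entropy. The labels and probabilities below are made-up values used only for illustration.

# Verify numerically that the negative log-likelihood equals the (summed)
# binary cross-entropy. y and p are arbitrary illustrative values.
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.4]) # hypothetical predicted probabilities

nll = np.sum(-y * np.log(p) - (1 - y) * np.log(1 - p))
print(nll, log_loss(y, p, normalize=False)) # the two values should match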

Learning Mechanism

There are multiple ways of performing maximum likelihood estimation or minimizing the cost function here, but let's use gradient descent. To implement gradient descent, we need to take the partial derivatives of the cost function with respect to the parameters $\beta$. Let's first express the negative log-likelihood in terms of the $\beta$s.

$$
\begin{aligned}
J(\beta) &= \sum_{i=1}^{n}-y_i\log\left(\frac{1}{1+e^{-\beta x_i}}\right)-(1-y_i)\log\left(1-\frac{1}{1+e^{-\beta x_i}}\right) \\
&= \sum_{i=1}^{n}-y_i\log\left(\frac{1}{1+e^{-\beta x_i}}\right)-(1-y_i)\log\left(\frac{e^{-\beta x_i}}{1+e^{-\beta x_i}}\right) \\
&= \sum_{i=1}^{n}y_i\left[\log\left(\frac{e^{-\beta x_i}}{1+e^{-\beta x_i}}\right)-\log\left(\frac{1}{1+e^{-\beta x_i}}\right)\right]-\log\left(\frac{e^{-\beta x_i}}{1+e^{-\beta x_i}}\right) \\
&= \sum_{i=1}^{n}y_i\left[\log(e^{-\beta x_i})-\log(1+e^{-\beta x_i})-\log(1)+\log(1+e^{-\beta x_i})\right]-\log\left(\frac{e^{-\beta x_i}}{1+e^{-\beta x_i}}\cdot\frac{e^{\beta x_i}}{e^{\beta x_i}}\right) \\
&= \sum_{i=1}^{n}y_i\log(e^{-\beta x_i})-\log\left(\frac{1}{1+e^{\beta x_i}}\right) \\
&= \sum_{i=1}^{n}-y_i\beta x_i+\log(1+e^{\beta x_i})
\end{aligned}
$$

where $\beta x_i$ is an abbreviation of the linear equation for the log-odds. Let's take the partial derivative of the above cost function with respect to the slopes and intercept using the chain rule. For the slopes,

$$
\begin{aligned}
\frac{\partial J}{\partial\beta} &= \sum_{i=1}^{n}-y_ix_i+\frac{1}{1+e^{\beta x_i}}e^{\beta x_i}x_i \\
&= \sum_{i=1}^{n}-y_ix_i+\frac{e^{\beta x_i}}{1+e^{\beta x_i}}x_i \\
&= \sum_{i=1}^{n}-y_ix_i+\frac{1}{1+e^{-\beta x_i}}x_i \\
&= \sum_{i=1}^{n}-y_ix_i+p(x_i)x_i \\
&= \sum_{i=1}^{n}(p(x_i)-y_i)x_i
\end{aligned}
$$

For the intercept,

$$
\begin{aligned}
\frac{\partial J}{\partial\beta_0} &= \sum_{i=1}^{n}-y_i+\frac{1}{1+e^{\beta x_i}}e^{\beta x_i} \\
&= \sum_{i=1}^{n}-y_i+\frac{e^{\beta x_i}}{1+e^{\beta x_i}} \\
&= \sum_{i=1}^{n}-y_i+\frac{1}{1+e^{-\beta x_i}} \\
&= \sum_{i=1}^{n}p(x_i)-y_i
\end{aligned}
$$

We can see that the partial derivative with respect to the intercept is just the sum of the differences between the predicted probabilities and the actual labels, and the partial derivative with respect to each slope is the sum of those differences multiplied by the corresponding $x$. This makes the code implementation very easy.
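Before implementing gradient descent, it can be reassuring to verify the derived gradient numerically. Below is a small finite-difference check on made-up toy data, ignoring the intercept for brevity; the names X_toy, y_toy, and nll exist only for this illustration.

# Finite-difference check that the analytic gradient sum_i (p(x_i) - y_i) x_i
# matches the numerical gradient of the negative log-likelihood on toy data.
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 4))     # 20 toy samples, 4 features
y_toy = rng.integers(0, 2, size=20)  # toy 0/1 labels
beta = rng.normal(size=4)            # arbitrary slopes to check at

def nll(b):
    p = 1 / (1 + np.exp(-(X_toy @ b)))
    return np.sum(-y_toy * np.log(p) - (1 - y_toy) * np.log(1 - p))

p = 1 / (1 + np.exp(-(X_toy @ beta)))
analytic = np.sum((p - y_toy)[:, np.newaxis] * X_toy, axis=0)

eps = 1e-6
numerical = np.array([
    (nll(beta + eps * np.eye(4)[j]) - nll(beta - eps * np.eye(4)[j])) / (2 * eps)
    for j in range(4)
])
print(np.allclose(analytic, numerical)) # should print True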

Code Implementation

By using the above information, let's create the model LogisticRegressionGD.

from sklearn.metrics import log_loss # loss used to track training progress

class LogisticRegressionGD():
  def __init__(self, lr=0.001):
    self.W = np.zeros(X.shape[1]) # Slopes, one per feature (uses the global X defined above)
    self.b = 0 # Intercept
    self.lr = lr # Learning rate
    self.history = [] # History of loss
 
  def predict(self, X):
    # Log-odds is the linear combination of the features plus the intercept
    log_odds = np.sum(self.W*X, axis=1) + self.b
    # The logistic (sigmoid) function converts log-odds into probabilities
    return 1 / (1 + np.exp(-log_odds))
 
  def fit(self, X, y, epochs=200):
    for i in range(epochs):
      pred = self.predict(X)
 
      # Track the loss (log loss, i.e. averaged negative log-likelihood)
      self.history.append(log_loss(y, pred))
 
      # Gradients derived above: sum of (p(x_i) - y_i) and sum of (p(x_i) - y_i) * x_i
      diff = pred - y
      grad_W = np.sum(diff[:, np.newaxis]*X, axis=0)
      grad_b = np.sum(diff)
 
      self.W -= self.lr * grad_W
      self.b -= self.lr * grad_b
    return self.history

When you look closely at the code, you might notice that it is extremely similar to LinearRegressionGD. In fact, only the predict method, loss calculation, and gradients are different, while everything else is kept the same. The only difference in gradient calculations is that LogisticRegressionGD does not divide the gradient by the number of data points. How amazing is that!

You can initialize LogisticRegressionGD and fit it to the training data as follows:

lr = LogisticRegressionGD()
history = lr.fit(X_train, y_train, epochs=200)

As we keep track of the negative log-likelihood loss over each epoch, we can plot it to see how gradient descent is working.

import matplotlib.pyplot as plt

plt.plot(history)
plt.title("Loss vs Epoch")
plt.ylabel("Loss")
plt.xlabel("Epoch")
plt.show()
Iris GD plot

We can see from the above plot that the loss gradually decreases over each epoch, which is a good sign that gradient descent is working. As this article is getting long, and there is a lot of content I want to cover in the final step of the pipeline, evaluation, I will leave that step for a future article.
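That said, if you want a quick sanity check of the implementation itself (separate from proper evaluation), one option is to compare the learned coefficients with scikit-learn's LogisticRegression. Keep in mind that this is only a rough comparison: scikit-learn applies L2 regularization by default, so its coefficients will not match ours exactly.

# Optional sanity check: compare our gradient-descent estimates with
# scikit-learn's LogisticRegression (regularized by default, so expect
# similar but not identical numbers).
from sklearn.linear_model import LogisticRegression

sk_lr = LogisticRegression(max_iter=1000)
sk_lr.fit(X_train, y_train)

print(sk_lr.coef_, sk_lr.intercept_) # scikit-learn's slopes and intercept
print(lr.W, lr.b)                    # our estimates from LogisticRegressionGD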

[Fun Fact] Sigmoid Function

I discovered a video by Waite, E. (2020) on why we use the sigmoid function for binary classification, and it brings up an interesting interpretation of the sigmoid function. It first explains from a practical standpoint that the sigmoid is chosen because its derivative is easy to compute and its shape works well for gradient descent, which makes perfect sense given our discussion above.

Secondly, he adds that when we have two normal distributions for two classes with the same variance and compute the probability that a given sample came from one of them, we get exactly the sigmoid curve. He shows the mathematical derivation in the video, so check it out via the link below if you are interested. I had never thought of that angle, and I personally think it is beautiful that we can justify the use of the sigmoid function from a statistical point of view.
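For reference, here is my own quick sketch of the idea (a summary under the equal-variance Gaussian assumption, not a transcription of the video). By Bayes' rule, the posterior probability of class $C_1$ is

$$ P(C_1 \mid x) = \frac{p(x \mid C_1)P(C_1)}{p(x \mid C_1)P(C_1) + p(x \mid C_0)P(C_0)} = \frac{1}{1+e^{-a}}, \quad a = \log\frac{p(x \mid C_1)P(C_1)}{p(x \mid C_0)P(C_0)} $$

and when both class-conditional densities are Gaussians with the same variance, the quadratic terms in $x$ cancel inside $a$, leaving $a = \beta_1 x + \beta_0$, which is exactly the logistic model we fit above.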

Resources