Road to ML Engineer #5 - Classification Model Metrics

Last Edited: 7/30/2024

This blog post introduces how classification models can be evaluated.


In the past two articles, we covered logistic regression and softmax regression for binary and multiclass classification, but we set model evaluation aside. In this article, we will look at how those models can be evaluated.

Accuracy

One intuitive way of evaluating a classification model is to calculate the accuracy of the predictions as shown below:

\text{Accuracy} = \frac{\text{\# of successful predictions}}{\text{\# of predictions}}

The metric is very straightforward to compute and easy to understand, so it is used very often. However, it might not capture the full picture of the predictions in some cases. Let's say we have 500 data points in the test dataset, containing 100 data points for Setosa and 400 data points for not Setosa. If a model A is so bad that it only predicts not Setosa for every data point, we get:

\text{Accuracy} = \frac{400}{500} = 0.8

While the model is so bad that it makes a mistake every time the species is Setosa, it still achieves 80% accuracy. Accuracy performs poorly as a metric when the dataset is imbalanced like this.
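
To make this concrete, here is a minimal sketch (toy labels, not code from the original articles) that reproduces the scenario with NumPy:

import numpy as np
 
# Toy test labels: 100 Setosa (1) and 400 not Setosa (0)
y_true = np.array([1] * 100 + [0] * 400)
 
# Model A always predicts not Setosa (0)
y_pred = np.zeros_like(y_true)
 
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # => 0.8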

Confusion Matrix

This is where the confusion matrix comes in. It lays out, in a matrix, how many data points were predicted to belong to each class versus the class they actually belong to. Below is the confusion matrix of the above scenario for model A.

Confusion Matrix of Model A
Predicted \ Actual | Setosa | Not Setosa
Setosa             | 0      | 0
Not Setosa         | 100    | 400

When the predicted and actual class is Setosa, we call it a true positive (TP) because the prediction is true in that it predicted the class to be Setosa, or positive. When the predicted and actual class is not Setosa, we call it a true negative (TN), as the prediction is true in that the class is not Setosa, or negative.

When the predicted class is Setosa but is actually not Setosa, we call it a false positive (FP) because the model falsely predicted the class to be Setosa or positive. Finally, when the predicted class is not Setosa but is actually Setosa, we call it a false negative (FN) as the model falsely predicted the class to be not Setosa or negative. Using those terms, we can rewrite the accuracy as follows:

\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN} = \frac{0+400}{0+400+0+100} = 0.8

We can see that while the model avoids false positives entirely, resulting in high accuracy, it is terrible in terms of false negatives.
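
As a small sketch of the same scenario (reusing the toy labels from above, not the article's own code), the four counts can also be pulled out of scikit-learn's confusion_matrix. Note that scikit-learn puts the actual class on the rows and the predicted class on the columns, the transpose of the tables in this article:

import numpy as np
from sklearn.metrics import confusion_matrix
 
# Same toy scenario: 100 Setosa (1), 400 not Setosa (0), model A predicts all 0
y_true = np.array([1] * 100 + [0] * 400)
y_pred = np.zeros_like(y_true)
 
# With labels [0, 1], ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)  # => 0 400 0 100
 
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # => 0.8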

Precision vs Recall

To capture a model's performance in terms of false positives and false negatives, we can use precision and recall. Let's say the confusion matrix for model B turned out to be the following.

Confusion Matrix of Model B
Predicted \ Actual | Setosa | Not Setosa
Setosa             | 30     | 210
Not Setosa         | 70     | 190

The precision and recall of model B can be calculated as:

\text{Precision} = \frac{TP}{TP+FP} = \frac{30}{30+210} = 0.125
\text{Recall} = \frac{TP}{TP+FN} = \frac{30}{30+70} = 0.3

As we can see from the above equations, precision is the ratio of true positives over the data points predicted to be positive, and recall is the ratio of true positives over the actual positives. Practically, we can use precision when we want to prioritize avoiding false positives and recall when we want to prioritize avoiding false negatives.
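
As a quick sketch (plain arithmetic, not part of the original post), model B's precision and recall can be computed directly from the counts in its confusion matrix:

# Counts read off model B's confusion matrix
tp = 30   # predicted Setosa, actually Setosa
fp = 210  # predicted Setosa, actually not Setosa
fn = 70   # predicted not Setosa, actually Setosa
 
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)  # => 0.125 0.3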

For example, if we are evaluating the model to classify if a patient has COVID or not, we don't want false negatives where patients go outside with COVID and spread the virus, while we can tolerate false positives as the consequence is just patients having to spend more time at home. Hence, we can use recall to evaluate such a classification model.

On the other hand, if we are evaluating the model to classify if an email is spam or not, we don't want false positives where important emails are classified as spam and taken down, whereas we can tolerate false negatives as the consequence is just seeing some spam emails occasionally and removing them from the thread. In this scenario, we can use precision for evaluating the model.

F1-Score

In many situations, the consequences of false positives and false negatives are not meaningfully different. In such scenarios, we can take the harmonic mean of precision and recall to compute the F1-score.

\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}}

Why do we take the harmonic mean instead of the arithmetic mean? It is because the harmonic mean requires both precision and recall to be reasonably high for the F1-score to be high. Suppose precision is 1.0 and recall is 0.1. The arithmetic mean is \frac{1 + 0.1}{2} = 0.55, while the harmonic mean, or F1-score, is \frac{2 \cdot 1 \cdot 0.1}{1 + 0.1} \approx 0.18. We can observe that a small recall has a greater impact on the harmonic mean than on the arithmetic mean. Thus, we use the harmonic mean for the F1-score to check that both precision and recall are high.
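
Here is a tiny sketch of that comparison, using the same assumed values of 1.0 and 0.1:

precision, recall = 1.0, 0.1
 
arithmetic_mean = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
 
print(arithmetic_mean)  # => 0.55
print(f1)               # => 0.1818... (approximately 0.18)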

Macro and Weighted Average F1-Score

So far, we have been talking about evaluating binary classification models. How do we compute these metrics for a multiclass classification model? It is pretty simple: we compute the precision, recall, and F1-score for each class by treating all the other classes as negative. Let's use the example of model C trained on the Iris dataset to see how it works.

Confusion Matrix of Model C
Predicted \ Actual | Setosa | Versicolor | Virginica
Setosa             | 30     | 40         | 50
Versicolor         | 30     | 10         | 20
Virginica          | 40     | 50         | 30

First, we can generate three confusion matrices, one for each species, as below.

Confusion Matrix of Model C: Setosa
Predicted \ Actual | Setosa | Not Setosa
Setosa             | 30     | 90
Not Setosa         | 70     | 110

Confusion Matrix of Model C: Versicolor
Predicted \ Actual | Versicolor | Not Versicolor
Versicolor         | 10         | 50
Not Versicolor     | 90         | 150

Confusion Matrix of Model C: Virginica
Predicted \ Actual | Virginica | Not Virginica
Virginica          | 30        | 90
Not Virginica      | 70        | 110

Using that, we can calculate F1-scores for each species as follows.

\text{F1-score}_{\text{Setosa}} = 2 \cdot \frac{\frac{30}{30+90} \cdot \frac{30}{30+70}}{\frac{30}{30+90}+\frac{30}{30+70}} = 2 \cdot \frac{0.25 \cdot 0.3}{0.25+0.3} \approx 0.272
\text{F1-score}_{\text{Versicolor}} = 2 \cdot \frac{\frac{10}{10+50} \cdot \frac{10}{10+90}}{\frac{10}{10+50}+\frac{10}{10+90}} = 2 \cdot \frac{0.167 \cdot 0.1}{0.167+0.1} \approx 0.125
\text{F1-score}_{\text{Virginica}} = 2 \cdot \frac{\frac{30}{30+90} \cdot \frac{30}{30+70}}{\frac{30}{30+90}+\frac{30}{30+70}} = 2 \cdot \frac{0.25 \cdot 0.3}{0.25+0.3} \approx 0.272
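
The same per-class scores can be reproduced with a short NumPy sketch (assuming, as in the tables above, that the predicted classes are on the rows and the actual classes are on the columns):

import numpy as np
 
# Model C's confusion matrix: rows = predicted, columns = actual
# (order: Setosa, Versicolor, Virginica)
cm = np.array([[30, 40, 50],
               [30, 10, 20],
               [40, 50, 30]])
 
tp = np.diag(cm)                 # true positives for each class
precision = tp / cm.sum(axis=1)  # row sums = number predicted as each class
recall = tp / cm.sum(axis=0)     # column sums = number actually in each class
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # => approximately [0.272 0.125 0.272]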

To arrive at a single value for evaluating a multiclass classification model, we can use the macro average F1-score, where we simply take the arithmetic mean of the above F1-scores.

\text{macro-avg F1-score} = \frac{0.272+0.125+0.272}{3} \approx 0.223

However, if the dataset is imbalanced, we may want the classes with more data to count more toward the average. In such a situation, we can use the weighted average F1-score, where we weight each class by the ratio of the number of data points in that class to the total number of data points and take the weighted sum of the F1-scores.

\text{weighted-avg F1-score} = \frac{100}{300} \cdot 0.272+\frac{100}{300} \cdot 0.125+\frac{100}{300} \cdot 0.272 \approx 0.223

In this case, the test dataset was perfectly balanced, so there is no difference in value between the macro average and the weighted average F1-score.
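
To double-check with scikit-learn, f1_score accepts average="macro" and average="weighted". The sketch below (not code from the original articles) rebuilds label arrays consistent with model C's confusion matrix:

import numpy as np
from sklearn.metrics import f1_score
 
# Model C's confusion matrix: rows = predicted, columns = actual
# (0 = Setosa, 1 = Versicolor, 2 = Virginica)
cm = np.array([[30, 40, 50],
               [30, 10, 20],
               [40, 50, 30]])
 
# Rebuild label arrays that produce exactly these counts
y_pred = np.repeat([0, 1, 2], cm.sum(axis=1))
y_true = np.concatenate([np.repeat([0, 1, 2], row) for row in cm])
 
print(f1_score(y_true, y_pred, average="macro"))     # => 0.223...
print(f1_score(y_true, y_pred, average="weighted"))  # => 0.223...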

Code Implementation

Let's see how we can use the above metrics on LogisticRegressionGD and SoftmaxRegressionGD that we defined and trained.

LogisticRegressionGD

First, let's see how it works on LogisticRegressionGD. Before we compute the metrics, we need to first obtain the model's predictions on the test dataset.

pred = lr.predict(X_test)  # predicted probabilities of being Setosa
 
pred = np.round(pred)  # round with a 0.5 threshold to turn probabilities into class labels

One important thing to remember is that the model predicts the probability of an iris being Setosa, ranging from 0 to 1. We need to round it to apply a 0.5 threshold and perform classification. Then, we can compute a confusion matrix using confusion_matrix and display it with ConfusionMatrixDisplay from sklearn.metrics.

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
 
cm = confusion_matrix(y_test, pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

If you run the above, you should see something like this.

[Figure: confusion matrix of LogisticRegressionGD on the test dataset]

You can see that the model classified the test dataset perfectly and has no false positives or false negatives. We can compute the accuracy, precision, recall, and F1-score using the predefined functions provided by sklearn.metrics.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
 
print(f"Accuracy: {accuracy_score(y_test, pred)}")
print(f"Precision: {precision_score(y_test, pred)}")
print(f"Recall: {recall_score(y_test, pred)}")
print(f"F1 Score: {f1_score(y_test, pred)}")
# =>
# Accuracy: 1.0
# Precision: 1.0
# Recall: 1.0
# F1 Score: 1.0

SoftmaxRegressionGD

The procedure is basically the same for SoftmaxRegressionGD. Let's start by making predictions first.

pred = sm.predict(X_test)
 
# One-hot encoded vector -> Index
pred = np.argmax(pred, axis=1)
y_test = np.argmax(y_test, axis=1)

In the case of softmax regression, we encoded the species as one-hot vectors, and the model predicts probability distributions over the classes. Therefore, we need to take the argmax to get back to the class indices. Let's plot the confusion matrix.

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
 
cm = confusion_matrix(y_test, pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

If you run the above, you should see something like this.

[Figure: confusion matrix of SoftmaxRegressionGD on the test dataset]

From the confusion matrix, we can see that the model misclassified 5 Setosa as Versicolor. Although we could use f1_score and the other functions we used above to obtain the F1-score and other metrics, it is more convenient to use classification_report from sklearn.metrics for multiclass classification.

from sklearn.metrics import classification_report
 
print(classification_report(y_test, pred))
 
# =>
#               precision    recall  f1-score   support
 
#            0       1.00      0.67      0.80        15
#            1       0.81      1.00      0.90        22
#            2       1.00      1.00      1.00        13
 
#     accuracy                           0.90        50
#    macro avg       0.94      0.89      0.90        50
# weighted avg       0.92      0.90      0.90        50

From the classification report, we can observe all the metrics: precision, recall, and F1-score for each class, the macro average F1-score, the weighted average F1-score, and so on. We can also observe that the macro and weighted averages of both precision and recall differ slightly because of the slight class imbalance in the test dataset.
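
If you need these numbers programmatically rather than as printed text, classification_report also accepts output_dict=True, which returns the same values as a nested dictionary. Continuing from the snippet above:

report = classification_report(y_test, pred, output_dict=True)
 
# Per-class metrics are keyed by the class label as a string
print(report["0"]["precision"])            # precision of class 0 (Setosa)
print(report["macro avg"]["f1-score"])     # macro average F1-score
print(report["weighted avg"]["f1-score"])  # weighted average F1-score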

Conclusion

As the article is getting super long, we will call it a day. We covered a few major metrics we can use for evaluating classification models, but there are many other metrics we can use, like ROC curve, AUC, and so on. It is important to choose the right metrics depending on the task. In many cases, there are standard metrics for that specific task used by many, so you just need to use the same ones that others are using. However, if you are tackling a new challenge, you will need to choose from the various metrics or might need to invent one. Regardless, be sure to understand what the metrics do and what they are made for when choosing metrics.