This blog post introduces the concept of Principal Component Analysis (PCA) in machine learning.

Note: The article assumes you are familiar with the concepts of covariance and linear algebra (eigenvectors, eigenvalues, and projections). If you are not, I highly recommend checking out Covariance, Clearly Explained!!! by StatQuest and Essence of linear algebra by 3Blue1Brown.
Dimensionality Problem
In the last article, we covered the basics of a full machine learning pipeline, from model definition and model selection to regularization and model evaluation. However, we only worked on one dataset, the Iris dataset, which has only 150 samples and 4 features (dimensions). Hence, let's look at another, slightly more realistic (yet still not entirely realistic) example: the MNIST dataset.
The MNIST dataset is another widely used toy dataset, containing 60,000 training images (plus 10,000 test images) of handwritten digits. Let's see some examples of images in the MNIST dataset:
import keras
import matplotlib.pyplot as plt
# Download MNIST
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
# Display 10 samples
plt.figure(figsize=(10, 4))
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X_train[i], cmap='gray')
    plt.axis('off')
plt.tight_layout()
plt.show()

While the above images are quite blurry, each is still 28 by 28 pixels, or 784 pixels in total. We can treat every individual pixel as a feature and train a softmax regression model to classify the digits, roughly sketched below.
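To make the idea concrete, here is a rough sketch of such a model. It uses scikit-learn's LogisticRegression (which fits a softmax / multinomial model for multi-class problems) as a stand-in for a hand-rolled softmax regression; the 5,000-image subset and the X_flat name are arbitrary choices made here to keep the example fast.
from sklearn.linear_model import LogisticRegression

# Flatten each 28x28 image into a 784-dimensional feature vector
# and rescale pixel values to [0, 1]
X_flat = X_train.reshape(X_train.shape[0], -1) / 255.0

# Train a softmax (multinomial logistic) regression model on a small subset;
# the full 60,000 images take considerably longer
clf = LogisticRegression(max_iter=100)
clf.fit(X_flat[:5000], y_train[:5000])

# Evaluate on a held-out slice of the training data
print("Held-out accuracy: ", clf.score(X_flat[5000:6000], y_train[5000:6000]))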
Although it is not impossible to train such a model, especially with vectorized operations, we can expect it to take a long time to train on 60,000 images of that size. In more realistic image classification settings, we might encounter images of size 512 by 512 pixels, or 262,144 pixels per image, which is not tractable for a simple model. (The cover image of this article is 4051 by 2408 pixels!)
There are many other types of data that have high dimensions, such as audio, video, and gene data. Hence, we need some way of compressing those high-dimensional data down to low-dimensional data, or dimensionality reduction, so that we can train our models on them within limited time and computational resources.
Principal Component Analysis (PCA)
One way of performing such dimensionality reduction is Principal Component Analysis (PCA). In short, PCA takes the covariance matrix of the data, computes its eigenvectors, and uses the few eigenvectors with the highest eigenvalues as a new basis onto which the data is projected. For example, two principal components can be used to reduce the dimensionality of MNIST from 784 down to 2.
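As a sketch of what this looks like in code (separate from the pipeline built later in this article; the X_sub / y_sub names and the 5,000-sample subset are arbitrary choices to keep the plot readable), we can project a portion of MNIST onto two principal components and visualize it:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Flatten the images and keep a subset so the scatter plot stays readable
X_sub = X_train.reshape(X_train.shape[0], -1)[:5000]
y_sub = y_train[:5000]

# Project the 784 pixel dimensions down to 2 principal components
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_sub)

# Scatter plot of the 2-D projection, colored by digit label
plt.figure(figsize=(6, 5))
points = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_sub, cmap='tab10', s=5)
plt.colorbar(points, label='digit')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()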
Conceptually, the eigenvectors of the covariance matrix with the highest eigenvalues are the directions in which the data varies the most, so projecting onto them captures the linear relationships between the dimensions as much as possible. However, you might still be wondering why it makes sense mathematically, so I would like to go over the maths as well.
Maths behind PCA
Let's only think about 2 pixels and 4 images at first. Each image can then be expressed as a 2-D vector, and we want to reduce the dimension by finding a 1-D line that preserves the distances between those vectors as much as possible. We can observe pictorially that this happens when the spread of the vectors' projections onto the line, i.e. the variance of the projections, is the largest.

The orange line loses information about the distances between the vectors as the points are squished together, while the green line, which maximizes the variance of projections, captures the distances in both dimensions as much as possible. Hence, our task is to find the unit vector that leads to the maximum variance of projections. The variance of projections can be expressed as follows:
$$\mathrm{Var}(\text{projections}) = \frac{1}{n}\sum_{i=1}^{n}\left(w^{T}x_{i} - w^{T}\bar{x}\right)^{2},$$

where $\bar{x}$ is the mean of the data vectors $x_i$ and $w$ is the unit vector defining the line. We can rearrange the above equation as follows:

$$\frac{1}{n}\sum_{i=1}^{n}\left(w^{T}x_{i} - w^{T}\bar{x}\right)^{2} = w^{T}\left[\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\bar{x})(x_{i}-\bar{x})^{T}\right]w.$$

From the above, we can recognize the equation for the covariance matrix ($\Sigma = \frac{1}{n}\sum_{i=1}^{n}(x_{i}-\bar{x})(x_{i}-\bar{x})^{T}$), which we can substitute to get $w^{T}\Sigma w$. Also, using a Lagrange multiplier to satisfy the condition of the line being a unit vector ($w^{T}w = 1$), the following function must be maximized:

$$L(w, \lambda) = w^{T}\Sigma w - \lambda\left(w^{T}w - 1\right).$$

Taking the derivative of the above with respect to $w$ and setting it equal to 0, we get:

$$\frac{\partial L}{\partial w} = 2\Sigma w - 2\lambda w = 0 \quad\Longrightarrow\quad \Sigma w = \lambda w.$$

Yes, it turns out that the vectors $w$ that maximize the variance of projections are eigenvectors of the covariance matrix $\Sigma$. How do we choose which eigenvectors to use? We can rearrange the above equation one more time:

$$w^{T}\Sigma w = w^{T}(\lambda w) = \lambda\, w^{T}w = \lambda.$$
We can observe from the above equation that the higher the eigenvalues associated with those eigenvectors are, the higher the variance of projections becomes. Since we want to choose the eigenvectors that lead to the highest variance, we should select the ones with the highest eigenvalues. (If you happen to have better mathematical derivations for PCA or a more intuitive way to think about PCA, please contact me.)
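To connect the derivation to code, here is a small from-scratch sketch (for illustration only; the actual implementation used below relies on sklearn). The pca_from_scratch function and the toy data are made up for this example: it centers the data, builds the covariance matrix, takes its eigenvectors, and projects onto the top-k of them.
import numpy as np

def pca_from_scratch(X, k):
    # Center the data (subtract the mean of each feature)
    X_centered = X - X.mean(axis=0)
    # Covariance matrix (features x features)
    cov = np.cov(X_centered, rowvar=False)
    # Eigendecomposition; eigh is used because the covariance matrix is symmetric
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Sort eigenvectors by descending eigenvalue and keep the top k
    order = np.argsort(eigenvalues)[::-1][:k]
    components = eigenvectors[:, order]
    # Project the centered data onto the chosen eigenvectors
    return X_centered @ components

# 4 "images" with 2 "pixels" each (as in the example above), reduced to 1 dimension
X_toy = np.array([[2.0, 1.9], [1.0, 1.1], [3.0, 3.2], [4.0, 3.8]])
print(pca_from_scratch(X_toy, 1))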
Code Implementation
Luckily, sklearn provides a PCA class that does all of the above automatically.
from sklearn.decomposition import PCA
def performPCA(X_train, X_test, n_components=0.95):
    # find pca
    pca = PCA(n_components=n_components)
    pca.fit(X_train)
    # transform data into pca
    X_train = pca.transform(X_train)
    X_test = pca.transform(X_test)
    return X_train, X_test
When the n_components parameter is a float between 0 and 1 (here 0.95), it determines the number of principal components automatically: components are added until the cumulative sum of their eigenvalues (the explained variance) reaches that fraction of the total across all components. In other words, the data after dimensionality reduction with PCA will preserve 95% of the original variance. Let's apply this function to the MNIST dataset.
# (, 28, 28) -> (, 784)
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1]*X_train.shape[2])
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1]*X_test.shape[2])
print("X_train's shape before PCA: ", X_train.shape)
print("X_test's shape before PCA: ", X_test.shape)
# Perform dimensionality reduction with PCA
X_train, X_test = performPCA(X_train, X_test)
print("X_train's shape after PCA: ", X_train.shape) # (, 154)
print("X_test's shape after PCA: ", X_test.shape) # (, 154)
After dimensionality reduction, we get 154 dimensions, which is a much easier number of dimensions to work with. It also turns out that a nearest neighbor algorithm run on the reduced data (see the sketch below) can still classify the digits well. Note that PCA itself is an unsupervised learning algorithm: it learns the underlying relationships in the data without ever using the class labels.
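As a quick sketch of that idea (the choice of n_neighbors=3 and scikit-learn's KNeighborsClassifier are just one possible setup, not a tuned model), a nearest-neighbor classifier can be fit directly on the reduced features:
from sklearn.neighbors import KNeighborsClassifier

# Fit a nearest-neighbor classifier on the PCA-reduced training data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Evaluate on the PCA-reduced test data
print("Test accuracy: ", knn.score(X_test, y_test))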
As a challenge, I recommend you fit SoftmaxRegressionGD on the MNIST dataset after PCA. Also, there are many other dimensionality reduction / unsupervised learning techniques such as LDA, t-SNE, and UMAP, so I recommend checking them out if you are interested. (I might cover them in the future if requested.)
Resources
- ritvikmath. 2020. Principal Component Analysis (The Math): Data Science Concepts. YouTube.