This blog post discusses GloVe in the context of deep learning.

Latent Semantic Analysis
In Word2Vec, we create a dataset containing the token of the central word, the tokens of context words, and their labels for each row. We can also combine these results to create a window-based co-occurrence matrix, where the rows are central words, the columns are context words, and the values are the co-occurrence counts. Then, we can apply Singular Value Decomposition (SVD) on the co-occurrence matrix to extract word embeddings or their low-rank approximation. This technique is called Latent Semantic Analysis (LSA) and can capture word similarity.
Here, $X = U \Sigma V^\top$, where $X$ is the co-occurrence matrix, $U$ contains the left singular vectors (word vectors), $\Sigma$ contains the singular values, and $V$ contains the right singular vectors (context word vectors). Both $U$ and $V$ are orthogonal, and $\Sigma$ is a diagonal matrix whose rank is the same as the rank of $X$. (For more details regarding SVD, check out Singular Value Decomposition : Data Science Basics by ritvikmath.) To obtain $U$, we can multiply $X$ by its own transpose as follows:

$$X X^\top = U \Sigma V^\top V \Sigma^\top U^\top = U \Sigma^2 U^\top$$

Since $V$ is orthogonal and $\Sigma$ is a diagonal matrix, $V^\top V$ becomes an identity matrix, and $\Sigma \Sigma^\top$ becomes $\Sigma^2$. Thus, $U$ and $\Sigma$ correspond to the eigenvectors and the square roots of the eigenvalues in the eigendecomposition of $X X^\top$. Therefore, we can obtain $U$ by solving this eigendecomposition. (For more details regarding eigendecomposition, check out Eigendecomposition : Data Science Basics by ritvikmath.) LSA does not require training a neural network and leverages global statistics to efficiently create word embeddings that capture syntactic and semantic similarities between words. However, the method can place disproportionate emphasis on high-frequency words and cannot capture certain nuances that Word2Vec successfully captures.
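To make this concrete, here is a minimal sketch of LSA with NumPy. The tiny co-occurrence matrix and the choice of rank k are made up purely for illustration; for a real vocabulary you would build the matrix as in the Code Implementation section and likely use a truncated SVD routine such as scipy.sparse.linalg.svds.

import numpy as np

# Toy window-based co-occurrence matrix X (rows: central words, columns: context words).
# The counts below are illustrative only.
X = np.array([
    [0., 2., 1., 0.],
    [2., 0., 3., 1.],
    [1., 3., 0., 2.],
    [0., 1., 2., 0.],
])

# Full SVD: X = U @ diag(S) @ Vt, with U and Vt orthogonal and S the singular values.
U, S, Vt = np.linalg.svd(X)

# Keep only the top-k components for a rank-k approximation.
k = 2
word_vectors = U[:, :k] * S[:k]        # low-dimensional word embeddings
context_vectors = Vt[:k, :].T * S[:k]  # low-dimensional context embeddings

# Best rank-k reconstruction of X in the least-squares sense.
X_approx = (U[:, :k] * S[:k]) @ Vt[:k, :]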
GloVe
GloVe aims to combine Word2Vec's ability to capture subtle linguistic nuances with LSA's efficient use of global statistics by performing matrix factorization on a window-based co-occurrence matrix. Matrix factorization seeks to capture latent representations of the rows and columns, mostly in lower dimensions, by approximating the original matrix as the product of smaller matrices for rows and columns.
Matrix factorization typically uses learning algorithms like gradient descent to learn the best latent representations of the word vectors $W$ and the context word vectors $\tilde{W}$, and this technique is often used in recommender systems (I might discuss recommender systems later in this series). In GloVe, matrix factorization is performed on the window-based co-occurrence matrix $X$ ($X \approx W \tilde{W}^\top$) to obtain the word embeddings $W$. (In this case, both $W$ and $\tilde{W}$ share the same shape and represent words in a latent space, and it has been empirically shown that adding them, $W + \tilde{W}$, produces good word embeddings.)
$$J = \frac{1}{2} \sum_{i, j} f(X_{ij}) \left( w_i^\top \tilde{w}_j - \log X_{ij} \right)^2$$

The above is the objective function we use for matrix factorization. It might look complicated, but it's quite simple. We aim to make the dot product of the word vector and the context word vector ($w_i^\top \tilde{w}_j$) approximate the logarithm of the co-occurrence count $X_{ij}$, using a squared loss for every word pair $(i, j)$, with the weight $f(X_{ij})$ to handle low- and high-frequency words.
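As a quick sanity check of the objective, here is a minimal NumPy sketch that evaluates the weighted squared loss for a single word pair. The vectors and the count are made-up values, and the cutoff of 100 mirrors the clipping used in the implementation below.

import numpy as np

# Illustrative word and context vectors for one pair (i, j).
w_i = np.array([0.1, -0.3, 0.2])
w_j = np.array([0.4, 0.1, -0.2])
x_ij = 15.0  # co-occurrence count of the pair

# Weighting function: grows with the count and is capped at 1 once the count reaches 100.
f = min(x_ij, 100.0) / 100.0

# Weighted squared error between the dot product and the log count.
loss = 0.5 * f * (np.dot(w_i, w_j) - np.log(x_ij)) ** 2
print(loss)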
Code Implementation
To implement GloVe, we need to create a window-based co-occurrence matrix, which can be done efficiently using the following method. (All previous steps, such as text preprocessing and tokenization, remain the same as those used in the previous article.)
from collections import defaultdict

def create_cooccurrence_matrix(tokenized_corpus, window_size=5):
    # Maps (word, context_word) pairs to their co-occurrence counts.
    co_occurrence = defaultdict(float)
    for i, word in enumerate(tokenized_corpus):
        # Context window around the central word, clipped at the corpus boundaries.
        start = max(0, i - window_size)
        end = min(len(tokenized_corpus), i + window_size + 1)
        for j in range(start, end):
            if i != j:
                context_word = tokenized_corpus[j]
                co_occurrence[(word, context_word)] += 1
    return co_occurrence
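For example, on a tiny hand-made token sequence the method produces pair counts like the following. The toy corpus and window size are only for illustration; in the actual pipeline the tokens are whatever the tokenizer from the previous article produces.

toy_corpus = ["the", "cat", "sat", "on", "the", "mat"]
co_occurrence = create_cooccurrence_matrix(toy_corpus, window_size=2)
print(co_occurrence[("cat", "sat")])  # 1.0
print(co_occurrence[("the", "sat")])  # 2.0 ("sat" falls inside the window of both occurrences of "the")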
From the co-occurrence matrix, we can create the training data as follows.
import numpy as np

def create_training_data(co_occurrence):
    # Unpack the (word, context_word) -> count mapping into parallel arrays.
    words = []
    contexts = []
    counts = []
    for (word, context_word), count in co_occurrence.items():
        words.append(word)
        contexts.append(context_word)
        counts.append(count)
    return np.array(words), np.array(contexts), np.array(counts, dtype=np.float32)
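Continuing the toy corpus from above, the resulting training arrays contain one row per co-occurring pair:

words, contexts, counts = create_training_data(
    create_cooccurrence_matrix(["the", "cat", "sat", "on", "the", "mat"], window_size=2))
print(words[:3], contexts[:3], counts[:3])
# e.g. ['the' 'the' 'cat'] ['cat' 'sat' 'the'] [1. 2. 1.]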
(The creation of datasets for TensorFlow and PyTorch is abbreviated here, as it is covered in the previous article.) Then, we can build our GloVe model with the objective function described above and train it. Below is the TensorFlow implementation of GloVe.
import tensorflow as tf
from tensorflow.keras import layers

class GloVe(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim):
        super(GloVe, self).__init__()
        # Embedding table for central words.
        self.word_embedding = layers.Embedding(vocab_size,
                                               embedding_dim,
                                               embeddings_initializer="glorot_normal",
                                               embeddings_regularizer="l2")
        # Embedding table for context words.
        self.context_embedding = layers.Embedding(vocab_size,
                                                  embedding_dim,
                                                  embeddings_initializer="glorot_normal",
                                                  embeddings_regularizer="l2")

    def call(self, pair):
        word, context = pair
        word_emb = self.word_embedding(word)
        context_emb = self.context_embedding(context)
        # Dot product between the word vector and the context vector.
        dots = tf.reduce_sum(word_emb * context_emb, axis=-1)
        return dots

def custom_loss(y_true, y_pred):
    # Clip the counts to avoid log(0) and to cap the weighting at 100.
    y_true = tf.clip_by_value(y_true, clip_value_min=1e-5, clip_value_max=100)
    # Weighting function f(X_ij) = min(X_ij, 100) / 100.
    f = y_true / 100
    log_y_true = tf.math.log(y_true)
    # Weighted squared error between the dot product and the log co-occurrence count.
    return 0.5 * f * tf.math.square(y_pred - log_y_true)
embedding_dim = 1024
vocab_size = len(tokens)

glove = GloVe(vocab_size, embedding_dim)
glove.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss=custom_loss)
glove.fit(dataset, epochs=15)
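Once training finishes, one way to inspect the result is to pull the learned embedding matrix out of the model and compare words with cosine similarity. This is a minimal sketch that assumes a word2idx mapping from tokens to the integer ids used during training (built during tokenization in the previous article) and that the example words exist in the vocabulary; it only uses the word embedding table, leaving the word-plus-context variant as the challenge below.

import numpy as np

# Learned word embedding matrix of shape (vocab_size, embedding_dim).
embedding_matrix = glove.word_embedding.get_weights()[0]

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

# word2idx is assumed to map tokens to integer ids; the words are placeholders.
vec_king = embedding_matrix[word2idx["king"]]
vec_queen = embedding_matrix[word2idx["queen"]]
print(cosine_similarity(vec_king, vec_queen))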
You might observe that training takes significantly less time thanks to the use of global statistics, while the word embeddings produced by GloVe turn out to be just as expressive as those produced by Word2Vec. As a challenge, you might consider adding a function that creates word embeddings by summing the word embedding and the context embedding, and implementing the model in PyTorch.
Conclusion
In this article, we covered Latent Semantic Analysis (LSA) as an alternative approach to creating word embeddings, and discussed its benefits and drawbacks compared to Word2Vec, which led to the motivation behind GloVe. There are other alternatives, such as FastText, that you might want to explore if you're interested. Now that we have word embeddings, we are ready to build language models.
Resources
- Stanford University School of Engineering. 2017. Lecture 3 | GloVe: Global Vectors for Word Representation. YouTube.