This blog post discusses BERT and GPT in deep learning.

Transformer
The basic transformer architecture consists of an encoder, which contextualizes the inputs, and a decoder, which produces outputs from the encoded inputs and its own previous outputs. If you are unfamiliar with the encoder-decoder architecture, check out the article Road to ML Engineer #17 - Autoencoders. The encoder and decoder can be constructed from various layers we have covered so far, such as embedding, positional encoding, multi-headed self-attention, cross-attention, layer normalization, and dense layers.

The cross-attention layer in the middle of the decoder uses keys and values from the encoder to transform the embeddings based on the inputs. The initial input to the decoder is either the <SOS> (start of sentence) or <EOS> (end of sentence) token, and each newly generated token is appended to the right of the previous input for the next iteration. Since the input and output sequences can differ, this architecture can perform tasks like machine translation, question answering, and next-word prediction.
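As a rough illustration of how cross-attention consumes the encoder output, here is a minimal sketch; the class name CrossAttention and the shapes are illustrative, and Keras's built-in layers.MultiHeadAttention is used here for brevity rather than the custom attention layer from this series.

import tensorflow as tf
from tensorflow.keras import layers

class CrossAttention(layers.Layer):
    # Queries come from the decoder; keys and values come from the encoder output.
    def __init__(self, embed_dim, num_heads=8):
        super().__init__()
        self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.layer_norm = layers.LayerNormalization()

    def call(self, decoder_in, encoder_out):
        # Each decoder position attends over every encoder position.
        a = self.attention(query=decoder_in, key=encoder_out, value=encoder_out)
        return self.layer_norm(decoder_in + a)  # residual connection + normalization

# Example shapes: batch of 2, source length 10, target length 7, embedding size 64.
encoder_out = tf.random.normal((2, 10, 64))
decoder_in = tf.random.normal((2, 7, 64))
print(CrossAttention(64)(decoder_in, encoder_out).shape)  # (2, 7, 64)

Because the queries come from the decoder while the keys and values come from the encoder, the output length matches the target sequence rather than the source.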
BERT
Although the above transformer architecture is powerful and versatile, we do not always need an encoder-decoder architecture. For example, we might want to perform document classification, which does not require the output to be a sequence. One alternative architecture is Bidirectional Encoder Representations from Transformers (BERT), which uses only the encoder to contextualize embeddings for tasks like document classification, sentiment analysis, and more.
To train BERT on a massive text corpus for appropriate contextualization of embeddings, we mask random words in a sentence and make the model predict the missing words based on the context. This context can be read from either the left or the right side of a missing word, hence the name "Bidirectional Encoder Representations from Transformers". The transformed embeddings from BERT can be passed to a feed-forward neural network (FNN) to perform various tasks, including next sentence prediction.
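To make the masked-word objective concrete, here is a minimal sketch of how random masking could be applied to token IDs before they are fed to the model; the 15% masking rate and the MASK_ID constant are assumptions for illustration, not values from this post.

import tensorflow as tf

MASK_ID = 1       # assumed ID reserved for the [MASK] token
MASK_RATE = 0.15  # commonly used masking rate; an assumption here

def mask_tokens(token_ids):
    # Randomly choose roughly 15% of positions to mask.
    mask = tf.random.uniform(tf.shape(token_ids)) < MASK_RATE
    # Replace the chosen positions with the [MASK] token ID.
    masked_inputs = tf.where(mask, tf.fill(tf.shape(token_ids), MASK_ID), token_ids)
    # The original IDs at the masked positions serve as the prediction targets.
    return masked_inputs, token_ids, mask

masked_inputs, targets, mask = mask_tokens(tf.constant([[5, 7, 12, 3, 9]]))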
import tensorflow as tf
from tensorflow.keras import layers

# MultiHeadedSelfAttention, LayerNormalization, and positional_encoding are the custom
# layers and functions covered earlier in this series; they are assumed to be defined already.

class TransformerEncoderBlock(layers.Layer):
    def __init__(self, embed_dim, hidden_dim, num_heads=8):
        super(TransformerEncoderBlock, self).__init__()
        self.attention = MultiHeadedSelfAttention(num_heads, embed_dim, hidden_dim)
        self.layer_norm1 = LayerNormalization(embed_dim)
        self.mlp = tf.keras.Sequential([
            layers.Dense(4 * embed_dim, activation="relu"),
            layers.Dense(embed_dim)
        ])
        self.layer_norm2 = LayerNormalization(embed_dim)

    def call(self, x):
        a = self.attention(x)        # self-attention over the full sequence
        x = self.layer_norm1(x + a)  # residual connection + layer norm
        m = self.mlp(x)              # position-wise feed-forward network
        x = self.layer_norm2(x + m)  # residual connection + layer norm
        return x

class BERT(tf.keras.Model):
    def __init__(self, vocab_size, seq_len, embed_dim, hidden_dim, num_layers=4, num_heads=8):
        super(BERT, self).__init__()
        self.word_embedding = layers.Embedding(vocab_size,
                                               embed_dim,
                                               embeddings_initializer="he_normal",
                                               embeddings_regularizer="l1",
                                               name="w2v_embedding")
        self.positional_encoding = positional_encoding(seq_len, embed_dim)
        self.encoder = tf.keras.Sequential([
            TransformerEncoderBlock(embed_dim, hidden_dim, num_heads)
            for _ in range(num_layers)
        ])
        self.classifier = tf.keras.Sequential([
            layers.Dense(embed_dim, activation="relu"),
            layers.Dense(vocab_size, activation="softmax")  # per-position word probabilities
        ])

    def call(self, x):
        x = self.word_embedding(x)     # token IDs -> embeddings
        x += self.positional_encoding  # add positional information
        x = self.encoder(x)            # contextualize with stacked encoder blocks
        x = self.classifier(x)         # predict the masked words
        return x

bert = BERT(2000, 1024, 1024, 128)
The example above is a TensorFlow implementation of a transformer encoder block and BERT. This example BERT model is relatively small, with a vocabulary size of 2000, a sequence length of 1024, an embedding size of 1024, and a hidden dimension of 128, yet it still has roughly 80 million parameters. Thus, training can take days to complete, even with GPUs and TPUs. After training BERT to predict randomly masked words, its word embeddings and encoder can produce contextual embeddings for a new classifier to perform various tasks.
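As a sketch of that reuse, the snippet below keeps the trained embeddings and encoder from the bert model above and attaches a fresh classification head; the mean pooling over positions and the two-class head are illustrative assumptions, not part of the original implementation.

class BERTClassifier(tf.keras.Model):
    def __init__(self, pretrained_bert, num_classes=2):
        super().__init__()
        self.word_embedding = pretrained_bert.word_embedding          # reuse trained embeddings
        self.positional_encoding = pretrained_bert.positional_encoding
        self.encoder = pretrained_bert.encoder                        # reuse trained encoder blocks
        self.head = layers.Dense(num_classes, activation="softmax")   # new task-specific head

    def call(self, x):
        x = self.word_embedding(x)
        x += self.positional_encoding
        x = self.encoder(x)
        x = tf.reduce_mean(x, axis=1)  # pool contextual embeddings over the sequence
        return self.head(x)

classifier = BERTClassifier(bert)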
GPT
Another transformer architecture is the Generative Pre-trained Transformer (GPT), a decoder-only model built specifically for next-word generation. The architecture is similar to BERT but uses only leftward context. During training, a causal mask is applied to the self-attention: the attention scores for positions to the right of each query (the future tokens) are set to a large negative value before the softmax, so their weights become effectively zero and GPT learns to predict each word strictly from the context to its left.
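The causal mask itself can be sketched as follows; the helper name causal_mask is mine, and the exact mechanics inside the MultiHeadedSelfAttention layer may differ from this sketch.

import tensorflow as tf

def causal_mask(seq_len):
    # Lower-triangular matrix of ones: position i is allowed to attend to positions 0..i.
    allowed = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
    # Blocked (future) positions get a large negative value so softmax gives them ~0 weight.
    return (1.0 - allowed) * -1e9

scores = tf.random.normal((4, 4))                 # raw attention scores for a length-4 sequence
weights = tf.nn.softmax(scores + causal_mask(4))  # each position attends only to itself and the left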
class TransformerDecoderBlock(layers.Layer):
    def __init__(self, embed_dim, hidden_dim, num_heads=8):
        super(TransformerDecoderBlock, self).__init__()
        # masked=True restricts each position to attend only to itself and positions to its left
        self.attention1 = MultiHeadedSelfAttention(num_heads, embed_dim, hidden_dim, masked=True)
        self.layer_norm1 = LayerNormalization(embed_dim)
        self.mlp = tf.keras.Sequential([
            layers.Dense(4 * embed_dim, activation="relu"),
            layers.Dense(embed_dim)
        ])
        self.layer_norm2 = LayerNormalization(embed_dim)

    def call(self, x):
        a = self.attention1(x)       # causal (masked) self-attention
        x = self.layer_norm1(x + a)  # residual connection + layer norm
        m = self.mlp(x)              # position-wise feed-forward network
        x = self.layer_norm2(x + m)  # residual connection + layer norm
        return x

class GPT(tf.keras.Model):
    def __init__(self, vocab_size, seq_len, embed_dim, hidden_dim, num_layers=4, num_heads=8):
        super(GPT, self).__init__()
        self.word_embedding = layers.Embedding(vocab_size,
                                               embed_dim,
                                               embeddings_initializer="he_normal",
                                               embeddings_regularizer="l1",
                                               name="w2v_embedding")
        self.positional_encoding = positional_encoding(seq_len, embed_dim)
        self.decoder = tf.keras.Sequential([
            TransformerDecoderBlock(embed_dim, hidden_dim, num_heads)
            for _ in range(num_layers)
        ])
        self.classifier = tf.keras.Sequential([
            layers.Dense(vocab_size, activation="softmax")  # next-word probabilities per position
        ])

    def call(self, x):
        x = self.word_embedding(x)     # token IDs -> embeddings
        x += self.positional_encoding  # add positional information
        x = self.decoder(x)            # contextualize with stacked (masked) decoder blocks
        x = self.classifier(x)         # predict the next word at each position
        return x

gpt = GPT(2000, 1024, 1024, 128)
The example above shows a TensorFlow implementation of the decoder block and GPT. Notice that the only visible difference from BERT is the use of masked attention. When training the model, we pass input sequences without the last token and target sequences without the first token so that the model learns to predict the next word from the leftward context. During inference, we take the predicted next word and append it to the input sequence, repeating until an <EOS> token appears. Similar to BERT, training this model with roughly 80 million parameters requires a significant amount of time and compute, but it can perform various tasks, including machine translation, at a high level depending on the tokenizer and implementation.
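A minimal sketch of this shifted-sequence setup and a greedy generation loop might look like the following; the token IDs, EOS_ID, and max_new_tokens are illustrative assumptions, and the loop assumes the model can handle the growing sequence length (for example, by padding inputs to seq_len or slicing the positional encoding).

# Training sketch: inputs drop the last token, targets drop the first.
tokens = tf.constant([[2, 15, 37, 8, 3]])   # illustrative token IDs; 3 plays the role of <EOS>
inputs, targets = tokens[:, :-1], tokens[:, 1:]

# Inference sketch: greedily append the predicted next token until <EOS> appears.
EOS_ID = 3
def generate(model, prompt_ids, max_new_tokens=20):
    seq = prompt_ids
    for _ in range(max_new_tokens):
        probs = model(seq)                                # (batch, length, vocab_size)
        next_id = tf.argmax(probs[:, -1, :], axis=-1, output_type=tf.int32)
        seq = tf.concat([seq, next_id[:, None]], axis=1)  # append prediction to the right
        if int(next_id[0]) == EOS_ID:                     # stop once <EOS> is generated
            break
    return seq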
Conclusion
In this article, we covered two popular transformer architectures for natural language processing: BERT and GPT. Both models tend to have millions or billions of parameters and can take days or even months to train, even with the best compute resources available, leading to discussions about their financial and environmental costs. However, their capabilities are unmatched by conventional models like RNNs, partly due to the inductive biases and parallelizability discussed in the previous article and the extensive research effort that has driven the evolution from GPT-1 to GPT-4.
This article series does not cover the implementation details of the multi-headed self-attention block, data preparation, or inference for BERT and GPT, nor PyTorch implementations of these models. I highly recommend building these pipelines yourself to familiarize yourself with BERT and GPT if you are interested in the field of machine learning and NLP. (If you want help with it, reach out to me. :)
Resources
- 3Blue1Brown. 2024. Attention in transformers, visually explained | Chapter 6, Deep Learning. YouTube.
- Muller, B. 2022. BERT 101 🤗 State Of The Art NLP Model Explained. Hugging Face.