Under the Hood of Large Language Models, Chapter 74

3. The Tiny GPT-Style Model


Now we arrive at the final assembly of our TinyGPT model - a compact decoder-only transformer that combines all the components we've built so far. This class ties together the token embedding, positional encoding, transformer blocks, and output projection layer into a complete language model.

The TinyGPT class represents a minimal but functional GPT-style architecture with these key features:

Modular design: Combines embedding, positional encoding, transformer blocks, and output projection

Configurable architecture: Customizable parameters for model dimensions, layers, heads, etc.

Weight tying: Shares weights between input embedding and output projection for parameter efficiency

Decoder-only approach: Uses only the decoder part of the transformer architecture (GPT-style)

The forward method shows the complete data flow through the model:

Convert token IDs to embeddings

Add positional information

Process through a series of transformer blocks

Apply final layer normalization

Project to vocabulary logits for next-token prediction

This architecture follows the same principles as much larger models like GPT-2/3/4, but at a more manageable scale for educational purposes.

```python
import torch.nn as nn

# SinusoidalPositionalEncoding and TransformerBlock are the components built earlier.


class TinyGPT(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_layers=4, n_heads=4, d_ff=1024, max_len=512, dropout=0.1):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos = SinusoidalPositionalEncoding(d_model, max_len)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Weight tying helps a bit on tiny setups
        self.tok_embed.weight = self.lm_head.weight

    def forward(self, idx):
        x = self.tok_embed(idx)               # [B,T,C]
        x = self.pos(x)
        for blk in self.blocks:
            x = blk(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)              # [B,T,V]
        return logits
```
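To step through this data flow in isolation, the sketch below wires up the same components by hand and prints the tensor shape after each stage. The `SinusoidalPositionalEncoding` and `TransformerBlock` classes here are simplified stand-ins written for this example (a pre-norm block with causal self-attention is assumed); they are not necessarily identical to the versions built earlier. The `trace_forward` helper and the small sizes are also illustrative choices.

```python
import math
import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Stand-in (assumed design): adds fixed sine/cosine position signals."""
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)                      # [max_len, 1]
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)                                # not a learned parameter

    def forward(self, x):
        return x + self.pe[: x.size(1)]                               # broadcast over the batch


class TransformerBlock(nn.Module):
    """Stand-in (assumed design): pre-norm block with causal self-attention."""
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.ln1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model), nn.Dropout(dropout)
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)

        def split(t):                                                 # [B,T,C] -> [B,heads,T,head_dim]
            return t.reshape(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))       # [B,heads,T,T]
        future = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        att = att.masked_fill(future, float("-inf")).softmax(dim=-1)  # hide future positions
        out = (self.drop(att) @ v).transpose(1, 2).reshape(B, T, C)
        x = x + self.drop(self.proj(out))
        return x + self.ff(self.ln2(x))


torch.manual_seed(0)
V, C, T = 100, 32, 8                                                  # small illustrative sizes
tok_embed = nn.Embedding(V, C)
pos = SinusoidalPositionalEncoding(C, max_len=16)
blocks = nn.ModuleList([TransformerBlock(C, n_heads=4, d_ff=64, dropout=0.0) for _ in range(2)])
ln_f = nn.LayerNorm(C)
lm_head = nn.Linear(C, V, bias=False)
tok_embed.weight = lm_head.weight                                     # weight tying, as in TinyGPT


def trace_forward(idx):
    """Mirror TinyGPT.forward step by step, printing shapes along the way."""
    x = tok_embed(idx)
    print("tok_embed:", tuple(x.shape))
    x = pos(x)
    print("pos:      ", tuple(x.shape))
    for i, blk in enumerate(blocks):
        x = blk(x)
        print(f"block {i}:  ", tuple(x.shape))
    x = ln_f(x)
    logits = lm_head(x)
    print("logits:   ", tuple(logits.shape))
    return logits


idx = torch.randint(0, V, (2, T))                                     # [B=2, T=8] token IDs
logits = trace_forward(idx)
```

Only the last step changes the trailing dimension, from the model width to the vocabulary size; every stage before it preserves the [B,T,C] shape.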

Here's a comprehensive breakdown of the TinyGPT class:

Class Definition:

TinyGPT is a PyTorch neural network module that implements a compact decoder-only transformer architecture (similar to GPT-style models). It inherits from PyTorch's nn.Module base class, which provides the foundation for all neural network modules in PyTorch.

Constructor Parameters:

vocab_size: Size of the vocabulary (number of unique tokens)

d_model: Dimension of the embedding vectors (default: 256)

n_layers: Number of transformer blocks (default: 4)

n_heads: Number of attention heads in each transformer block (default: 4)

d_ff: Dimension of the feed-forward network within transformer blocks (default: 1024)

max_len: Maximum sequence length supported (default: 512)

dropout: Dropout probability for regularization (default: 0.1)
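To get a feel for how these parameters drive model size, here is a rough parameter-count estimate. It is a sketch under stated assumptions: each transformer block is taken to hold four d_model × d_model attention projections with biases, a two-layer feed-forward network with biases, and two LayerNorms. The actual TransformerBlock may differ, so treat the result as approximate. The `estimate_params` helper is hypothetical and not part of the chapter's code.

```python
def estimate_params(vocab_size, d_model=256, n_layers=4, d_ff=1024):
    """Approximate learned-parameter count (assumed block layout, see lead-in)."""
    attn = 4 * d_model * d_model + 4 * d_model      # Q, K, V, output projections + biases
    ffn = 2 * d_model * d_ff + d_ff + d_model       # two linear layers + biases
    norms = 2 * (2 * d_model)                       # two LayerNorms (scale + shift each)
    per_block = attn + ffn + norms

    embed = vocab_size * d_model                    # counted once because of weight tying
    final_ln = 2 * d_model
    # Sinusoidal positional encodings are fixed, so they add no learned parameters.
    # n_heads does not appear: it splits d_model across heads without adding weights.
    return n_layers * per_block + embed + final_ln


total = estimate_params(vocab_size=1000)
print(f"approx. parameters: {total:,}")
```

Under these assumptions, the transformer blocks dominate at small vocabulary sizes, while the embedding term grows linearly with vocab_size.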

Component Initialization:

tok_embed: An embedding layer that converts token IDs to dense vectors of size d_model

pos: A SinusoidalPositionalEncoding layer that adds positional information to the embeddings

blocks: A ModuleList containing n_layers TransformerBlock instances, each with the specified parameters

ln_f: A final LayerNorm applied after all transformer blocks

lm_head: A linear layer that projects from d_model dimensions to vocab_size, producing logits for next-token prediction

Weight Tying:

The code ties the weights of the token embedding (tok_embed) and the output projection (lm_head) with the line: self.tok_embed.weight = self.lm_head.weight. Both weight matrices have shape [vocab_size, d_model], so a single tensor can serve both layers. This parameter-sharing technique removes vocab_size × d_model parameters from the model and often improves performance in language models, particularly small ones.
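The effect of tying can be checked directly on a standalone embedding and linear layer. This is a minimal sketch with illustrative sizes; the `nn.ModuleDict` wrapper is used here only because `parameters()` on a module counts a shared tensor once.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 256
tok_embed = nn.Embedding(vocab_size, d_model)             # weight: [vocab_size, d_model]
lm_head = nn.Linear(d_model, vocab_size, bias=False)      # weight: [vocab_size, d_model]

untied = tok_embed.weight.numel() + lm_head.weight.numel()

# Tie: both modules now hold the same Parameter object
tok_embed.weight = lm_head.weight

pair = nn.ModuleDict({"tok_embed": tok_embed, "lm_head": lm_head})
tied = sum(p.numel() for p in pair.parameters())          # shared tensor counted once

# An in-place change through one module is visible through the other
with torch.no_grad():
    lm_head.weight[0].zero_()
row_is_zero = bool((tok_embed.weight[0] == 0).all())

print(f"untied: {untied:,}  tied: {tied:,}  shared row zeroed: {row_is_zero}")
```

One consequence worth knowing: because the embedding's own weight is replaced, the shared tensor starts from nn.Linear's default initialization rather than nn.Embedding's.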

Forward Method:

The forward method defines the data flow through the model:

Takes token indices (idx) as input

Converts them to embeddings using tok_embed - resulting shape is [Batch, Time, Channels]

Adds positional information using the pos encoder

Sequentially processes the embeddings through each transformer block

Applies the final layer normalization (ln_f)

Projects to vocabulary logits using lm_head - resulting shape is [Batch, Time, Vocabulary]

Returns the logits for further processing (typically computing loss or generating predictions)
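For training, the returned logits are typically compared against the input sequence shifted by one position, so that the output at position t is scored on predicting token t+1. The sketch below shows one common way to do this with cross-entropy; the `next_token_loss` helper is hypothetical, and random logits stand in for a real model's output.

```python
import math
import torch
import torch.nn.functional as F


def next_token_loss(logits, idx):
    """Cross-entropy between position t's logits and the token at position t+1.

    logits: [B, T, V] model output, idx: [B, T] input token IDs.
    """
    pred = logits[:, :-1, :]                        # drop last position (no target for it)
    target = idx[:, 1:]                             # drop first token (nothing predicts it)
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))


B, T, V = 2, 8, 100
idx = torch.randint(0, V, (B, T))
logits = torch.randn(B, T, V)                       # stand-in for model(idx)

loss = next_token_loss(logits, idx)
uniform = next_token_loss(torch.zeros(B, T, V), idx)  # all-equal logits: uniform guess
print(f"loss: {loss.item():.3f}  uniform baseline: {uniform.item():.3f}  ln(V): {math.log(V):.3f}")
```

An untrained model should score near ln(V), the loss of a uniform guess over the vocabulary, which makes a handy sanity check before training starts.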

Architecture Significance:

This TinyGPT implementation represents a scaled-down version of modern decoder-only transformer architectures like GPT-2/3/4. Despite its simplicity, it contains all the essential components: token embeddings, positional encodings, self-attention mechanisms (via the TransformerBlock), and the final projection layer for next-token prediction.

The architecture follows a decoder-only approach, which means it's designed for autoregressive tasks like text generation where each token prediction depends only on previous tokens, not future ones.
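Autoregressive generation can be sketched as a loop that repeatedly feeds the sequence back in and appends the most likely next token. This is a minimal greedy-decoding example, not the chapter's own generation code: `greedy_generate` and `DummyLM` are hypothetical names, and `DummyLM` is only a placeholder with the same input/output shapes as TinyGPT (token IDs [B,T] in, logits [B,T,V] out).

```python
import torch
import torch.nn as nn


@torch.no_grad()
def greedy_generate(model, idx, max_new_tokens, max_len=512):
    """Append max_new_tokens tokens to idx, picking the argmax at each step."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -max_len:]                 # keep within the supported context length
        logits = model(idx_cond)                     # [B, T, V]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # last position only -> [B, 1]
        idx = torch.cat([idx, next_id], dim=1)       # the new token becomes part of the input
    return idx


class DummyLM(nn.Module):
    """Placeholder with TinyGPT's interface; it has no attention and is not a real LM."""
    def __init__(self, vocab_size=50, d_model=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        return self.head(self.emb(idx))


torch.manual_seed(0)
model = DummyLM()
prompt = torch.tensor([[1, 2, 3]])
out = greedy_generate(model, prompt, max_new_tokens=5)
print(out.shape)
```

Only the logits at the final position are used at each step, because earlier positions predict tokens that are already in the sequence. Sampling strategies such as temperature or top-k would replace the argmax line.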