Under the Hood of Large Language Models: Chapter 74

3. The Tiny GPT-Style Model


Now we arrive at the final assembly of our TinyGPT model - a compact decoder-only transformer that combines all the components we've built so far. This class ties together the token embedding, positional encoding, transformer blocks, and output projection layer into a complete language model.

The TinyGPT class represents a minimal but functional GPT-style architecture with these key features:

  • Modular design: Combines embedding, positional encoding, transformer blocks, and output projection
  • Configurable architecture: Customizable parameters for model dimensions, layers, heads, etc.
  • Weight tying: Shares weights between input embedding and output projection for parameter efficiency
  • Decoder-only approach: Uses only the decoder part of the transformer architecture (GPT-style)

The forward method shows the complete data flow through the model:

  1. Convert token IDs to embeddings
  2. Add positional information
  3. Process through a series of transformer blocks
  4. Apply final layer normalization
  5. Project to vocabulary logits for next-token prediction

This architecture follows the same principles as much larger models like GPT-2/3/4, but at a more manageable scale for educational purposes.

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_layers=4, n_heads=4, d_ff=1024, max_len=512, dropout=0.1):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos = SinusoidalPositionalEncoding(d_model, max_len)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Weight tying helps a bit on tiny setups
        self.tok_embed.weight = self.lm_head.weight

    def forward(self, idx):
        x = self.tok_embed(idx)               # [B,T,C]
        x = self.pos(x)
        for blk in self.blocks:
            x = blk(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)              # [B,T,V]
        return logits
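
To try the class in isolation, the sketch below pairs it with stand-in versions of SinusoidalPositionalEncoding and TransformerBlock. Both stand-ins are assumptions written for this example (a fixed sine/cosine table and a pre-norm block built on nn.MultiheadAttention with a causal mask); the chapter's own components may differ in detail. TinyGPT is repeated unchanged so the snippet runs on its own, and the small sizes are illustrative only.

```python
import math
import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Stand-in (assumed interface): adds a fixed sine/cosine table to the input."""
    def __init__(self, d_model, max_len):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)                       # [max_len, 1]
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)                                 # buffer, not a parameter

    def forward(self, x):
        return x + self.pe[: x.size(1)]


class TransformerBlock(nn.Module):
    """Stand-in (assumed design): pre-norm block with causal self-attention."""
    def __init__(self, d_model, n_heads, d_ff, dropout):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model), nn.Dropout(dropout)
        )

    def forward(self, x):
        T = x.size(1)
        # True above the diagonal means "position i may not attend to position j > i"
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a
        return x + self.ff(self.ln2(x))


class TinyGPT(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_layers=4, n_heads=4, d_ff=1024, max_len=512, dropout=0.1):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos = SinusoidalPositionalEncoding(d_model, max_len)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.tok_embed.weight = self.lm_head.weight

    def forward(self, idx):
        x = self.tok_embed(idx)               # [B,T,C]
        x = self.pos(x)
        for blk in self.blocks:
            x = blk(x)
        x = self.ln_f(x)
        return self.lm_head(x)                # [B,T,V]


torch.manual_seed(0)
model = TinyGPT(vocab_size=100, d_model=32, n_layers=2, n_heads=4, d_ff=64, max_len=16).eval()
idx = torch.randint(0, 100, (2, 10))          # batch of 2 sequences, 10 tokens each
logits = model(idx)
print(tuple(logits.shape))                    # (2, 10, 100)
```

The output has one row of vocabulary scores per input position, which is the [B,T,V] shape noted in the forward method.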

Here's a comprehensive breakdown of the TinyGPT class:

Class Definition:

TinyGPT is a PyTorch neural network module that implements a compact decoder-only transformer architecture (similar to GPT-style models). It inherits from PyTorch's nn.Module base class, which provides the foundation for all neural network modules in PyTorch.

Constructor Parameters:

  • vocab_size: Size of the vocabulary (number of unique tokens)
  • d_model: Dimension of the embedding vectors (default: 256)
  • n_layers: Number of transformer blocks (default: 4)
  • n_heads: Number of attention heads in each transformer block (default: 4)
  • d_ff: Dimension of the feed-forward network within transformer blocks (default: 1024)
  • max_len: Maximum sequence length supported (default: 512)
  • dropout: Dropout probability for regularization (default: 0.1)
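
To get a feel for how these parameters set the model's size, here is a rough parameter count in plain Python. It is a sketch under stated assumptions, since TransformerBlock is not shown in this section: each block is assumed to hold four d_model × d_model attention projections with biases, a two-layer feed-forward network with biases, and two LayerNorms. The helper name tinygpt_param_count is hypothetical, and the vocabulary size of 1000 is an arbitrary example value.

```python
def tinygpt_param_count(vocab_size, d_model=256, n_layers=4, d_ff=1024):
    """Estimate trainable parameters under the assumed block layout."""
    embed = vocab_size * d_model                     # one matrix, shared by tok_embed and lm_head
    attn = 4 * (d_model * d_model + d_model)         # Q, K, V and output projections, with biases
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    norms = 2 * (2 * d_model)                        # two LayerNorms per block (scale + shift)
    block = attn + ffn + norms
    final_ln = 2 * d_model                           # ln_f
    # Sinusoidal positions are fixed, so they add no trainable parameters.
    return embed + n_layers * block + final_ln


total = tinygpt_param_count(vocab_size=1000)
print(f"{total:,}")                                  # 3,415,552
```

n_heads does not appear in the count because splitting d_model across heads changes how the projections are reshaped, not how many weights they contain. max_len and dropout add no parameters either under these assumptions.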

Component Initialization:

  • tok_embed: An embedding layer that converts token IDs to dense vectors of size d_model
  • pos: A SinusoidalPositionalEncoding layer that adds positional information to the embeddings
  • blocks: A ModuleList containing n_layers TransformerBlock instances, each with the specified parameters
  • ln_f: A final LayerNorm applied after all transformer blocks
  • lm_head: A linear layer that projects from d_model dimensions to vocab_size, producing logits for next-token prediction

Weight Tying:

The code ties the weights of the token embedding (tok_embed) and the output projection (lm_head) with the line: self.tok_embed.weight = self.lm_head.weight. Both layers store a matrix of shape [vocab_size, d_model], so a single parameter can serve both. Sharing it removes vocab_size × d_model parameters from the model, and weight tying has also been reported to improve language-model quality.
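
The effect of tying can be checked directly on two standalone layers. The snippet below is a small illustration, not part of the TinyGPT class; the sizes 100 and 16 are arbitrary example values.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 16                          # illustrative sizes only

tok_embed = nn.Embedding(vocab_size, d_model)          # weight shape: [vocab_size, d_model]
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # weight shape: [vocab_size, d_model]

untied = tok_embed.weight.numel() + lm_head.weight.numel()

tok_embed.weight = lm_head.weight                      # same tying line as in TinyGPT

# parameters() yields a shared Parameter only once, so the count halves.
pair = nn.ModuleList([tok_embed, lm_head])
tied = sum(p.numel() for p in pair.parameters())

# An in-place change through one layer is visible through the other.
with torch.no_grad():
    lm_head.weight[3].fill_(1.0)
row = tok_embed(torch.tensor([3]))

print(untied, tied)                                    # 3200 1600
print(tok_embed.weight is lm_head.weight)              # True
print(bool((row == 1.0).all()))                        # True
```

Because both attributes refer to the same Parameter object, an optimizer updates it once per step, with gradients accumulated from both the embedding lookup and the output projection.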

Forward Method:

The forward method defines the data flow through the model:

  1. Takes token indices (idx) as input
  2. Converts them to embeddings using tok_embed - resulting shape is [Batch, Time, Channels]
  3. Adds positional information using the pos encoder
  4. Sequentially processes the embeddings through each transformer block
  5. Applies the final layer normalization (ln_f)
  6. Projects to vocabulary logits using lm_head - resulting shape is [Batch, Time, Vocabulary]
  7. Returns the logits for further processing (typically computing loss or generating predictions)
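
The two typical uses of the returned logits can be sketched as follows. Random logits stand in for model(idx) here, and the example assumes training targets are formed by shifting the input one position to the left, which is a common convention but not something this section specifies.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 8, 50                         # batch, time, vocabulary (example sizes)
idx = torch.randint(0, V, (B, T))          # token IDs that would be fed to the model
logits = torch.randn(B, T, V)              # stand-in for model(idx)

# Training: the logits at position t are scored against the token at position t+1.
pred = logits[:, :-1, :]                   # drop the last position (it has no target)
target = idx[:, 1:]                        # drop the first token (nothing predicts it)
loss = F.cross_entropy(pred.reshape(-1, V), target.reshape(-1))

# Inference: only the logits at the final position are needed for the next token.
next_token = logits[:, -1, :].argmax(dim=-1)   # [B], greedy choice

print(loss.item() > 0, tuple(next_token.shape))
```

cross_entropy expects class scores in the last dimension of a 2D input, which is why the batch and time dimensions are flattened together before the call.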

Architecture Significance:

This TinyGPT implementation represents a scaled-down version of modern decoder-only transformer architectures like GPT-2/3/4. Despite its simplicity, it contains all the essential components: token embeddings, positional encodings, self-attention mechanisms (via the TransformerBlock), and the final projection layer for next-token prediction. One difference worth noting is that GPT-2 and GPT-3 learn their position embeddings, whereas this model uses fixed sinusoidal encodings.
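
In a decoder-only model, the self-attention inside each block is expected to be causally masked, so that a position cannot look at later positions. TransformerBlock is not shown in this section, so the snippet below is only a minimal single-head illustration of that masking, with no learned projections (queries, keys, and values are all the raw input). The function name causal_self_attention is hypothetical.

```python
import torch

def causal_self_attention(x):
    """Single-head attention with q = k = v = x and a causal mask."""
    T, C = x.size(1), x.size(2)
    scores = x @ x.transpose(-2, -1) / C ** 0.5                   # [B,T,T]
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))            # hide later positions
    return torch.softmax(scores, dim=-1) @ x                      # [B,T,C]

torch.manual_seed(0)
x = torch.randn(1, 6, 8)
out = causal_self_attention(x)

# Perturb only the last position and recompute.
x_changed = x.clone()
x_changed[:, -1] += 1.0
out_changed = causal_self_attention(x_changed)

earlier_unchanged = torch.allclose(out[:, :-1], out_changed[:, :-1])
last_changed = not torch.allclose(out[:, -1], out_changed[:, -1])
print(earlier_unchanged, last_changed)                            # True True
```

Changing the final token leaves every earlier output untouched, which is the property that lets the model be trained on all positions of a sequence in parallel.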

The architecture follows a decoder-only approach, which means it's designed for autoregressive tasks like text generation where each token prediction depends only on previous tokens, not future ones.
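
Autoregressive generation can be sketched as a loop that feeds the model its own output. The function below is a hedged example using greedy decoding; generate_greedy and StandInLM are hypothetical names, and the stand-in model only mimics TinyGPT's contract (token IDs of shape [B,T] in, logits of shape [B,T,V] out) so the snippet runs on its own.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def generate_greedy(model, idx, max_new_tokens, max_len):
    """Append max_new_tokens tokens to idx, one at a time, picking the argmax."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -max_len:]                       # keep the context within max_len
        logits = model(idx_cond)                           # [B,T,V]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # [B,1]
        idx = torch.cat([idx, next_id], dim=1)             # extend the sequence
    return idx


class StandInLM(nn.Module):
    """Placeholder with TinyGPT's input/output shapes; not a real language model."""
    def __init__(self, vocab_size, d_model=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        return self.head(self.embed(idx))


torch.manual_seed(0)
model = StandInLM(vocab_size=50)
prompt = torch.randint(0, 50, (1, 4))
out = generate_greedy(model, prompt, max_new_tokens=5, max_len=8)
print(tuple(out.shape))                                    # (1, 9)
```

A TinyGPT instance could be passed in place of StandInLM, with max_len set to the value used when constructing the model. Sampling from the softmax of the last-position logits instead of taking the argmax is a common alternative when more varied output is wanted.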