Under the Hood of Large Language Models, Chapter 89

8. Plug Into a Small Model (sanity run)


Use your tokenizer alongside a tiny model (e.g., DistilGPT-2) to confirm that round-trip encoding and decoding works. This is a quick check that the tokenizer behaves correctly in a real model context. The round-trip test encodes text into token IDs, in the tensor format a model would consume, and then decodes those IDs back to text. If the decoded string matches the input (allowing for whitespace normalization and any special tokens), the tokenizer is preserving the information the model needs.

For true training from scratch with a custom tokenizer, you need to align the model's embedding layer with your tokenizer's vocabulary size. This means initializing the model with an embedding matrix of shape [vocab_size, embedding_dim], where vocab_size is the number of tokens in your custom tokenizer. Without this alignment, any token ID at or above the model's vocabulary size falls outside the embedding matrix, which typically surfaces as an index error (or a device-side assertion on GPU) during training or inference.
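To make the failure mode concrete, here is a minimal, dependency-free sketch of an embedding lookup. It is not how transformers implements embeddings; the table is a plain list of lists, and the vocabulary size (8000) and token IDs are hypothetical values standing in for len(tok) and real tokenizer output.

```python
import random


def make_embedding_table(vocab_size, embedding_dim, seed=0):
    """Build a toy [vocab_size, embedding_dim] table of small random floats."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.02, 0.02) for _ in range(embedding_dim)]
        for _ in range(vocab_size)
    ]


def embed(table, token_ids):
    """Return one row per token ID, rejecting IDs outside the table."""
    rows = len(table)
    for tid in token_ids:
        # Explicit range check: Python would silently accept negative indices.
        if not 0 <= tid < rows:
            raise IndexError(
                f"token id {tid} is out of range for {rows} embedding rows"
            )
    return [table[tid] for tid in token_ids]


tokenizer_vocab_size = 8000      # hypothetical; stands in for len(tok)
token_ids = [2, 417, 7999, 3]    # hypothetical IDs from that tokenizer

# Aligned case: the table has one row per tokenizer entry.
aligned = make_embedding_table(tokenizer_vocab_size, embedding_dim=4)
vectors = embed(aligned, token_ids)
print(len(vectors), len(vectors[0]))

# Misaligned case: the model was built for a smaller vocabulary.
too_small = make_embedding_table(5000, embedding_dim=4)
try:
    embed(too_small, token_ids)
    mismatch_error = None
except IndexError as err:
    mismatch_error = str(err)
print(mismatch_error)
```

With transformers, the equivalent alignment is usually done by setting vocab_size=len(tok) in the model config before instantiating the model, or by calling model.resize_token_embeddings(len(tok)) on a model that is already loaded.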

from transformers import AutoModelForCausalLM, AutoTokenizer

# Using our fast BPE in a transformers-friendly wrapper:
from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast(tokenizer_file="artifacts/legal_bpe_v2.json")
tok.pad_token = "[PAD]"
tok.bos_token = "[BOS]"
tok.eos_token = "[EOS]"
tok.unk_token = "[UNK]"

# Quick encode/decode round trip
ex = tok("WHEREAS, the Parties amend the MSA.", return_tensors="pt")
print(ex["input_ids"])
print(tok.decode(ex["input_ids"][0]))

Here's a breakdown of this step, which demonstrates how to use a custom BPE tokenizer with the Hugging Face transformers library:

The code begins by importing necessary libraries from the transformers package:

from transformers import AutoModelForCausalLM, AutoTokenizer

Although AutoModelForCausalLM and AutoTokenizer are imported, neither is used in this snippet. AutoModelForCausalLM would typically be used to load a language model to pair with the tokenizer.

The code then imports the PreTrainedTokenizerFast class, which serves as a wrapper to make custom tokenizers compatible with the transformers library:

from transformers import PreTrainedTokenizerFast

Next, it loads the previously saved custom legal BPE tokenizer by specifying the path to the saved tokenizer file:

tok = PreTrainedTokenizerFast(tokenizer_file="artifacts/legal_bpe_v2.json")

The code then sets special tokens for the tokenizer, which are essential for proper functioning with transformer models:

tok.pad_token = "[PAD]"
tok.bos_token = "[BOS]"
tok.eos_token = "[EOS]"
tok.unk_token = "[UNK]"

These special tokens serve specific purposes:

[PAD]: Used for padding sequences to a uniform length

[BOS]: Marks the beginning of a sequence

[EOS]: Marks the end of a sequence

[UNK]: Represents unknown tokens not in the vocabulary
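Assigning a special-token string only helps if that string actually exists in the tokenizer's vocabulary. The sketch below shows one way to check this; the vocabulary is a small hypothetical dictionary (with [EOS] deliberately left out) standing in for what tok.get_vocab() would return, and resolve_special_tokens is an illustrative helper, not a transformers function.

```python
def resolve_special_tokens(vocab, special_tokens):
    """Map each special-token role to its ID, or None if the string is absent."""
    return {role: vocab.get(token) for role, token in special_tokens.items()}


# Hypothetical vocabulary slice; in practice this would be tok.get_vocab().
vocab = {"[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "WHEREAS": 10, "Parties": 11}

special_tokens = {
    "pad_token": "[PAD]",
    "bos_token": "[BOS]",
    "eos_token": "[EOS]",   # deliberately missing from the toy vocab
    "unk_token": "[UNK]",
}

resolved = resolve_special_tokens(vocab, special_tokens)
missing = [role for role, token_id in resolved.items() if token_id is None]

print(resolved)
print("missing from vocab:", missing)
```

If a role shows up as missing, the usual fix is to include that token in the special-tokens list when training the tokenizer, so it receives a dedicated ID instead of falling back to [UNK].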

Finally, the code performs a round-trip test of the tokenizer with a legal text sample:

ex = tok("WHEREAS, the Parties amend the MSA.", return_tensors="pt")
print(ex["input_ids"])
print(tok.decode(ex["input_ids"][0]))

This test:

Encodes the legal text "WHEREAS, the Parties amend the MSA." into token IDs, returning PyTorch tensors (return_tensors="pt")

Prints the resulting input_ids (the numerical representation of tokens)

Decodes the first sequence of input_ids back to text to verify the round-trip works correctly

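This "sanity check" confirms, for this sample, that the tokenizer handles domain-specific legal terminology (like "MSA") and produces output in the format transformer models expect. A single sentence is not proof of correctness, so it is worth repeating the check on a handful of representative documents before moving on to fine-tuning or inference.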
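The round-trip check can also be automated over several samples rather than inspected by eye. The following is a minimal sketch under stated assumptions: round_trip_failures is a hypothetical helper, and the toy whitespace tokenizer exists only so the example runs without any dependencies. Comparison is done after whitespace normalization, since many tokenizers do not preserve spacing exactly.

```python
def normalize(text):
    """Collapse runs of whitespace so spacing differences are ignored."""
    return " ".join(text.split())


def round_trip_failures(encode, decode, samples):
    """Return (original, restored) pairs where decode(encode(x)) != x."""
    failures = []
    for text in samples:
        restored = decode(encode(text))
        if normalize(restored) != normalize(text):
            failures.append((text, restored))
    return failures


# Toy stand-in tokenizer (hypothetical): whitespace split with an [UNK] fallback.
toy_vocab = {"[UNK]": 0, "WHEREAS,": 1, "the": 2, "Parties": 3, "amend": 4, "MSA.": 5}
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}


def toy_encode(text):
    return [toy_vocab.get(word, toy_vocab["[UNK]"]) for word in text.split()]


def toy_decode(ids):
    return " ".join(id_to_token[i] for i in ids)


samples = [
    "WHEREAS, the Parties amend the MSA.",
    "the Parties terminate the MSA.",   # "terminate" is not in the toy vocab
]

failures = round_trip_failures(toy_encode, toy_decode, samples)
for original, restored in failures:
    print(f"MISMATCH: {original!r} -> {restored!r}")
```

To run the same check against the real tokenizer, one option is to pass encode=lambda s: tok(s)["input_ids"] and decode=lambda ids: tok.decode(ids, skip_special_tokens=True). Any sample that comes back containing [UNK] points to a coverage gap in the vocabulary.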

When training from scratch, size the embedding matrix to len(tok) and make sure the special token IDs in the model configuration match the tokenizer's. When fine-tuning, you usually stick with the base model's tokenizer, unless your domain truly demands a custom one.