NLP with Transformers: Fundamentals and Core Applications
Chapter 96
6. Step 3: Tokenizing the Dataset
BERT requires tokenized input, so we’ll use its tokenizer to preprocess the text.
Code Example: Tokenization
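# Import the tokenizer class (skip if it was already imported in an earlier step)
from transformers import BertTokenizer

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize dataset
def tokenize_function(examples):
    # Truncate longer texts and pad shorter ones so every example has 128 tokens
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

# Apply tokenization
# train_data and test_data are assumed to be the datasets loaded in the previous step
tokenized_train = train_data.map(tokenize_function, batched=True)
tokenized_test = test_data.map(tokenize_function, batched=True)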
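To see what truncation=True, padding="max_length", and max_length do to each example, here is a minimal sketch in plain Python. It is a toy illustration only: the vocabulary, the special-token ids, and the toy_encode helper are hypothetical and do not reproduce BERT's WordPiece tokenizer. It shows only how a sequence is cut or padded to a fixed length and how the attention mask marks real tokens.

```python
# Toy illustration of truncation and padding to a fixed length.
# NOT BERT's WordPiece tokenizer: the vocabulary, ids, and helper are hypothetical.

PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 1, 2, 3
TOY_VOCAB = {"the": 4, "movie": 5, "was": 6, "great": 7}


def toy_encode(text, max_length=8):
    """Map whitespace-split words to ids, then truncate and pad to max_length."""
    ids = [TOY_VOCAB.get(word, UNK_ID) for word in text.lower().split()]
    # Truncate so the start and end markers still fit within max_length.
    ids = [CLS_ID] + ids[: max_length - 2] + [SEP_ID]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(ids)
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }


short_example = toy_encode("The movie was great")
long_example = toy_encode("the movie was great " * 5)

print(short_example)
print(long_example)
```

The short text is padded up to the fixed length, with zeros in the attention mask over the padding positions, while the long text is cut so that both outputs have the same length. The real tokenizer call above applies the same idea with max_length=128, which lets examples be stacked into uniform batches.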