5. Preprocess the Dataset

Before feeding the data into BERT, we need to tokenize the text using BERT’s tokenizer.

from transformers import BertTokenizer

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Let's break down this code that preprocesses data for BERT:

1. Import and Initialize Tokenizer:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

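This loads the tokenizer that matches the bert-base-uncased checkpoint. "Uncased" means the tokenizer lowercases the text (and strips accent marks) before splitting it into tokens, so "Hello" and "hello" produce the same tokens.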
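To make the "uncased" behavior concrete, here is a minimal sketch of the kind of normalization applied before the text is split into tokens. It uses only Python's standard library; uncased_normalize is a hypothetical helper written for illustration, not part of the transformers API, and it covers only the lowercasing and accent-stripping step.

```python
import unicodedata

def uncased_normalize(text: str) -> str:
    """Hypothetical helper: approximate the cleanup an uncased tokenizer applies."""
    lowered = text.lower()
    # Decompose accented characters (NFD), then drop the combining marks ("Mn").
    decomposed = unicodedata.normalize("NFD", lowered)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(uncased_normalize("Hello WORLD"))   # hello world
print(uncased_normalize("Café Résumé"))   # cafe resume
```

Because casing is discarded, an uncased model cannot tell "Apple" from "apple". That is usually acceptable for tasks such as sentiment classification, but it is worth keeping in mind when capitalization carries meaning.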

2. Define Tokenization Function:

def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

This function:

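- Takes a batch of examples and reads their 'text' field

- Pads every sequence to the model's maximum length (padding="max_length", which is 512 tokens for bert-base-uncased), so all inputs have the same length

- Truncates texts that exceed that maximum length (truncation=True)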
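The effect of padding and truncation can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: pad_or_truncate is a hypothetical helper, the token IDs are illustrative, the padding ID of 0 is an assumption for this sketch, and the real tokenizer also keeps the closing special token when it truncates.

```python
PAD_ID = 0  # assumed padding token ID for this sketch

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Hypothetical helper: return (input_ids, attention_mask), each max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill the remainder
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_ids, short_mask)  # [101, 7592, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 1, 2, 3, 4] [1, 1, 1, 1, 1]
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer returns it alongside the token IDs.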

3. Apply Tokenization:

tokenized_datasets = dataset.map(tokenize_function, batched=True)

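This step converts the raw text into token IDs that BERT can understand. With batched=True, map passes the examples to tokenize_function in batches rather than one at a time, so examples['text'] is a list of strings. The result keeps the original columns and adds the tokenizer's outputs, such as input_ids and attention_mask. Because of the padding and truncation settings, every tokenized example has the same length.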
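The shape of the data in a batched map can be sketched without any libraries. Both functions below are hypothetical stand-ins written for illustration: toy_tokenize plays the role of tokenize_function (it counts words instead of producing token IDs), and map_batched mimics how a batched map hands the function a dict of columns and merges the returned columns back into each row.

```python
def toy_tokenize(batch):
    """Hypothetical stand-in for tokenize_function: receives a dict of column -> list."""
    # In batched mode the function sees a list of texts, not a single string.
    return {"n_words": [len(text.split()) for text in batch["text"]]}

def map_batched(rows, fn, batch_size=2):
    """Hypothetical helper mimicking a batched map over a list of row dicts."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of rows into a dict of columns, one list per column.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)
        # Merge the new columns back into each original row.
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({key: values[i] for key, values in new_cols.items()})
            out.append(merged)
    return out

rows = [{"text": "a great movie"}, {"text": "terrible"}, {"text": "not bad at all"}]
result = map_batched(rows, toy_tokenize)
print(result[0])  # {'text': 'a great movie', 'n_words': 3}
```

The key point is that the mapped function must return one value per input example for every column it adds, which the tokenizer does when it is called on a list of texts.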