Day 27: Building a Transformer-Based Model (BERT) for Text Classification on IMDb Dataset
Today, I implemented a BERT-based transformer model to classify movie reviews as either positive or negative using the IMDb dataset. This was my first dive into transformers for text classification, and it was an exciting exploration into natural language processing (NLP) using state-of-the-art models.
Problem Statement
The task was to classify movie reviews from the IMDb dataset as either:
- Positive: Label 1
- Negative: Label 0
The dataset consists of 50,000 movie reviews, split equally into training and testing sets.
Approach
Code:
# Problem: Build a simple transformer-based model (BERT) for text classification (IMDb Dataset)
from transformers import BertTokenizer, TFBertForSequenceClassification
from datasets import load_dataset
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
# Load the data
# This dataset contains 50,000 movie reviews,
# split equally into training and testing sets,
# with labels indicating whether the review is positive (1) or negative (0).
dataset = load_dataset('imdb')
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def tokenize_func(movie):
    return tokenizer(movie['text'], padding='max_length', truncation=True, max_length=128)
# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_func, batched=True)
# Prepare the Dataset
# Convert the tokenized dataset into a TensorFlow-friendly format.
train_dataset = tokenized_dataset['train'].to_tf_dataset(
    columns=['input_ids', 'attention_mask'],
    label_cols='label',
    shuffle=True,
    batch_size=16
)
test_dataset = tokenized_dataset['test'].to_tf_dataset(
    columns=['input_ids', 'attention_mask'],
    label_cols='label',
    shuffle=False,
    batch_size=16
)
# Build the BERT model with a 2-class classification head
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.compile(
    optimizer=Adam(learning_rate=0.00005),
    # The model outputs raw logits, so the loss must apply the softmax itself
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)
# Train the model
model.fit(
    train_dataset,
    validation_data=test_dataset,
    epochs=3
)
# Evaluate the Model
loss, accuracy = model.evaluate(test_dataset)
print(f"Loss is {loss} and accuracy is: {accuracy}")
Step 1: Dataset Loading
The IMDb dataset was loaded using the datasets library:
- Training Set: 25,000 movie reviews.
- Testing Set: 25,000 movie reviews.
The reviews were then tokenized to make them compatible with the BERT model.
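For reference, here is a quick way to sanity-check the splits right after loading (a small sketch; the text and label field names are what load_dataset('imdb') returns, and the example index is arbitrary):
from datasets import load_dataset
# Load IMDb; returns a DatasetDict with 'train' and 'test' splits of 25,000 reviews each
dataset = load_dataset('imdb')
print(len(dataset['train']), len(dataset['test']))  # 25000 25000
# Peek at one example: a raw review string and its integer label (0 = negative, 1 = positive)
sample = dataset['train'][0]
print(sample['label'], sample['text'][:200])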
Step 2: Tokenization with BERT Tokenizer
I used the pre-trained bert-base-uncased tokenizer from the Hugging Face Transformers library:
- Padding: Ensures all input sequences are of equal length.
- Truncation: Trims longer reviews to a maximum length of 128 tokens.
- Max Length: Limits the tokenized sequences to 128 tokens for efficient training.
The tokenized dataset was then converted into a TensorFlow-friendly format using the to_tf_dataset method, which supports:
- Input Columns: input_ids and attention_mask.
- Labels: Positive or negative sentiment (label column).
Step 3: Model Architecture
The model used was TFBertForSequenceClassification:
- Pre-trained Weights: bert-base-uncased, which has already been trained on a large corpus of English text.
- Classification Head: A simple fully connected layer with 2 output neurons (for binary classification).
The model was compiled with:
- Optimizer: Adam with a learning rate of 0.00005.
- Loss Function: SparseCategoricalCrossentropy with from_logits=True, since the labels are plain integers (0 or 1) and the model's classification head outputs raw logits rather than probabilities.
- Metric: Accuracy.
Step 4: Training
The model was trained on the tokenized training set for 3 epochs with the following:
- Batch Size: 16 for both training and testing datasets.
- Validation Data: Testing dataset was used for validation during training.
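Before committing to the full 3-epoch run, a smoke test on a small slice of the data is a cheap way to confirm the loss actually decreases. This sketch reuses the tokenized_dataset and compiled model from the code above; the 1,000-example subset and single epoch are arbitrary choices, not part of the actual run:
# Take a small shuffled slice of the tokenized training data
small_train = tokenized_dataset['train'].shuffle(seed=42).select(range(1000)).to_tf_dataset(
    columns=['input_ids', 'attention_mask'],
    label_cols='label',
    shuffle=True,
    batch_size=16
)
# One quick epoch to verify the training loop works end to end
model.fit(small_train, epochs=1)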
Step 5: Evaluation
After training, the model was evaluated on the test dataset to calculate:
- Loss: Indicates the error in predictions.
- Accuracy: Measures how many reviews were correctly classified.
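Beyond the aggregate metrics, the trained model can also be probed on a single new review. A small sketch, reusing the tokenizer and model from above (the review text is made up):
import tensorflow as tf
review = "The plot dragged, but the performances made it worth watching."
inputs = tokenizer(review, return_tensors='tf',
                   padding='max_length', truncation=True, max_length=128)
# Convert the raw logits to class probabilities and pick the most likely label
probs = tf.nn.softmax(model(inputs).logits, axis=-1)
pred = int(tf.argmax(probs, axis=-1)[0])
print("positive" if pred == 1 else "negative", probs.numpy())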