Day 28: Fine-Tuning the BERT Model for Sentiment Analysis
Today, I fine-tuned the BERT model on the IMDb dataset for a custom NLP task: sentiment analysis. Fine-tuning allows the pre-trained model to adapt to specific tasks and datasets, resulting in better performance compared to training from scratch.
Problem Statement
The task was to classify movie reviews from the IMDb dataset as:
- Positive Sentiment: Label 1
- Negative Sentiment: Label 0
The dataset consists of 50,000 movie reviews, split equally into 25,000 training and 25,000 testing examples; a quick look at the raw data is sketched below.
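Before any fine-tuning, here is a minimal sketch (assuming the Hugging Face datasets library is installed) of how one review and its label can be inspected:
from datasets import load_dataset

dataset = load_dataset('imdb')
print(dataset)                 # lists the available splits and their sizes
sample = dataset['train'][0]
print(sample['label'])         # 0 = negative, 1 = positive
print(sample['text'][:200])    # first 200 characters of the review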
Fine-Tuning BERT
Code:
# Problem: Fine-tune the BERT model on a custom NLP task
from transformers import BertTokenizer, TFBertForSequenceClassification
from datasets import load_dataset
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
# Load the data
# This dataset contains 50,000 movie reviews,
# split equally into training and testing sets,
# with labels indicating whether the review is positive (1) or negative (0).
dataset = load_dataset('imdb')
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Pad/truncate each review to a fixed length of 128 tokens
def tokenize_func(movie):
    return tokenizer(movie['text'], padding='max_length', truncation=True, max_length=128)
# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_func, batched=True)
# Prepare the Dataset
# Convert the tokenized dataset into a TensorFlow-friendly format.
train_dataset = tokenized_dataset['train'].to_tf_dataset(
    columns=['input_ids', 'attention_mask'],
    label_cols='label',
    shuffle=True,
    batch_size=16
)
test_dataset = tokenized_dataset['test'].to_tf_dataset(
    columns=['input_ids', 'attention_mask'],
    label_cols='label',
    shuffle=False,
    batch_size=16
)
# Build the BERT model for classification
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.compile(
    optimizer=Adam(learning_rate=0.00005),
    # The model outputs raw logits, so the loss must be computed from logits
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)
# Add callbacks for better fine-tuning
checkpoint_callback = ModelCheckpoint(
    filepath='bert_finetuned_best_model.h5',
    save_best_only=True,
    save_weights_only=True,
    monitor='val_loss',
    mode='min',
    verbose=1
)
early_stopping_callback = EarlyStopping(
    monitor='val_loss',
    patience=3,
    restore_best_weights=True,
    verbose=1
)
reduce_lr_callback = ReduceLROnPlateau(
    monitor='val_loss',
    patience=2,
    factor=0.5,
    min_lr=0.000006,
    verbose=1
)
# Train the model
model.fit(
    train_dataset,
    validation_data=test_dataset,
    epochs=10,
    callbacks=[checkpoint_callback, early_stopping_callback, reduce_lr_callback]
)
# Load the best model for evaluation.
model.load_weights('bert_finetuned_best_model.h5')
# Evaluate the Model
loss, accuracy = model.evaluate(test_dataset)
print(f"Loss is {loss} and accuracy is: {accuracy}")
Step 1: Dataset Preparation
- The IMDb dataset was loaded using the datasets library.
- Reviews were tokenized using the bert-base-uncased tokenizer (illustrated in the sketch below):
  - Padding: Ensures input sequences are of equal length.
  - Truncation: Trims longer reviews to a maximum length of 128 tokens.
  - Max Length: Limits the tokenized sequences to 128 tokens.
- The tokenized dataset was converted into a TensorFlow-friendly format using the to_tf_dataset method.
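To make these settings concrete, here is a small illustrative sketch (separate from the training script) that tokenizes a single review; with max_length=128, every example ends up with exactly 128 input ids and a matching attention mask:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoded = tokenizer(
    "The movie was surprisingly good!",
    padding='max_length',
    truncation=True,
    max_length=128
)
print(len(encoded['input_ids']))       # 128 token ids after padding
print(encoded['attention_mask'][:10])  # 1s for real tokens, 0s for padding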
Step 2: BERT Model Configuration
The TFBertForSequenceClassification model was used:
- Pre-trained Weights: bert-base-uncased.
- Classification Head: A fully connected layer with 2 output neurons for binary classification.
Compilation Details:
- Optimizer: Adam with a learning rate of 0.00005.
- Loss Function: Sparse categorical cross-entropy computed from the model's raw logits (from_logits=True); see the sketch below.
- Metric: Accuracy.
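One detail worth noting: the classification head returns raw logits rather than probabilities, which is why the loss is computed with from_logits=True. A small illustrative sketch, separate from the training script (the predictions are meaningless until after fine-tuning):
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

batch = tokenizer(
    ["A wonderful film.", "A complete waste of time."],
    padding=True, truncation=True, max_length=128, return_tensors='tf'
)
outputs = model(batch)
print(outputs.logits.shape)                    # (2, 2): batch size x num_labels
print(tf.nn.softmax(outputs.logits, axis=-1))  # convert logits to probabilities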
Step 3: Callbacks for Fine-Tuning
Several callbacks were added to improve fine-tuning:
- Model Checkpoint:
  - Saves the best model weights based on validation loss.
- Early Stopping:
  - Stops training if the validation loss does not improve for 3 consecutive epochs.
  - Restores the best weights at the end of training.
- Reduce Learning Rate on Plateau:
  - Reduces the learning rate by a factor of 0.5 if the validation loss stagnates for 2 epochs, down to a floor of 0.000006 (the resulting decay is sketched below).
  - Prevents the model from getting stuck on a plateau.
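For a rough sense of how far the learning rate can fall under this schedule, here is the arithmetic only (Keras applies the min_lr floor internally), starting from the initial 0.00005:
lr = 5e-5
schedule = [lr]
for _ in range(4):              # each plateau halves the rate, floored at min_lr
    lr = max(lr * 0.5, 6e-6)
    schedule.append(lr)
print(schedule)                 # [5e-05, 2.5e-05, 1.25e-05, 6.25e-06, 6e-06]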
Step 4: Training the Model
The model was trained for up to 10 epochs (early stopping may end training sooner) with:
- Batch Size: 16 for both training and validation datasets.
- Validation Data: Testing dataset was used for validation during training.
- Callbacks: The three callbacks ensured efficient and effective training.
Step 5: Evaluation
After training, the model’s best weights (saved during training) were loaded for evaluation. The model was tested on the test dataset to calculate:
- Loss: Measures the error in predictions.
- Accuracy: Measures the percentage of correctly classified reviews.
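Beyond the aggregate metrics, the fine-tuned model can also score individual reviews. A minimal sketch, reusing the tokenizer and model from the script above (predict_sentiment is a hypothetical helper, not part of the original code):
import tensorflow as tf

def predict_sentiment(text):
    # Tokenize a single review the same way as during training
    inputs = tokenizer(text, padding='max_length', truncation=True,
                       max_length=128, return_tensors='tf')
    logits = model(inputs).logits
    probs = tf.nn.softmax(logits, axis=-1).numpy()[0]
    return {'negative': float(probs[0]), 'positive': float(probs[1])}

print(predict_sentiment("One of the best films I have seen this year."))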
Results
The fine-tuned model achieved the following results:
- Loss: Approximately 0.27
- Accuracy: 0.93 (93%)