Day 17: Building an LSTM Model for Sentiment Analysis

Welcome to Day 17 of our deep learning challenge! Today, we will build an LSTM (Long Short-Term Memory) model for sentiment analysis using the IMDb movie reviews dataset. Sentiment analysis aims to determine whether the sentiment of a given movie review is positive or negative.

Why LSTMs?

LSTMs are a special type of Recurrent Neural Network (RNN) designed to capture long-term dependencies. Unlike standard RNNs, which struggle to carry information across many time steps, LSTMs use gating to retain information over long sequences, which makes them well suited to tasks involving text, like sentiment analysis.
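
To make the gating idea concrete, here is a minimal NumPy sketch of a single LSTM time step (the standard formulation; the variable names and layout are illustrative, not Keras internals):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W has shape (4 * units, input_dim + units); b has shape (4 * units,)
    units = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0 * units:1 * units])      # forget gate: how much of the old cell state to keep
    i = sigmoid(z[1 * units:2 * units])      # input gate: how much new information to write
    c_hat = np.tanh(z[2 * units:3 * units])  # candidate cell content
    o = sigmoid(z[3 * units:4 * units])      # output gate: how much of the cell state to expose
    c_t = f * c_prev + i * c_hat             # the cell state carries long-term memory
    h_t = o * np.tanh(c_t)                   # the hidden state is the per-step output
    return h_t, c_t

The forget and input gates are what let the cell state carry information across many time steps, which is exactly what review-length text sequences require.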

Step-by-Step Solution

Step 1: Import Libraries

First, we import all the necessary libraries for working with data, building our model, and training it.

import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
import matplotlib.pyplot as plt
  • imdb: Contains the IMDb dataset, a collection of 50,000 movie reviews labeled as positive or negative (25,000 for training and 25,000 for testing).
  • pad_sequences: Used to make sure all input sequences are of the same length.
  • Embedding, LSTM, Dense, Dropout: Components to build our LSTM model.

Step 2: Load and Preprocess the IMDb Dataset

The IMDb dataset comes with pre-tokenized data. We need to load it and prepare it for training.

# Load the IMDb dataset
vocab_size = 10000  # Restricting the vocabulary to the 10,000 most common words
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)

# Padding sequences to ensure uniform length
max_length = 200  # Set the maximum length for each review
X_train = pad_sequences(X_train, maxlen=max_length, padding='post')
X_test = pad_sequences(X_test, maxlen=max_length, padding='post')

Explanation

  • vocab_size = 10000: We’re limiting the number of unique words to 10,000 for simplicity.
  • imdb.load_data(num_words=vocab_size): Loads the pre-tokenized dataset, keeping only the 10,000 most common words; rarer words are replaced by a special out-of-vocabulary index.
  • pad_sequences: Since reviews have different lengths, we use pad_sequences to make them all exactly 200 tokens long. Shorter reviews are padded with zeros at the end (padding='post'), while longer reviews are truncated (from the beginning, by default). A short illustration follows below.
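
To see what this preprocessing actually does, here is a small illustrative check (it reuses the imports from Step 1; imdb.get_word_index() is the lookup table shipped with the dataset):

# Toy example: shorter sequences are zero-padded, longer ones are truncated
toy = [[7, 12, 4], [3, 9, 15, 2, 8, 6]]
print(pad_sequences(toy, maxlen=5, padding='post'))
# [[ 7 12  4  0  0]
#  [ 9 15  2  8  6]]

# Decode the first training review back into words (indices are offset by 3
# because 0, 1 and 2 are reserved for padding, start-of-sequence and unknown)
word_index = imdb.get_word_index()
index_word = {i + 3: w for w, i in word_index.items()}
print(' '.join(index_word.get(i, '?') for i in X_train[0] if i > 2))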

Step 3: Define the LSTM Model

Now, we build the LSTM model.

# Define the LSTM model
model = Sequential()

# Embedding layer to convert word indices into dense vectors
embedding_dim = 128
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length))

# LSTM layer with 128 units
model.add(LSTM(units=128, return_sequences=False))

# Adding a dropout layer to prevent overfitting
model.add(Dropout(0.5))

# Output layer with a single neuron for binary classification
model.add(Dense(1, activation='sigmoid'))

# Summary of the model
model.summary()

Explanation

  • Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length): Converts word indices to dense vectors of fixed length (embedding_dim=128). This layer learns an embedding representation for each word during training.
  • LSTM(units=128, return_sequences=False): The LSTM layer maintains a 128-dimensional hidden state that it updates as it reads the review word by word; with return_sequences=False, only the final hidden state is passed on to the next layer.
  • Dropout(0.5): Adds dropout to prevent overfitting by randomly setting 50% of the units to zero during training.
  • Dense(1, activation='sigmoid'): A single neuron with sigmoid activation for binary classification (positive or negative).
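
As a quick sanity check on this architecture, the parameter counts reported by model.summary() can be reproduced by hand, and a dummy batch confirms the output shape (a rough sketch using the variables defined above):

# Embedding: one 128-dimensional vector per vocabulary entry
print(vocab_size * embedding_dim)                  # 1,280,000
# LSTM: 4 gates, each with input weights, recurrent weights and a bias
print(4 * (128 * (embedding_dim + 128) + 128))     # 131,584
# Dense: 128 weights + 1 bias
print(128 * 1 + 1)                                 # 129

# One sigmoid probability per review
dummy = np.zeros((2, max_length), dtype='int32')
print(model(dummy).shape)                          # (2, 1)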

Step 4: Compile the Model

We compile the model by specifying the optimizer, loss function, and metrics.

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Explanation

  • optimizer='adam': We use the Adam optimizer, which adapts the learning rate for each parameter during training and usually converges quickly without manual tuning.
  • loss='binary_crossentropy': Since we are dealing with a binary classification problem (positive vs. negative), we use binary cross-entropy as the loss function.
  • metrics=['accuracy']: We evaluate the model using accuracy.
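
For intuition, binary cross-entropy for one example is -[y·log(p) + (1 - y)·log(1 - p)], where p is the model's sigmoid output and y is the true label. A tiny check with made-up predictions (reusing the Step 1 imports):

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])   # hypothetical sigmoid outputs

# Manual binary cross-entropy, averaged over the batch
print(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))   # ~0.28

# Matches Keras' built-in loss (up to numerical clipping)
print(tf.keras.losses.binary_crossentropy(y_true, y_pred).numpy())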

Step 5: Train the Model

We train the model on the IMDb training dataset.

# Train the model
history = model.fit(
    X_train, y_train,
    epochs=5,
    batch_size=64,
    validation_split=0.2,
    verbose=1
)

Explanation

  • epochs=5: Training for 5 epochs is sufficient for this example to see the learning trends.
  • batch_size=64: We process 64 samples per training step, which balances memory use and training speed.
  • validation_split=0.2: Use 20% of the training data as a validation set to monitor the model’s performance.

Step 6: Evaluate the Model

After training, we evaluate the model’s performance on the test dataset.

# Evaluate the model on test data
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {test_accuracy:.4f}")

Explanation

  • model.evaluate(X_test, y_test): Evaluates the model’s accuracy and loss on unseen test data to determine how well the model generalizes.
  • Test Accuracy: Prints the accuracy on the test dataset.
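
Aggregate accuracy aside, the trained model can also score a new review. Below is a rough sketch of how one might encode raw text the same way imdb.load_data does (word indices offset by 3, out-of-vocabulary words mapped to 2); the example review string is made up:

word_index = imdb.get_word_index()

def encode_review(text):
    # Map words to the indices used by imdb.load_data; unknown or rare words become 2
    tokens = [word_index.get(w, -1) + 3 for w in text.lower().split()]
    tokens = [t if 2 < t < vocab_size else 2 for t in tokens]
    return pad_sequences([tokens], maxlen=max_length, padding='post')

review = "a wonderful film with a brilliant cast and a genuinely moving story"
prob = model.predict(encode_review(review), verbose=0)[0][0]
print("positive" if prob >= 0.5 else "negative", f"({prob:.2f})")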

Step 7: Plot Training and Validation Loss

We plot the training and validation loss to understand how well the model learned.

# Plotting training and validation loss
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.show()

Explanation

  • Plotting the Loss: This helps visualize whether the model is overfitting (when validation loss is much higher than training loss) or underfitting (both losses remain high).
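
The same plot can be drawn for accuracy; because we passed metrics=['accuracy'] to compile, the history also records 'accuracy' and 'val_accuracy':

# Plotting training and validation accuracy
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy')
plt.legend()
plt.show()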

Summary of LSTM Sentiment Analysis Model

  • We built an LSTM model to classify movie reviews as positive or negative.
  • The model contains an embedding layer to learn word representations, an LSTM layer to learn sequence patterns, and a dropout layer to prevent overfitting.
  • The final output layer is a single neuron for binary classification.
  • We used the Adam optimizer and trained the model for 5 epochs.
  • The model was evaluated on a test dataset to determine its accuracy.

Possible Improvements

  • Increase Vocabulary Size: Increasing vocab_size may lead to a richer representation of words, which could improve accuracy.
  • Use Pre-trained Embeddings: Instead of learning embeddings from scratch, we could use pre-trained embeddings like GloVe to give the model a better starting point.
  • Experiment with LSTM Layers: Adding more LSTM layers or units could help the model learn more complex relationships (a stacked-LSTM sketch follows below).
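
As a concrete example of the last point, here is a minimal sketch of a stacked-LSTM variant (layer sizes are arbitrary choices, not tuned values):

# Illustrative variant: two stacked LSTM layers
deeper_model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length),
    LSTM(128, return_sequences=True),   # pass the full sequence of hidden states onward
    LSTM(64),                           # the second LSTM returns only its final state
    Dropout(0.5),
    Dense(1, activation='sigmoid'),
])
deeper_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])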
