Day 19: Attention Mechanism for LSTM in Machine Translation

On Day 19 of our deep learning journey, we tackled a complex but fascinating concept—adding an attention mechanism to an LSTM model for machine translation. Below, I’ll guide you step by step through the process of building this model and provide explanations for each part of the code to make everything clear and approachable.

Step 1: Import Necessary Libraries

First, we import the essential libraries for data handling, model building, and training:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dot, LSTM, Dense, Embedding, Activation, Concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
  • TensorFlow: used to build and train the model.
  • Dot, Concatenate: layers used to construct the attention mechanism.
  • Tokenizer, pad_sequences: used for text preprocessing (tokenization and padding).

Step 2: Data Preprocessing

We define some parameters related to the sequences, such as maximum length and vocabulary size, and preprocess the data.

max_encoder_seq_length = 20
max_decoder_seq_length = 20
input_vocab_size = 10000
output_vocab_size = 10000
embedding_dim = 128
  • max_encoder_seq_length and max_decoder_seq_length: Define the maximum length of the input and output sequences.
  • input_vocab_size and output_vocab_size: Vocabulary sizes for input (English) and output (French) sentences.
  • embedding_dim: Embedding vector size for the input and output sequences.

Tokenizer Setup

We initialize the tokenizers for both input and output sequences:

input_tokenizer = Tokenizer(num_words=input_vocab_size, filters='')
output_tokenizer = Tokenizer(num_words=output_vocab_size, filters='')
  • filters='' ensures that special tokens like <start> and <end> are not filtered out during tokenization.
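
For example, fitting a tokenizer configured this way on a sentence that contains the markers keeps them in the vocabulary (an illustrative check; the exact indices depend on word frequencies):

demo_tokenizer = Tokenizer(num_words=output_vocab_size, filters='')
demo_tokenizer.fit_on_texts(["<start> bonjour le monde <end>"])
print(demo_tokenizer.word_index)  # e.g. {'<start>': 1, 'bonjour': 2, 'le': 3, 'monde': 4, '<end>': 5}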

Texts and Tokenization

Next, we define our training sentences and tokenize them:

input_sequences = ["I am learning deep learning.", "This is a test sentence."]
output_sequences = ["<start> Je suis en train d'apprendre l'apprentissage profond. <end>",
                    "<start> Ceci est une phrase de test. <end>"]

input_tokenizer.fit_on_texts(input_sequences)
output_tokenizer.fit_on_texts(output_sequences)

input_sequences = input_tokenizer.texts_to_sequences(input_sequences)
output_sequences = output_tokenizer.texts_to_sequences(output_sequences)

input_sequences = pad_sequences(input_sequences, maxlen=max_encoder_seq_length, padding='post')
output_sequences = pad_sequences(output_sequences, maxlen=max_decoder_seq_length, padding='post')
  • <start> and <end> markers are included in the output sentences so the model knows where each translation begins and ends.
  • Padding ensures that all sequences have the same length, which is necessary for batch processing.
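
A quick sanity check (illustrative) shows that both arrays are now fixed-length integer matrices:

print(input_sequences.shape)    # (2, 20): 2 sentences padded to max_encoder_seq_length
print(output_sequences.shape)   # (2, 20): 2 sentences padded to max_decoder_seq_length
print(output_sequences[0][:5])  # the first few token ids of the first French sentence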

Step 3: Define the Model Components

Encoder

The encoder takes the input sequence and produces a series of hidden states and the final states:

encoder_inputs = Input(shape=(max_encoder_seq_length,))
encoder_embedding = Embedding(input_dim=input_vocab_size, output_dim=embedding_dim)(encoder_inputs)
encoder_lstm = LSTM(128, return_sequences=True, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
  • Input defines the placeholder for the input data.
  • Embedding converts input words into dense vector representations.
  • LSTM processes these embeddings, returning hidden states (encoder_outputs) for each time step and the final states (state_h, state_c). These final states are used to initialize the decoder.
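
If you want to confirm what the encoder produces, a quick shape check on the symbolic tensors looks like this (illustrative):

print(encoder_outputs.shape)  # (None, 20, 128): one 128-dim vector per input position
print(state_h.shape)          # (None, 128): final hidden state
print(state_c.shape)          # (None, 128): final cell state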

Decoder

The decoder generates the output sequence by taking the encoder’s hidden states and using them to predict each word in the target sequence.

decoder_inputs_layer = Input(shape=(max_decoder_seq_length,))
decoder_embedding_layer = Embedding(input_dim=output_vocab_size, output_dim=embedding_dim)
decoder_embedding = decoder_embedding_layer(decoder_inputs_layer)
decoder_lstm = LSTM(128, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=[state_h, state_c])
  • decoder_inputs_layer represents the inputs to the decoder (the target sequence during training).
  • The Embedding layer is kept as a named object, decoder_embedding_layer, so its trained weights can be reused by the inference decoder later; it converts each token into a dense vector.
  • The LSTM layer produces outputs for each time step, as well as the updated hidden states.

Step 4: Add Attention Mechanism

The attention mechanism allows the decoder to focus on different parts of the encoder’s output while generating each word:

attention = Dot(axes=[2, 2])([decoder_outputs, encoder_outputs])
attention = Activation('softmax')(attention)
context = Dot(axes=[2, 1])([attention, encoder_outputs])

decoder_combined_context = Concatenate(axis=-1)([context, decoder_outputs])
  • Dot calculates the similarity between encoder outputs and decoder outputs, providing attention scores.
  • Activation('softmax') normalizes these scores, turning them into probabilities.
  • Dot again uses these scores to compute a context vector as a weighted sum of the encoder outputs.
  • Concatenate combines the context vector with the current decoder output.
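
To make the tensor algebra concrete, here is a minimal NumPy sketch of the same dot-product (Luong-style) attention on toy shapes, reusing the np import from Step 1; the arrays are only stand-ins for the Keras tensors above:

dec = np.random.rand(1, 3, 128)   # stands in for decoder_outputs: 3 decoder steps
enc = np.random.rand(1, 4, 128)   # stands in for encoder_outputs: 4 encoder steps

scores = dec @ enc.transpose(0, 2, 1)                             # (1, 3, 4), like Dot(axes=[2, 2])
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax over encoder steps
context_np = weights @ enc                                        # (1, 3, 128), like Dot(axes=[2, 1])
combined = np.concatenate([context_np, dec], axis=-1)             # (1, 3, 256), like Concatenate
print(combined.shape)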

Step 5: Output Layer

The concatenated output is passed through a dense layer to generate predictions for each word in the output vocabulary:

decoder_dense = Dense(output_vocab_size, activation='softmax')
decoder_output_final = decoder_dense(decoder_combined_context)
  • Dense(output_vocab_size) is used to predict the next word’s probability distribution over the output vocabulary.
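
Because Dense operates on the last axis, it is applied independently at every decoder time step, which you can confirm from the output shape (an illustrative check):

print(decoder_output_final.shape)  # (None, 20, 10000): one distribution over the vocabulary per time step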

Step 6: Training Model Definition

We define the final model and compile it:

model = Model([encoder_inputs, decoder_inputs_layer], decoder_output_final)
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
model.summary()
  • Model is built with both encoder and decoder inputs.
  • sparse_categorical_crossentropy is used as the loss function because the targets are integer token ids rather than one-hot vectors.
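
As a quick illustration of what “sparse” means here (depth=10 is just for the demo; the real vocabulary size is output_vocab_size):

sparse_targets = np.array([[2, 7, 5, 0]])               # integer token ids, shape (1, 4)
one_hot_targets = tf.one_hot(sparse_targets, depth=10)  # what plain categorical_crossentropy would expect
print(sparse_targets.shape, one_hot_targets.shape)      # (1, 4) (1, 4, 10)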

Training the Model

We build the teacher-forcing inputs and targets from the preprocessed data and train the model:

# Teacher forcing: the decoder reads the target sequence and learns to predict
# the same sequence shifted one step to the left.
decoder_input_data = output_sequences
decoder_target_data = np.zeros_like(output_sequences)
decoder_target_data[:, :-1] = output_sequences[:, 1:]

model.fit(
    [input_sequences, decoder_input_data],
    decoder_target_data,
    batch_size=64,
    epochs=10,
    validation_split=0.2
)
  • The decoder input is the target sequence itself, and the training target is the same sequence shifted one step to the left (teacher forcing).
  • batch_size is set to 64, and we train for 10 epochs.
  • validation_split of 0.2 keeps a portion of the data for validation; with only two toy sentences this run is purely illustrative, and a real parallel corpus is needed for meaningful results.

Step 7: Inference Models for Translation

After training, we need separate encoder and decoder models for inference (translation).

Encoder Model for Inference

encoder_model = Model(encoder_inputs, [encoder_outputs, state_h, state_c])
  • This model takes the encoder input and produces the encoder outputs and final states, which are used to initialize the decoder.
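
For example, encoding the first padded training sentence yields exactly the tensors the inference decoder below will consume (an illustrative check):

enc_outs, h, c = encoder_model.predict(input_sequences[:1], verbose=0)
print(enc_outs.shape, h.shape, c.shape)  # (1, 20, 128) (1, 128) (1, 128)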

Decoder Model for Inference

decoder_state_input_h = Input(shape=(128,))
decoder_state_input_c = Input(shape=(128,))
encoder_output_input = Input(shape=(max_encoder_seq_length, 128))
decoder_inputs_single = Input(shape=(1,))  # one previous token per decoding step

# Reuse the trained layers instead of creating new, untrained ones.
decoder_embedding2 = decoder_embedding_layer(decoder_inputs_single)
decoder_outputs2, state_h2, state_c2 = decoder_lstm(decoder_embedding2, initial_state=[decoder_state_input_h, decoder_state_input_c])

attention2 = Dot(axes=[2, 2])([decoder_outputs2, encoder_output_input])
attention2 = Activation('softmax')(attention2)
context2 = Dot(axes=[2, 1])([attention2, encoder_output_input])

decoder_combined_context2 = Concatenate(axis=-1)([context2, decoder_outputs2])
decoder_output_final2 = decoder_dense(decoder_combined_context2)

decoder_model = Model(
    [decoder_inputs_single, encoder_output_input, decoder_state_input_h, decoder_state_input_c],
    [decoder_output_final2, state_h2, state_c2]
)
  • decoder_state_input_h and decoder_state_input_c carry the hidden and cell states fed into the LSTM at each step, while decoder_inputs_single carries the single previously generated token.
  • The trained embedding, LSTM, and dense layers are reused, so the inference decoder shares the weights learned during training.
  • This model is used iteratively to generate the translated output word by word, as sketched below.
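
To show how the two inference models fit together, here is a minimal greedy decoding loop. It is an illustrative sketch: translate is a hypothetical helper name, and it assumes the <start> and <end> markers are present in output_tokenizer.word_index exactly as prepared in Step 2.

def translate(sentence):
    # Encode the source sentence once.
    seq = input_tokenizer.texts_to_sequences([sentence])
    seq = pad_sequences(seq, maxlen=max_encoder_seq_length, padding='post')
    enc_outs, h, c = encoder_model.predict(seq, verbose=0)

    # Start from <start> and feed each predicted token back in.
    target_token = np.array([[output_tokenizer.word_index['<start>']]])
    decoded_words = []
    for _ in range(max_decoder_seq_length):
        probs, h, c = decoder_model.predict([target_token, enc_outs, h, c], verbose=0)
        token_id = int(np.argmax(probs[0, -1, :]))
        word = output_tokenizer.index_word.get(token_id, '')
        if word in ('', '<end>'):  # stop on padding/unknown id or the end marker
            break
        decoded_words.append(word)
        target_token = np.array([[token_id]])
    return ' '.join(decoded_words)

print(translate("I am learning deep learning."))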
