Day 19: Attention Mechanism for LSTM in Machine Translation
On Day 19 of our deep learning journey, we tackled a complex but fascinating concept—adding an attention mechanism to an LSTM model for machine translation. Below, I’ll guide you step by step through the process of building this model and provide explanations for each part of the code to make everything clear and approachable.
Step 1: Import Necessary Libraries
First, we import the essential libraries for data handling, model building, and training:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dot, LSTM, Dense, Embedding, Activation, Concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import tensorflow.keras.backend as K
- TensorFlow: Used to create and train the model.
- Dot and Concatenate: These layers help build the attention mechanism.
- Tokenizer and pad_sequences are used for text preprocessing.
Step 2: Data Preprocessing
We define some parameters related to the sequences, such as maximum length and vocabulary size, and preprocess the data.
max_encoder_seq_length = 20
max_decoder_seq_length = 20
input_vocab_size = 10000
output_vocab_size = 10000
embedding_dim = 128
- max_encoder_seq_length and max_decoder_seq_length: Define the maximum length of the input and output sequences.
- input_vocab_size and output_vocab_size: Vocabulary sizes for the input (English) and output (French) sentences.
- embedding_dim: Embedding vector size for the input and output sequences.
Tokenizer Setup
We initialize the tokenizers for both input and output sequences:
input_tokenizer = Tokenizer(num_words=input_vocab_size, filters='')
output_tokenizer = Tokenizer(num_words=output_vocab_size, filters='')
- filters='' ensures that special tokens like <start> and <end> are not filtered out during tokenization (the short check below shows what the default filters would do to them).
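As a quick aside (not part of the original pipeline), the default filters would strip the angle brackets and turn the markers into ordinary words:
demo = Tokenizer(num_words=output_vocab_size)   # default filters include '<' and '>'
demo.fit_on_texts(["<start> bonjour <end>"])
print(demo.word_index)                          # e.g. {'start': 1, 'bonjour': 2, 'end': 3}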
Texts and Tokenization
Next, we define our training sentences and tokenize them:
input_sequences = ["I am learning deep learning.", "This is a test sentence."]
output_sequences = ["<start> Je suis en train d'apprendre l'apprentissage profond. <end>",
"<start> Ceci est une phrase de test. <end>"]
input_tokenizer.fit_on_texts(input_sequences)
output_tokenizer.fit_on_texts(output_sequences)
input_sequences = input_tokenizer.texts_to_sequences(input_sequences)
output_sequences = output_tokenizer.texts_to_sequences(output_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_encoder_seq_length, padding='post')
output_sequences = pad_sequences(output_sequences, maxlen=max_decoder_seq_length, padding='post')
- <start> and <end> tokens are added to the output sequences so the model knows where the output begins and ends.
- Padding ensures that all sequences have the same length, which is necessary for batch processing. A quick sanity check of the resulting arrays follows.
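A quick sanity check of the preprocessed arrays (illustrative output):
print(input_sequences.shape)    # (2, 20): two sentences, each padded to max_encoder_seq_length
print(output_sequences.shape)   # (2, 20)
print(input_sequences[0])       # word ids followed by trailing zeros from the 'post' padding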
Step 3: Define the Model Components
Encoder
The encoder takes the input sequence and produces a series of hidden states and the final states:
encoder_inputs = Input(shape=(max_encoder_seq_length,))
encoder_embedding = Embedding(input_dim=input_vocab_size, output_dim=embedding_dim)(encoder_inputs)
encoder_lstm = LSTM(128, return_sequences=True, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
- Input defines the placeholder for the input data.
- Embedding converts the input words into dense vector representations.
- LSTM processes these embeddings, returning a hidden state for each time step (encoder_outputs) and the final states (state_h, state_c). These final states are used to initialize the decoder; the shape check below shows what each tensor looks like.
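A quick look at the symbolic shapes these layers produce (the batch dimension prints as None; exact formatting may vary slightly across TensorFlow versions):
print(encoder_outputs.shape)          # (None, 20, 128): one 128-dim hidden state per input time step
print(state_h.shape, state_c.shape)   # (None, 128) (None, 128): final hidden and cell states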
Decoder
The decoder generates the output sequence by taking the encoder’s hidden states and using them to predict each word in the target sequence.
decoder_inputs_layer = Input(shape=(max_decoder_seq_length,))
decoder_embedding = Embedding(input_dim=output_vocab_size, output_dim=embedding_dim)(decoder_inputs_layer)
decoder_lstm = LSTM(128, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=[state_h, state_c])
- decoder_inputs_layer represents the inputs to the decoder.
- The Embedding layer converts each target token into a dense vector.
- The LSTM layer, initialized with the encoder's final states, produces an output for each time step as well as its updated hidden states.
Step 4: Add Attention Mechanism
The attention mechanism allows the decoder to focus on different parts of the encoder’s output while generating each word:
attention = Dot(axes=[2, 2])([decoder_outputs, encoder_outputs])
attention = Activation('softmax')(attention)
context = Dot(axes=[2, 1])([attention, encoder_outputs])
decoder_combined_context = Concatenate(axis=-1)([context, decoder_outputs])
- The first Dot calculates the similarity between the decoder outputs and the encoder outputs, producing attention scores.
- Activation('softmax') normalizes these scores into probabilities over the encoder time steps.
- The second Dot uses these probabilities to compute a context vector as a weighted sum of the encoder outputs.
- Concatenate combines the context vector with the current decoder output; the NumPy sketch below mirrors this computation for a single example.
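To make the tensor algebra concrete, here is a small NumPy sketch of the same dot-product attention for a single example (the dot_attention helper is illustrative; the shapes match the 20-step sequences and 128-unit LSTMs above):
def dot_attention(decoder_states, encoder_states):
    # decoder_states: (T_dec, units), encoder_states: (T_enc, units)
    scores = decoder_states @ encoder_states.T              # Dot: (T_dec, T_enc) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax over encoder time steps
    context = weights @ encoder_states                      # weighted sum: (T_dec, units)
    return context, weights

dec = np.random.rand(20, 128)    # stand-in for one example's decoder_outputs
enc = np.random.rand(20, 128)    # stand-in for one example's encoder_outputs
context, weights = dot_attention(dec, enc)
print(context.shape, weights.shape)   # (20, 128) (20, 20)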
Step 5: Output Layer
The concatenated output is passed through a dense layer to generate predictions for each word in the output vocabulary:
decoder_dense = Dense(output_vocab_size, activation='softmax')
decoder_output_final = decoder_dense(decoder_combined_context)
- Dense(output_vocab_size, activation='softmax') predicts the next word's probability distribution over the output vocabulary at every time step.
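A final shape check on the training graph's output (formatting may vary by TensorFlow version):
print(decoder_output_final.shape)   # (None, 20, 10000): one distribution over the vocabulary per time step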
Step 6: Training Model Definition
We define the final model and compile it:
model = Model([encoder_inputs, decoder_inputs_layer], decoder_output_final)
model.compile(
optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
model.summary()
- Model is built with both encoder and decoder inputs.
- sparse_categorical_crossentropy is used as the loss function because our targets are integer word ids rather than one-hot vectors (a small comparison follows).
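Only the target format differs between the two losses, as this small check with a hypothetical five-word vocabulary illustrates:
y_true_ids = np.array([[1, 3, 0]])                             # integer word ids, shape (1, 3)
y_pred = tf.nn.softmax(tf.random.uniform((1, 3, 5)), axis=-1)  # fake per-step predictions over 5 words
sparse = tf.keras.losses.sparse_categorical_crossentropy(y_true_ids, y_pred)
dense = tf.keras.losses.categorical_crossentropy(tf.one_hot(y_true_ids, 5), y_pred)
print(np.allclose(sparse.numpy(), dense.numpy()))              # True: same loss, different target format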
Training the Model
We prepare the decoder inputs and targets (the target sequence offset by one position, i.e. teacher forcing) and train the model with the preprocessed data:
# Teacher forcing: at each step the decoder sees the previous target token
# and must predict the next one.
decoder_input_data = pad_sequences(output_sequences[:, :-1], maxlen=max_decoder_seq_length, padding='post')
decoder_target_data = pad_sequences(output_sequences[:, 1:], maxlen=max_decoder_seq_length, padding='post')
model.fit(
    [input_sequences, decoder_input_data],
    decoder_target_data,
    batch_size=64,
    epochs=10,
    validation_split=0.2
)
- The decoder input is the target sequence without its final token, so it starts with <start>; the training target is the same sequence shifted one step to the left. At every position the model therefore learns to predict the next word (teacher forcing); the tiny example below makes the offset concrete.
- batch_size is set to 64, and we train for 10 epochs.
- validation_split of 0.2 keeps a portion of the data for validation.
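A minimal, hypothetical illustration of that offset (the ids are made up; 1 stands for <start> and 2 for <end>):
toy = np.array([[1, 7, 8, 9, 2]])   # hypothetical ids: 1 = <start>, 2 = <end>
print(toy[:, :-1])                  # decoder input : [[1 7 8 9]], starts with <start>
print(toy[:, 1:])                   # training target: [[7 8 9 2]], the next word at every position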
Step 7: Inference Models for Translation
After training, we need separate encoder and decoder models for inference (translation).
Encoder Model for Inference
encoder_model = Model(encoder_inputs, [encoder_outputs, state_h, state_c])
- This model takes the encoder input and produces the encoder outputs and final states, which are used to initialize the decoder.
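For example, encoding a new source sentence takes a single predict call (illustrative usage; the shapes follow from the layer sizes above):
new_seq = input_tokenizer.texts_to_sequences(["I am learning deep learning."])
new_seq = pad_sequences(new_seq, maxlen=max_encoder_seq_length, padding='post')
enc_outs, enc_h, enc_c = encoder_model.predict(new_seq)
print(enc_outs.shape, enc_h.shape, enc_c.shape)   # (1, 20, 128) (1, 128) (1, 128)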
Decoder Model for Inference
decoder_state_input_h = Input(shape=(128,))
decoder_state_input_c = Input(shape=(128,))
encoder_output_input = Input(shape=(max_encoder_seq_length, 128))
decoder_embedding2 = decoder_embedding  # reuse the embedding from training; a fresh Embedding layer here would have untrained weights
decoder_outputs2, state_h2, state_c2 = decoder_lstm(decoder_embedding2, initial_state=[decoder_state_input_h, decoder_state_input_c])
attention2 = Dot(axes=[2, 2])([decoder_outputs2, encoder_output_input])
attention2 = Activation('softmax')(attention2)
context2 = Dot(axes=[2, 1])([attention2, encoder_output_input])
decoder_combined_context2 = Concatenate(axis=-1)([context2, decoder_outputs2])
decoder_output_final2 = decoder_dense(decoder_combined_context2)
decoder_model = Model(
[decoder_inputs_layer, encoder_output_input, decoder_state_input_h, decoder_state_input_c],
[decoder_output_final2, state_h2, state_c2]
)
- decoder_state_input_h and decoder_state_input_c represent the hidden and cell states fed into the LSTM at each decoding step.
- This model is used iteratively to generate the translated output word by word, as the greedy-decoding sketch below shows.
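Putting the two inference models together, here is a minimal greedy-decoding sketch (the translate helper is illustrative, not part of the original code). Because the inference decoder above still expects a full-length decoder input, the growing target prefix is re-fed at every step and the prediction at the most recently filled position is read out, while the encoder outputs and states are passed in unchanged:
def translate(sentence):
    # Tokenize and pad the source sentence
    seq = input_tokenizer.texts_to_sequences([sentence])
    seq = pad_sequences(seq, maxlen=max_encoder_seq_length, padding='post')
    # Run the encoder once to get its outputs and final states
    enc_outs, enc_h, enc_c = encoder_model.predict(seq, verbose=0)
    # The target prefix starts with <start> and grows by one word per step
    target_seq = np.zeros((1, max_decoder_seq_length))
    target_seq[0, 0] = output_tokenizer.word_index['<start>']
    decoded_words = []
    for t in range(1, max_decoder_seq_length):
        output_tokens, _, _ = decoder_model.predict(
            [target_seq, enc_outs, enc_h, enc_c], verbose=0)
        # The distribution at position t-1 predicts the word at position t
        sampled_id = int(np.argmax(output_tokens[0, t - 1, :]))
        sampled_word = output_tokenizer.index_word.get(sampled_id, '')
        if sampled_word in ('<end>', ''):
            break
        decoded_words.append(sampled_word)
        target_seq[0, t] = sampled_id
    return ' '.join(decoded_words)

print(translate("I am learning deep learning."))
With only two training sentences this will not produce a real translation, but it shows the full loop: encode once, then repeatedly call the decoder model and append the most probable word until <end> is produced or the maximum length is reached.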