Day 18: Understanding Attention Mechanism for LSTM in Machine Translation
Welcome to Day 18 of our deep learning challenge! Today, we will explore the theory behind the attention mechanism and understand how it can be added to an LSTM model for machine translation. We’ll focus on building a foundation that will help us implement this mechanism in code tomorrow (Day 19).
Why Attention?
In machine translation, such as translating text from English to French, a standard encoder-decoder LSTM reads the entire input sentence, compresses it into a fixed-size vector, and then generates the translation from that vector. The problem is that, as the sequence gets longer, this single vector becomes a bottleneck: information from earlier words gets diluted or lost, which makes it difficult for the model to handle long sentences properly.
The attention mechanism helps solve this problem by allowing the model to focus on relevant parts of the input sentence at every step of the output generation. Essentially, it’s like the model asking itself, “Which words in the input sentence are most important for generating the next word?” and then focusing on those specific parts.
What is Attention Mechanism?
The attention mechanism is a method that allows the model to selectively focus on different parts of the input sequence. Instead of relying solely on the final output of an LSTM (which might be missing some important details), the attention mechanism assigns weights to each input word to decide how much attention should be paid to each word when producing the next word in the output.
Think of it like reading a book and summarizing it: instead of trying to remember the entire book, you selectively look back at the important sections that are most relevant to the summary you are writing at that moment. This helps you make a better, context-aware summary.
How Does Attention Work in LSTM?
Attention works in three main steps:
- Scoring: Calculate a score for each input word to determine how important it is for the current output word.
- Weighting: Use these scores to generate weights, which determine how much attention each word should receive.
- Context Vector: Use the weights to create a context vector, which is a weighted combination of all the input words. This context vector is then used along with the LSTM’s hidden state to predict the next word.
Let’s break down these steps in a simpler way:
1. Scoring the Input Words
The model computes a score for each word in the input sentence to determine its relevance for generating the current output word. This score is usually computed based on the current hidden state of the LSTM (i.e., what the LSTM already knows) and each of the input words.
- Think of this as the model deciding which parts of the input are important based on the current output that it’s generating.
- The score can be computed using a simple neural network layer that takes the LSTM’s hidden state and each input word as inputs and produces a score (see the sketch after this list).
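For the curious, here is a minimal sketch of what such a scoring layer could look like (roughly the idea behind “additive” attention). The matrices W_hidden and W_input and the vector v are random placeholders standing in for parameters that a real model would learn during training:

```python
import numpy as np

hidden_size = 4
rng = np.random.default_rng(42)

# Random placeholders for parameters that a real model would learn.
W_hidden = rng.normal(size=(hidden_size, hidden_size))
W_input = rng.normal(size=(hidden_size, hidden_size))
v = rng.normal(size=hidden_size)

def score(decoder_hidden, input_vector):
    # A tiny one-layer network: project both vectors, add them,
    # squash with tanh, then reduce the result to a single number.
    return v @ np.tanh(W_hidden @ decoder_hidden + W_input @ input_vector)

decoder_hidden = rng.normal(size=hidden_size)      # what the LSTM "knows" right now
input_vectors = rng.normal(size=(4, hidden_size))  # one vector per input word

scores = np.array([score(decoder_hidden, x) for x in input_vectors])
print(scores)  # one relevance score per input word
```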
Scoring Calculation
To understand how scores are calculated in an attention mechanism:
- The score of each input word indicates how relevant that word is to the current output word that the model is generating.
- The score is often calculated with a very simple operation, like taking the dot product between the current hidden state of the LSTM and the vector representing each input word (its embedding, or the encoder’s hidden state for that word). The dot product essentially measures similarity: the higher the dot product, the more similar the two vectors are.
For example, imagine we have an input sentence with four words: ["I", "am", "learning", "LSTMs"]. Let’s say our LSTM is trying to generate the next word in the translated output. The LSTM’s current hidden state is compared with the representation of each of the input words to get scores like [2.0, 0.5, 3.0, 1.5]. This means that the word “learning” has the highest score of 3.0, indicating it’s the most relevant at this point.
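Here is a minimal NumPy sketch of dot-product scoring. The hidden state below is a made-up toy value (the word vectors reuse the embeddings from the context vector example later on), so the resulting scores are illustrative rather than the exact [2.0, 0.5, 3.0, 1.5] above:

```python
import numpy as np

# Toy decoder hidden state: what the LSTM "knows" at this point (made-up values).
hidden_state = np.array([0.5, 1.0, -0.5, 1.5])

# One vector per input word ["I", "am", "learning", "LSTMs"]
# (in practice these would be encoder hidden states or embeddings).
input_vectors = np.array([
    [0.1, 0.2, 0.3, 0.4],   # "I"
    [0.0, 0.1, 0.1, 0.1],   # "am"
    [0.4, 0.5, 0.5, 0.6],   # "learning"
    [0.3, 0.3, 0.2, 0.4],   # "LSTMs"
])

# Dot product of the hidden state with each word vector: higher score = more relevant.
scores = input_vectors @ hidden_state
print(scores)  # one score per input word
```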
2. Generating Weights
The scores are then normalized using a technique called softmax. Softmax converts these scores into values between 0 and 1, such that all the values add up to 1. These values are called attention weights.
- The higher the weight, the more attention that word will receive.
- The softmax function makes sure that the weights are easy to interpret as probabilities, which helps the model decide how much each word contributes to generating the current output word.
Generating Weights Calculation
- Once we have the scores for each input word, we pass them through a softmax function to convert them into attention weights. These weights are values between 0 and 1 that add up to 1. For example, applying softmax to the scores [2.0, 0.5, 3.0, 1.5] gives weights of roughly [0.22, 0.05, 0.60, 0.13]; to keep the arithmetic in the next step easy to follow, we’ll use the rounded illustrative weights [0.25, 0.10, 0.45, 0.20].
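In code, the softmax step is just a few lines of NumPy. Note that the exact output differs slightly from the rounded weights we use for the hand calculation in the next step:

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 0.5, 3.0, 1.5])
weights = softmax(scores)

print(weights)        # approximately [0.22, 0.05, 0.60, 0.13]
print(weights.sum())  # 1.0 -- attention weights always sum to 1
```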
3. Creating the Context Vector
The model then uses these attention weights to compute a weighted sum of all the input words. This weighted sum is called the context vector.
- The context vector is essentially a summary of all the input words, but it focuses more on the words with higher attention weights.
- This context vector is then combined with the LSTM’s current hidden state to generate the final output for the current time step.
Context Vector Example
Each input word is represented by a vector of numbers called an embedding. Let’s say the embeddings are:
- “I”: [0.1, 0.2, 0.3, 0.4]
- “am”: [0.0, 0.1, 0.1, 0.1]
- “learning”: [0.4, 0.5, 0.5, 0.6]
- “LSTMs”: [0.3, 0.3, 0.2, 0.4]
The context vector is calculated by taking a weighted sum of these embeddings, using the attention weights from the previous step:
- For “I”: 0.25 * [0.1, 0.2, 0.3, 0.4] = [0.025, 0.05, 0.075, 0.1]
- For “am”: 0.10 * [0.0, 0.1, 0.1, 0.1] = [0.0, 0.01, 0.01, 0.01]
- For “learning”: 0.45 * [0.4, 0.5, 0.5, 0.6] = [0.18, 0.225, 0.225, 0.27]
- For “LSTMs”: 0.20 * [0.3, 0.3, 0.2, 0.4] = [0.06, 0.06, 0.04, 0.08]
Now, add all these weighted embeddings together:
[0.025, 0.05, 0.075, 0.1] + [0.0, 0.01, 0.01, 0.01] + [0.18, 0.225, 0.225, 0.27] + [0.06, 0.06, 0.04, 0.08] = [0.265, 0.345, 0.35, 0.46]
This final vector, [0.265, 0.345, 0.35, 0.46], is called the context vector. It serves as a summary of the input sentence with a focus on the important words, and it will be used along with the LSTM’s current hidden state to generate the next word in the translated sentence.
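The same weighted sum takes only a couple of lines in NumPy, using the rounded illustrative weights from the previous step:

```python
import numpy as np

# Toy embeddings for ["I", "am", "learning", "LSTMs"] from the example above.
embeddings = np.array([
    [0.1, 0.2, 0.3, 0.4],   # "I"
    [0.0, 0.1, 0.1, 0.1],   # "am"
    [0.4, 0.5, 0.5, 0.6],   # "learning"
    [0.3, 0.3, 0.2, 0.4],   # "LSTMs"
])

# Rounded illustrative attention weights from the previous step.
weights = np.array([0.25, 0.10, 0.45, 0.20])

# Context vector = weighted sum of the embeddings.
context_vector = weights @ embeddings
print(context_vector)  # approximately [0.265, 0.345, 0.35, 0.46]
```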
Simple Analogy: Attention in a Conversation
Imagine you are translating a long paragraph from English to French. Each time you translate a sentence, you might want to look back at specific words or phrases in the original paragraph. You don’t try to keep everything in your head at once—you selectively look back to find the parts that are most relevant to what you are currently translating.
- The attention mechanism in an LSTM works similarly: it looks back at the input sequence and decides which parts are important for generating each word of the output.
Attention in Machine Translation
In machine translation with LSTMs and attention:
- The encoder reads the entire input sentence and produces a sequence of hidden states.
- At each step of the decoder (which generates the translated sentence), the attention mechanism helps decide which parts of the input sentence are most relevant to the current word being generated.
- This means the model doesn’t just rely on a single hidden state at the end of the input sentence; it uses information from all of the hidden states, focusing more on the most important ones. The sketch below puts these pieces together into a single decoder step.
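Here is a rough NumPy sketch of how the three steps fit into one decoder time step. The output projection W_out, the vocabulary size, and all the toy values are made-up placeholders; a real model (like the one we’ll build on Day 19) learns these parameters during training:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_decoder_step(decoder_hidden, encoder_states, W_out):
    """One decoder time step with dot-product attention (toy, NumPy-only).

    decoder_hidden : shape (h,)      current decoder hidden state
    encoder_states : shape (T, h)    one encoder hidden state per input word
    W_out          : shape (V, 2*h)  output projection (learned in a real model)
    """
    # 1. Scoring: dot product between the decoder state and each encoder state.
    scores = encoder_states @ decoder_hidden              # (T,)
    # 2. Weighting: softmax turns scores into attention weights that sum to 1.
    weights = softmax(scores)                             # (T,)
    # 3. Context vector: weighted sum of the encoder states.
    context = weights @ encoder_states                    # (h,)
    # Combine the context with the decoder state to predict the next word.
    combined = np.concatenate([decoder_hidden, context])  # (2*h,)
    word_probs = softmax(W_out @ combined)                # (V,)
    return word_probs, weights

# Tiny toy setup: 4 input words, hidden size 4, vocabulary of 6 words.
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(4, 4))
decoder_hidden = rng.normal(size=4)
W_out = rng.normal(size=(6, 8))

probs, attn = attention_decoder_step(decoder_hidden, encoder_states, W_out)
print(attn)   # which input words the decoder attended to at this step
print(probs)  # probability of each word in the toy vocabulary
```

With random placeholder weights the output probabilities are meaningless; the point of the sketch is the data flow: score, weight, build a context vector, then combine it with the decoder’s hidden state to predict the next word.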
Types of Attention
There are a few common types of attention used in machine translation:
- Global Attention: The model looks at all input words when generating each output word.
- Local Attention: The model looks at a small subset of input words, which makes it more efficient and sometimes more accurate for longer sequences.
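To make the difference concrete, here is a small sketch: global attention softmaxes over every score, while one simplified form of local attention keeps only the scores inside a window around a chosen position and masks out the rest. The center position and window size below are arbitrary toy choices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 0.5, 3.0, 1.5])

# Global attention: every input position gets a weight.
global_weights = softmax(scores)

# Local attention (simplified): only a window of positions around a
# center position p is considered; everything else is masked out.
p, window = 2, 1                      # toy choices: center on the 3rd word, +/- 1 word
mask = np.full_like(scores, -np.inf)
mask[max(0, p - window): p + window + 1] = 0.0
local_weights = softmax(scores + mask)

print(global_weights)  # non-zero weight for all 4 words
print(local_weights)   # non-zero weight only inside the window
```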
Summary
- Problem with LSTMs: LSTMs struggle with remembering long sequences because they have to store all the information in a single vector.
- Attention Mechanism: Allows the model to focus on relevant parts of the input sequence, dynamically deciding which parts to pay more attention to when generating each word.
- Key Steps: Calculate scores, generate attention weights, create a context vector.
- Usefulness: Attention makes LSTMs much better at handling long sentences and complex relationships in sequences by selectively remembering important information.
Looking Forward to Day 19
Tomorrow, we’ll implement this attention mechanism in code to build a complete machine translation model using LSTMs. We’ll see how we can integrate attention to make our LSTM more powerful and accurate for translating text.