Day 41 - RNN Problems & Why Transformers Are Needed
Date: 2025-11-03 (Monday)
Status: “Done”
The Problem with RNNs: Sequential Processing Bottleneck
RNNs dominated NLP for years, but they have fundamental limitations that transformers were designed to solve. Let’s explore these issues.
Problem 1: Sequential Computation
RNNs must process inputs one step at a time, sequentially:
Translation Example (English → French):
Input: "I am happy"
Time Step 1: Process "I"
Time Step 2: Process "am"
Time Step 3: Process "happy"
Impact:
- If your sentence has 5 words → 5 sequential steps required
- If your sentence has 1000 words → 1000 sequential steps required
- Cannot parallelize! Must wait for step t-1 before computing step t
Why This Matters
- Modern GPUs and TPUs are designed for parallel computation
- Sequential RNNs cannot take advantage of this parallelization
- Training becomes much slower than necessary
- Longer sequences = proportionally longer training time (the number of sequential steps grows linearly with sequence length)
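As a concrete illustration (my own sketch, not from the notes), here is a vanilla RNN forward pass in NumPy with made-up sizes; the loop makes the dependency explicit: step t cannot start until step t−1 has produced its hidden state.

```python
import numpy as np

# Toy vanilla RNN forward pass with illustrative (made-up) sizes.
rng = np.random.default_rng(0)
T, d_in, d_h = 1000, 32, 64            # sequence length, input dim, hidden dim
x = rng.normal(size=(T, d_in))         # one input vector per time step
W_xh = rng.normal(size=(d_in, d_h)) * 0.1
W_hh = rng.normal(size=(d_h, d_h)) * 0.1

h = np.zeros(d_h)
for t in range(T):
    # h at step t needs h from step t-1, so these 1000 updates
    # must run one after another and cannot be parallelized.
    h = np.tanh(x[t] @ W_xh + h @ W_hh)
```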
Problem 2: Vanishing Gradient Problem
The Root Cause
When RNNs backpropagate through many time steps, gradients get multiplied repeatedly:
Gradient Flow Through T Steps:
∂Loss/∂h₀ = ∂Loss/∂hₜ × (∂hₜ/∂hₜ₋₁) × (∂hₜ₋₁/∂hₜ₋₂) × ... × (∂h₁/∂h₀)
If each ∂hᵢ/∂hᵢ₋₁ < 1 (which it often is):
- After T multiplications: gradient ≈ 0.5^T; for T = 100, that is 0.5^100 ≈ 8 × 10⁻³¹ ≈ 0
- Gradient vanishes to zero
- Model can’t learn long-range dependencies
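A quick numerical sketch (not from the notes) of why this happens: multiplying many per-step factors that are each below 1 drives the product toward zero very quickly.

```python
# Stand-in for the product of dh_t/dh_{t-1} terms, each assumed to be ~0.5.
factor = 0.5
for T in (10, 50, 100):
    grad = factor ** T
    print(f"T={T:4d}  gradient ≈ {grad:.3e}")

# Output:
# T=  10  gradient ≈ 9.766e-04
# T=  50  gradient ≈ 8.882e-16
# T= 100  gradient ≈ 7.889e-31
```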
Concrete Example
Sentence: “The students who studied hard… passed the exam”
- Early word “students” needs to influence prediction of “passed”
- But gradient has vanished by the time it reaches “students”
- Model fails to learn this relationship!
Current Workarounds
LSTMs and GRUs mitigate this somewhat with gating mechanisms, but:
- Still have issues with very long sequences (>100-200 words)
- Cannot fully solve the problem
- Still require sequential processing
The Compression Issue
Sequence-to-Sequence Architecture:
Encoder: Word₁, Word₂, …, Wordₜ → h₁ → h₂ → … → hₜ (final hidden state)
Decoder: hₜ → Word₁' → Word₂' → Word₃' → ...
The Bottleneck:
All information from the entire input sequence is compressed into a single vector hₜ (the final hidden state).
Why This Fails
Example Sentence: “The government of the United States of America announced…”
When encoding this sentence:
- First word “The” is processed
- Information flows through states: h₁ → h₂ → h₃ → … → hₜ
- By the final state hₜ, information about “The” has been diluted or lost
- Only hₜ is passed to the decoder
- Decoder has limited context about early words
The Consequence
- Long sequences lose information
- Important early context gets diluted
- Model struggles with long documents
- Translation quality degrades for long sentences
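The bottleneck is easy to see in code. A minimal encoder sketch (my own, with made-up sizes): whether the input has 5 words or 500, the decoder only ever receives the same fixed-size final vector.

```python
import numpy as np

def encode(inputs, W_xh, W_hh):
    """Run a toy RNN encoder and return only its final hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x_t in inputs:                        # read the whole sequence...
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h                                  # ...but hand back a single vector

rng = np.random.default_rng(0)
d_in, d_h = 16, 32
W_xh = rng.normal(size=(d_in, d_h)) * 0.1
W_hh = rng.normal(size=(d_h, d_h)) * 0.1

short_seq = rng.normal(size=(5, d_in))        # 5 "words"
long_seq = rng.normal(size=(500, d_in))       # 500 "words"

# Both summaries are the same size, so the long input is squeezed into
# exactly as many numbers as the short one.
print(encode(short_seq, W_xh, W_hh).shape)    # (32,)
print(encode(long_seq, W_xh, W_hh).shape)     # (32,)
```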
Summary: Why RNNs Have Fundamental Issues
| Issue | Impact | Current Solution |
| --- | --- | --- |
| Sequential Processing | Cannot parallelize, slow training | N/A - fundamental to RNN design |
| Vanishing Gradients | Can’t learn long-range dependencies | LSTM/GRU gates (partial fix) |
| Information Bottleneck | Early information gets lost | Attention mechanism (partial fix) |
The Solution: Transformers
Introduced in 2017 by Google researchers (Vaswani et al., “Attention Is All You Need”), transformers replace recurrent hidden states entirely with attention mechanisms.
Key Differences
| Aspect | RNNs | Transformers |
| --- | --- | --- |
| Processing | Sequential (one word at a time) | Parallel (all words at once) |
| Dependency | Each word depends on previous state | All words attend to all words |
| Training Speed | Slow (sequential) | Fast (parallel) |
| Long Sequences | Vanishing gradients | No sequential bottleneck |
| Long-range Deps. | Difficult | Easy (direct attention) |
How Transformers Work
- No RNN: Remove sequential hidden states completely
- Pure Attention: Let each word “attend to” every other word
- Positional Encoding: Add position information, since attention alone has no notion of word order (see the sketch after this list)
- Parallel Processing: Process the entire sequence at once
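Not part of the original notes: a minimal NumPy sketch of the sinusoidal positional encoding used in the original transformer paper. The function name and dimensions are illustrative only.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

# Each row gets added to the matching word embedding, so the model still
# knows "I" came before "am" even though attention sees all words at once.
print(positional_encoding(3, 8).round(2))
```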
Example:
Sentence: "I am happy"
Instead of:
Step 1: Process "I" → h₁
Step 2: Process "am" with h₁ → h₂
Step 3: Process "happy" with h₂ → h₃
Transformer Does:
Parallel: "I" attends to {"I", "am", "happy"}
Parallel: "am" attends to {"I", "am", "happy"}
Parallel: "happy" attends to {"I", "am", "happy"}
All at once!
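To make the contrast concrete, here is a rough NumPy sketch of scaled dot-product self-attention (my own illustration with made-up dimensions, not code from the notes): every token scores every other token in one matrix multiplication, with no time-step loop.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8                        # "I am happy" -> 3 token embeddings
X = rng.normal(size=(seq_len, d_model))        # stand-in embeddings (+ positions)

# Learned projections in a real model; random here just to show the shapes.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_model)            # all pairs scored at once, no loop
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # softmax over each row
output = weights @ V                           # (3, 8): one context vector per token

print(weights.round(2))  # row i = how much token i attends to {"I", "am", "happy"}
```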
Why Transformers Win
- Speed: Can train much faster on GPUs/TPUs (parallel)
- Scalability: Can handle very long sequences (no fixed-size bottleneck)
- Long-range: Direct attention paths, so gradients don’t have to survive hundreds of sequential steps
- Versatility: Works for translation, classification, QA, summarization, chatbots…
- Performance: Achieves state-of-the-art results on nearly every NLP task
Transformers are used for:
- Translation (Neural Machine Translation) - High quality, fast
- Text Summarization (Abstractive & Extractive)
- Named Entity Recognition (NER) - Identify entities better
- Question Answering - Understand context better
- Chatbots & Voice Assistants
- Sentiment Analysis - Understand emotion/opinion
- Auto-completion - Smart suggestions
- Classification - Classify text into categories
- Market Intelligence - Analyze market sentiment
GPT (Generative Pre-trained Transformer)
- Created by: OpenAI
- Type: Decoder-only transformer
- Speciality: Text generation
- Famous for: Generating human-like text (even fooled journalists in 2019!)
BERT (Bidirectional Encoder Representations from Transformers)
- Created by: Google AI
- Type: Encoder-only transformer
- Speciality: Text understanding & representations
- Use: Classification, NER, QA
T5 (Text-to-Text Transfer Transformer)
- Created by: Google
- Type: Full encoder-decoder (like original transformer)
- Speciality: Multi-task learning
- Super versatile: Single model handles translation, classification, QA, summarization, regression
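Not part of the original notes: a small sketch of trying these three model families with the Hugging Face `transformers` library (assumes `pip install transformers` plus a PyTorch or TensorFlow backend; the model names are common public checkpoints, not ones these notes specify).

```python
from transformers import pipeline

# BERT-style encoder model, here via the default sentiment-analysis pipeline.
classifier = pipeline("sentiment-analysis")
print(classifier("I am happy"))

# T5 encoder-decoder model treating translation as a text-to-text task.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("I am happy"))

# GPT-2, a decoder-only model, generating a continuation.
generator = pipeline("text-generation", model="gpt2")
print(generator("The students who studied hard", max_new_tokens=20))
```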