Day 41 - RNN Problems & Why Transformers Are Needed
Date: 2025-11-03 (Monday)
Status: “Done”
The Problem with RNNs: Sequential Processing Bottleneck
RNNs dominated NLP for years, but they have fundamental limitations that transformers were designed to solve. Let’s explore these issues.
Problem 1: Sequential Computation
RNNs must process inputs one step at a time, sequentially:
Translation Example (English → French):
Input: "I am happy"
Time Step 1: Process "I"
Time Step 2: Process "am"
Time Step 3: Process "happy"
Impact:
- If your sentence has 5 words → 5 sequential steps required
- If your sentence has 1000 words → 1000 sequential steps required
- Cannot parallelize! Must wait for step t-1 before computing step t
Why This Matters
- Modern GPUs and TPUs are designed for parallel computation
- Sequential RNNs cannot take advantage of this parallelization
- Training becomes much slower than necessary
- Longer sequences = proportionally longer training time (the number of sequential steps grows linearly with sequence length)
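As a concrete illustration (my own sketch, not from the notes), here is a vanilla RNN forward pass in NumPy with made-up sizes; the loop makes the dependency explicit: step t cannot start until step t−1 has produced its hidden state.

```python
import numpy as np

# Toy vanilla RNN forward pass with illustrative (made-up) sizes.
rng = np.random.default_rng(0)
T, d_in, d_h = 1000, 32, 64            # sequence length, input dim, hidden dim
x = rng.normal(size=(T, d_in))         # one input vector per time step
W_xh = rng.normal(size=(d_in, d_h)) * 0.1
W_hh = rng.normal(size=(d_h, d_h)) * 0.1

h = np.zeros(d_h)
for t in range(T):
    # h at step t needs h from step t-1, so these 1000 updates
    # must run one after another and cannot be parallelized.
    h = np.tanh(x[t] @ W_xh + h @ W_hh)
```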
Problem 2: Vanishing Gradient Problem
The Root Cause
When RNNs backpropagate through many time steps, gradients get multiplied repeatedly:
Gradient Flow Through T Steps:
∂Loss/∂h₀ = ∂Loss/∂hₜ × (∂hₜ/∂hₜ₋₁) × (∂hₜ₋₁/∂hₜ₋₂) × ... × (∂h₁/∂h₀)
If each ∂hᵢ/∂hᵢ₋₁ < 1 (which it often is):
- After T multiplications: gradient ≈ 0.5^T; for T = 100, that is 0.5^100 ≈ 8 × 10⁻³¹ ≈ 0
- Gradient vanishes to zero
- Model can’t learn long-range dependencies
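A quick numerical sketch (not from the notes) of why this happens: multiplying many per-step factors that are each below 1 drives the product toward zero very quickly.

```python
# Stand-in for the product of dh_t/dh_{t-1} terms, each assumed to be ~0.5.
factor = 0.5
for T in (10, 50, 100):
    grad = factor ** T
    print(f"T={T:4d}  gradient ≈ {grad:.3e}")

# Output:
# T=  10  gradient ≈ 9.766e-04
# T=  50  gradient ≈ 8.882e-16
# T= 100  gradient ≈ 7.889e-31
```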
Concrete Example
Sentence: “The students who studied hard… passed the exam”
- Early word “students” needs to influence prediction of “passed”
- But gradient has vanished by the time it reaches “students”
- Model fails to learn this relationship!
Current Workarounds
LSTMs and GRUs mitigate this somewhat with gating mechanisms, but:
- Still have issues with very long sequences (>100-200 words)
- Cannot fully solve the problem
- Still require sequential processing
The Compression Issue
Sequence-to-Sequence Architecture:
Encoder: Word₁, Word₂, …, Wordₜ → h₁ → h₂ → … → hₜ (final hidden state)
Decoder: hₜ → Word₁' → Word₂' → Word₃' → ...
The Bottleneck:
All information from the entire input sequence is compressed into a single vector hₜ (the final hidden state).
Why This Fails
Example Sentence: “The government of the United States of America announced…”
When encoding this sentence:
- First word “The” is processed
- Information flows through states: h₁ → h₂ → h₃ → … → hₜ
- By the final state hₜ, information about “The” has been diluted or lost
- Only hₜ is passed to the decoder
- Decoder has limited context about early words
The Consequence
- Long sequences lose information
- Important early context gets diluted
- Model struggles with long documents
- Translation quality degrades for long sentences
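The bottleneck is easy to see in code. A minimal encoder sketch (my own, with made-up sizes): whether the input has 5 words or 500, the decoder only ever receives the same fixed-size final vector.

```python
import numpy as np

def encode(inputs, W_xh, W_hh):
    """Run a toy RNN encoder and return only its final hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x_t in inputs:                        # read the whole sequence...
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h                                  # ...but hand back a single vector

rng = np.random.default_rng(0)
d_in, d_h = 16, 32
W_xh = rng.normal(size=(d_in, d_h)) * 0.1
W_hh = rng.normal(size=(d_h, d_h)) * 0.1

short_seq = rng.normal(size=(5, d_in))        # 5 "words"
long_seq = rng.normal(size=(500, d_in))       # 500 "words"

# Both summaries are the same size, so the long input is squeezed into
# exactly as many numbers as the short one.
print(encode(short_seq, W_xh, W_hh).shape)    # (32,)
print(encode(long_seq, W_xh, W_hh).shape)     # (32,)
```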
Summary: Why RNNs Have Fundamental Issues
| Issue | Impact | Current Solution |
| --- | --- | --- |
| Sequential Processing | Cannot parallelize, slow training | N/A - fundamental to RNN design |
| Vanishing Gradients | Can’t learn long-range dependencies | LSTM/GRU gates (partial fix) |
| Information Bottleneck | Early information gets lost | Attention mechanism (partial fix) |
The Solution: Transformers
Introduced in 2017 by Google researchers (Vaswani et al., “Attention Is All You Need”), transformers replace recurrent hidden states entirely with attention mechanisms.
Key Differences
| Aspect | RNNs | Transformers |
| --- | --- | --- |
| Processing | Sequential (one word at a time) | Parallel (all words at once) |
| Dependency | Each word depends on previous state | All words attend to all words |
| Training Speed | Slow (sequential) | Fast (parallel) |
| Long Sequences | Vanishing gradients | No sequential bottleneck |
| Long-range Deps. | Difficult | Easy (direct attention) |
How Transformers Work
- No RNN: Remove sequential hidden states completely
- Pure Attention: Let each word “attend to” every other word
- Positional Encoding: Add position information, since attention alone has no notion of word order (see the sketch after this list)
- Parallel Processing: Process the entire sequence at once
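Not part of the original notes: a minimal NumPy sketch of the sinusoidal positional encoding used in the original transformer paper. The function name and dimensions are illustrative only.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

# Each row gets added to the matching word embedding, so the model still
# knows "I" came before "am" even though attention sees all words at once.
print(positional_encoding(3, 8).round(2))
```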
Example:
Sentence: "I am happy"
Instead of:
Step 1: Process "I" → h₁
Step 2: Process "am" with h₁ → h₂
Step 3: Process "happy" with h₂ → h₃
Transformer Does:
Parallel: "I" attends to {"I", "am", "happy"}
Parallel: "am" attends to {"I", "am", "happy"}
Parallel: "happy" attends to {"I", "am", "happy"}
All at once!
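To make the contrast concrete, here is a rough NumPy sketch of scaled dot-product self-attention (my own illustration with made-up dimensions, not code from the notes): every token scores every other token in one matrix multiplication, with no time-step loop.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8                        # "I am happy" -> 3 token embeddings
X = rng.normal(size=(seq_len, d_model))        # stand-in embeddings (+ positions)

# Learned projections in a real model; random here just to show the shapes.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_model)            # all pairs scored at once, no loop
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # softmax over each row
output = weights @ V                           # (3, 8): one context vector per token

print(weights.round(2))  # row i = how much token i attends to {"I", "am", "happy"}
```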
Why Transformers Win
- Speed: Can train much faster on GPUs/TPUs (parallel)
- Scalability: Can handle very long sequences (no fixed-size bottleneck)
- Long-range: Direct attention paths, so gradients don’t have to survive hundreds of sequential steps
- Versatility: Works for translation, classification, QA, summarization, chatbots…
- Performance: Achieves state-of-the-art results on nearly every NLP task
Transformers are used for:
- Translation (Neural Machine Translation) - High quality, fast
- Text Summarization (Abstractive & Extractive)
- Named Entity Recognition (NER) - Identify entities better
- Question Answering - Understand context better
- Chatbots & Voice Assistants
- Sentiment Analysis - Understand emotion/opinion
- Auto-completion - Smart suggestions
- Classification - Classify text into categories
- Market Intelligence - Analyze market sentiment
GPT (Generative Pre-trained Transformer)
- Created by: OpenAI
- Type: Decoder-only transformer
- Speciality: Text generation
- Famous for: Generating human-like text (even fooled journalists in 2019!)
BERT (Bidirectional Encoder Representations from Transformers)
- Created by: Google AI
- Type: Encoder-only transformer
- Speciality: Text understanding & representations
- Use: Classification, NER, QA
T5 (Text-to-Text Transfer Transformer)
- Created by: Google
- Type: Full encoder-decoder (like original transformer)
- Speciality: Multi-task learning
- Super versatile: Single model handles translation, classification, QA, summarization, regression
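Not part of the original notes: a small sketch of trying these three model families with the Hugging Face `transformers` library (assumes `pip install transformers` plus a PyTorch or TensorFlow backend; the model names are common public checkpoints, not ones these notes specify).

```python
from transformers import pipeline

# BERT-style encoder model, here via the default sentiment-analysis pipeline.
classifier = pipeline("sentiment-analysis")
print(classifier("I am happy"))

# T5 encoder-decoder model treating translation as a text-to-text task.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("I am happy"))

# GPT-2, a decoder-only model, generating a continuation.
generator = pipeline("text-generation", model="gpt2")
print(generator("The students who studied hard", max_new_tokens=20))
```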