Day 41 - RNN Problems & Why Transformers Are Needed

Date: 2025-11-03 (Monday)
Status: Done


The Problem with RNNs: Sequential Processing Bottleneck

RNNs dominated NLP for years, but they have fundamental limitations that transformers solve. Let’s explore these issues.

Problem 1: Sequential Computation

How RNNs Process Information

RNNs must process inputs one step at a time, sequentially:

Translation Example (English → French), processing the source sentence:

Input: "I am happy"

Time Step 1: Process "I"
Time Step 2: Process "am" 
Time Step 3: Process "happy"

Impact:

  • If your sentence has 5 words → 5 sequential steps required
  • If your sentence has 1000 words → 1000 sequential steps required
  • Cannot parallelize! Must wait for step t-1 before computing step t (see the sketch after this list)
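
A minimal numpy sketch (toy sizes and random weights, purely illustrative) makes the dependency explicit: each hidden state needs the previous one, so the loop cannot be parallelized across time steps.

```python
import numpy as np

# Vanilla RNN forward pass (toy sizes, random weights; illustrative only).
rng = np.random.default_rng(0)
T, d_in, d_h = 5, 8, 16                    # sequence length, input dim, hidden dim
x = rng.normal(size=(T, d_in))             # one input vector per time step
W_xh = 0.1 * rng.normal(size=(d_in, d_h))  # input-to-hidden weights
W_hh = 0.1 * rng.normal(size=(d_h, d_h))   # hidden-to-hidden weights

h = np.zeros(d_h)
for t in range(T):
    # Step t cannot start until h from step t-1 exists: inherently sequential.
    h = np.tanh(x[t] @ W_xh + h @ W_hh)
print(h.shape)  # (16,): final hidden state after T sequential steps
```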

Why This Matters

  • Modern GPUs and TPUs are designed for parallel computation
  • Sequential RNNs cannot take advantage of this parallelization
  • Training becomes much slower than necessary
  • Longer sequences mean proportionally longer training time per example

Problem 2: Vanishing Gradient Problem

The Root Cause

When RNNs backpropagate through many time steps, gradients get multiplied repeatedly:

Gradient Flow Through T Steps:

∂Loss/∂h₀ = ∂Loss/∂hₜ × (∂hₜ/∂hₜ₋₁) × (∂hₜ₋₁/∂hₜ₋₂) × ... × (∂h₁/∂h₀)

If each factor ‖∂hᵢ/∂hᵢ₋₁‖ < 1 (which it often is):

  • After T multiplications the product collapses: for a per-step factor of 0.5 and T = 100, 0.5^100 ≈ 8 × 10⁻³¹ (see the numeric check after this list)
  • The gradient effectively vanishes to zero
  • The model can’t learn long-range dependencies
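
Plugging in numbers makes the collapse obvious (0.5 is just the illustrative per-step factor from above, not a measured value):

```python
# Repeated multiplication by a per-step factor < 1 shrinks the gradient fast.
factor = 0.5
for T in (10, 50, 100):
    print(f"T={T:3d}  gradient scale ≈ {factor ** T:.2e}")
# T= 10  gradient scale ≈ 9.77e-04
# T= 50  gradient scale ≈ 8.88e-16
# T=100  gradient scale ≈ 7.89e-31   <- effectively zero
```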

Concrete Example

Sentence: “The students who studied hard… passed the exam”

  • Early word “students” needs to influence prediction of “passed”
  • But gradient has vanished by the time it reaches “students”
  • Model fails to learn this relationship!

Current Workarounds

LSTMs and GRUs mitigate this with gating mechanisms (a one-step sketch follows this list), but:

  • Still have issues with very long sequences (>100-200 words)
  • Cannot fully solve the problem
  • Still require sequential processing
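
For reference, here is a one-step LSTM sketch (toy sizes, random weights; a simplification of the standard formulation). The gates make the cell-state update additive, which helps gradients survive longer, but every step still waits on the previous one.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step: gated, but still sequential."""
    z = x @ W + h @ U + b                 # pre-activations for all four gates
    f, i, o, g = np.split(z, 4)           # forget, input, output, candidate
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # additive cell update
    h = sigmoid(o) * np.tanh(c)           # gated hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W = 0.1 * rng.normal(size=(d_in, 4 * d_h))
U = 0.1 * rng.normal(size=(d_h, 4 * d_h))
b = np.zeros(4 * d_h)
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```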

Problem 3: Information Bottleneck

The Compression Issue

Sequence-to-Sequence Architecture:

Encoder: Word₁ … Wordₜ → h₁ → h₂ → … → hₜ (final hidden state)
Decoder: hₜ → Word₁' → Word₂' → Word₃' → ...

The Bottleneck: All information from the entire input sequence is compressed into a single vector hₜ (the final hidden state).
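
In code, the bottleneck is simply the encoder’s return value: no matter how long the input, the decoder receives one fixed-size vector (again a toy sketch with made-up sizes).

```python
import numpy as np

def encode(x, W_xh, W_hh):
    """Run the RNN over the whole input, return ONLY the final hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x_t in x:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h  # the entire sentence, squeezed into one vector

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W_xh = 0.1 * rng.normal(size=(d_in, d_h))
W_hh = 0.1 * rng.normal(size=(d_h, d_h))

for T in (5, 50, 500):                      # short or long input...
    context = encode(rng.normal(size=(T, d_in)), W_xh, W_hh)
    print(T, context.shape)                 # ...always a (16,) context vector
```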

Why This Fails

Example Sentence: “The government of the United States of America announced…”

When encoding this nine-word sentence:

  • The first word, “The”, is processed
  • Its information flows through the states: h₁ → h₂ → h₃ → … → h₉
  • By h₉, information about “The” has been diluted or overwritten
  • Only h₉ is passed to the decoder
  • The decoder has limited context about early words

The Consequence

  • Long sequences lose information
  • Important early context gets diluted
  • Model struggles with long documents
  • Translation quality degrades for long sentences

Summary: Why RNNs Have Fundamental Issues

Issue                  | Impact                               | Current Solution
Sequential processing  | Cannot parallelize; slow training    | None (fundamental to RNN design)
Vanishing gradients    | Can’t learn long-range dependencies  | LSTM/GRU gates (partial fix)
Information bottleneck | Early information gets lost          | Attention mechanism (partial fix)

The Transformer Solution: “Attention is All You Need”

Introduced in 2017 by Google researchers (Vaswani et al.), transformers completely replace RNNs with attention mechanisms.

Key Differences

Aspect           | RNNs                                 | Transformers
Processing       | Sequential (one word at a time)      | Parallel (all words at once)
Dependency       | Each word depends on previous state  | All words attend to all words
Training speed   | Slow (sequential)                    | Fast (parallel)
Long sequences   | Vanishing gradients                  | No sequential bottleneck
Long-range deps. | Difficult                            | Easy (direct attention)

How Transformers Work (Brief Overview)

  1. No RNN: Remove sequential hidden states completely
  2. Pure Attention: Let each word “attend to” every other word
  3. Positional Encoding: Add position information, since attention by itself is order-agnostic
  4. Parallel Processing: Process entire sequence at once

Example:

Sentence: "I am happy"

Instead of:
Step 1: Process "I" → h₁
Step 2: Process "am" with h₁ → h₂
Step 3: Process "happy" with h₂ → h₃

Transformer Does:
Parallel: "I" attends to {"I", "am", "happy"}
Parallel: "am" attends to {"I", "am", "happy"}
Parallel: "happy" attends to {"I", "am", "happy"}
All at once! 
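
A minimal self-attention sketch over the same sentence (toy embeddings; positional encodings and multi-head details omitted) shows how every token scores every other token in one matrix multiply:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["I", "am", "happy"]
d = 8
X = rng.normal(size=(len(tokens), d))      # one toy embedding per token

# Project embeddings into queries, keys, and values.
W_q, W_k, W_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention: a single (3, 3) score matrix covers every
# word-to-word pair at once; no sequential loop over time steps.
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
out = weights @ V                          # updated representations, in parallel

print(weights.round(2))  # row i: how much token i attends to each token
```

In a real transformer this runs per attention head with learned projections, plus positional encodings so the model still knows word order.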

Why Everyone Talks About Transformers

  • Speed: Trains much faster on GPUs/TPUs (parallel computation)
  • Scalability: Handles very long sequences (no single-vector bottleneck)
  • Long-range: Direct attention sidesteps the vanishing-gradient problem
  • Versatility: Works for translation, classification, QA, summarization, chatbots…
  • Performance: Achieves state-of-the-art results on nearly every NLP task


Applications of Transformers

Transformers are used for:

  1. Translation (Neural Machine Translation) - High quality, fast
  2. Text Summarization (Abstractive & Extractive)
  3. Named Entity Recognition (NER) - Identify entities better
  4. Question Answering - Understand context better
  5. Chatbots & Voice Assistants
  6. Sentiment Analysis - Understand emotion/opinion
  7. Auto-completion - Smart suggestions
  8. Classification - Classify text into categories
  9. Market Intelligence - Analyze market sentiment

State-of-the-Art Transformer Models

GPT-2 (Generative Pre-trained Transformer 2)

  • Created by: OpenAI
  • Type: Decoder-only transformer
  • Speciality: Text generation
  • Famous for: Generating human-like text; OpenAI initially staged its 2019 release over misuse concerns

BERT (Bidirectional Encoder Representations from Transformers)

  • Created by: Google AI
  • Type: Encoder-only transformer
  • Speciality: Text understanding & representations
  • Use: Classification, NER, QA

T5 (Text-to-Text Transfer Transformer)

  • Created by: Google
  • Type: Full encoder-decoder (like original transformer)
  • Speciality: Multi-task learning
  • Super versatile: Single model handles translation, classification, QA, summarization, regression