Day 38 - Seq2seq Models & LSTM Deep Dive

Date: 2025-10-29 (Wednesday)
Status: Done


Seq2seq Model

Sequence-to-Sequence (Seq2seq) models introduce an encoder-decoder architecture that is effective for tasks like machine translation and text summarization.

Key Features:

  • Maps variable-length sequences to fixed-length memory
  • Inputs and outputs can have different lengths
  • Uses LSTMs or GRUs to mitigate vanishing gradients (exploding gradients are typically handled separately, e.g. with gradient clipping)
  • Encoder takes word tokens as input → hidden state vectors → decoder generates output sequence
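
A minimal PyTorch sketch of this encoder-decoder pipeline (the vocabulary sizes, dimensions, and module names below are illustrative assumptions, not from these notes):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: variable-length input -> fixed-size state -> output sequence."""
    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        # Encoder compresses the whole source sequence into a fixed-size (h, c) memory.
        _, (h, c) = self.encoder(self.src_emb(src_tokens))
        # Decoder starts from that memory and unrolls the target sequence.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_tokens), (h, c))
        return self.out(dec_out)  # logits over the target vocabulary

model = Seq2Seq()
src = torch.randint(0, 1000, (2, 7))   # input length 7
tgt = torch.randint(0, 1000, (2, 5))   # output length 5 (lengths can differ)
print(model(src, tgt).shape)           # torch.Size([2, 5, 1000])
```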

LSTM Architecture: Deep Dive

What is LSTM?

LSTM (Long Short-Term Memory) works like a miniature version of how the human brain manages memory.

LSTM Structure = 3 Gates + 1 Cell State

1. Forget Gate – Deciding What to Forget

Decides what information to discard from the old state.

Formula:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

Brain analogy:

  • Useless messages from someone who ghosted you → forget
  • Formulas you use daily → keep

2. Input Gate – Deciding What to Remember

Decides what new information to add to memory.

Formulas:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
Ĉ_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

Brain analogy:

  • Valuable information → store in long-term memory
  • Irrelevant noise → discard immediately

3. Cell State Update – Long-term Memory

Updates long-term memory by combining forget and input gates.

Formula:

C_t = f_t ⊙ C_{t-1} + i_t ⊙ Ĉ_t

Where:

  • f_t ⊙ C_{t-1} = what to keep from old memory
  • i_t ⊙ Ĉ_t = what to add from new input

4. Output Gate – Deciding What to Output

Decides which memory to use for current output.

Formulas:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)

Brain analogy:

  • When taking an NLP exam → recall LSTM formulas
  • When talking to someone → recall conversation context
  • When doing DevOps → recall AWS specs
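
Tying the four equations together, here is a minimal NumPy sketch of a single LSTM time step (the weight shapes, sizes, and random initialization are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step implementing the four gate equations above.

    W maps the concatenated [h_{t-1}, x_t] to the stacked gates (f, i, C~, o).
    """
    hx = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    H = h_prev.shape[0]
    z = W @ hx + b                          # all four pre-activations at once
    f_t = sigmoid(z[0*H:1*H])               # forget gate
    i_t = sigmoid(z[1*H:2*H])               # input gate
    c_tilde = np.tanh(z[2*H:3*H])           # candidate memory C~_t
    o_t = sigmoid(z[3*H:4*H])               # output gate
    c_t = f_t * c_prev + i_t * c_tilde      # C_t = f ⊙ C_{t-1} + i ⊙ C~_t
    h_t = o_t * np.tanh(c_t)                # h_t = o ⊙ tanh(C_t)
    return h_t, c_t

H, X = 4, 3                                 # hidden and input sizes (arbitrary)
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4*H, H+X)), np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, X)):           # run 5 time steps
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)                     # (4,) (4,)
```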

LSTM vs Human Brain

Human Brain → LSTM

  • Long-term memory → Cell State
  • Filter out unnecessary information → Forget Gate
  • Accept new valuable information → Input Gate
  • Retrieve appropriate memory to respond → Output Gate
  • Learn from sequential experiences → RNN backbone
  • Don’t forget quickly → Long-term dependencies

What is a Gate?

Gate = cognitive filter

Each gate = a mechanism that decides “keep or discard”

Example: When You Study NLP

  • Forget Gate: “Do I still need to remember this outdated method?” → Discard if no
  • Input Gate: “Is this new concept valuable?” → Store if yes
  • Output Gate: “What knowledge do I need right now?” → Retrieve relevant parts

Hidden State Limitations

The hidden state doesn’t have a hard token limit, but it does have a capacity limit on how much it can effectively remember.

Mathematical Perspective:

  • Hidden state = fixed-size vector (e.g., 128, 256, 512 dimensions)
  • Can process 10 tokens or 10,000 tokens → won’t crash
  • Problem: can’t remember everything
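
A quick PyTorch sketch of this point (the dimensions are arbitrary): the final hidden state has the same fixed shape whether the LSTM reads 10 tokens or 10,000:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=256, batch_first=True)

for seq_len in (10, 10_000):
    x = torch.randn(1, seq_len, 32)   # batch of 1, arbitrary input features
    _, (h, c) = lstm(x)
    print(seq_len, h.shape)           # hidden state is (1, 1, 256) either way
```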

Why?

  • Even with cell state, gradients weaken over many time steps
  • Long-term dependencies get lost
  • Tokens far from the start have weak influence on final output

Solution: this is why we need the Attention mechanism!


Throttling in NLP

Two Meanings of Throttling:

1. System-Level Throttling (API)

Limiting request rate or token processing to:

  • Protect GPU resources
  • Distribute resources fairly
  • Avoid server overload
  • Control costs

Examples (illustrative figures; actual quotas vary by provider and account tier):

  • OpenAI GPT: 10 requests/second, 90k tokens/min
  • Anthropic Claude: 20 requests/second
  • HuggingFace: timeout if generation takes too long
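
One common way to implement such limits is a token bucket; here is a minimal sketch (the rate and capacity values are arbitrary, not any provider’s real quota):

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate=10.0, capacity=10):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # throttled: caller should back off and retry later

bucket = TokenBucket(rate=10.0, capacity=10)
allowed = sum(bucket.allow() for _ in range(100))
print(f"{allowed}/100 requests allowed in one burst")  # ~10, the bucket capacity
```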

2. Model-Level Throttling (Architecture)

LSTM, Transformer, and Attention all have mechanisms to limit information processing at any given time:

(A) LSTM Throttling → Forget Gate

When the sequence is too long:

  • The forget gate automatically “throttles” old information
  • Only part of the meaning is allowed to pass through
  • Like network throttling: “overload → reduce bandwidth → drop packets”

(B) Transformer Throttling → Context Window Limit

  • BERT: 512 tokens
  • GPT-3: 2,048 tokens (GPT-3.5: 4k-16k)
  • GPT-4: 8k-32k tokens (GPT-4 Turbo / GPT-4o: 128k)
  • Claude 3.5 Sonnet: 200k tokens

When the input exceeds the limit:

  • The model truncates the input
  • Or refuses to process it
  • Or output quality degrades
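
For instance, with the HuggingFace transformers tokenizer you can truncate input to a model’s window explicitly (the model name and lengths here are just examples):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_text = "token " * 5000                 # far beyond BERT's 512-token window
enc = tokenizer(long_text, truncation=True, max_length=512)
print(len(enc["input_ids"]))                # 512: everything past the limit is cut
```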

(C) Attention Throttling → Sparse Attention

In long-context models (Longformer, BigBird, Mistral):

  • Computing full n² attention is infeasible
  • So the model only attends to important regions (local attention)
  • Or to a few global tokens
  • Or uses a sliding window
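
A NumPy sketch of the sliding-window idea: each token may only attend to neighbors within a fixed window, so the mask has O(n·w) ones instead of n² (the window size is an arbitrary choice):

```python
import numpy as np

def sliding_window_mask(n, window=2):
    """True where attention is allowed: each token sees +/- `window` neighbors."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(8, window=2)
print(mask.astype(int))
# Each row has at most 2*window+1 ones, so attention cost grows
# linearly in sequence length instead of quadratically.
```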

(D) Token Generation Throttling

Some decoders will:

  • Slow down token generation
  • Limit sampling
  • Apply temperature control
  • Cut beam search

When input is noisy or uncertain, this acts like a brake: “Not sure → slow down generation → increase quality”
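
A minimal sketch of temperature control during sampling (the logits here are made up):

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=np.random.default_rng(0)):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # softmax, numerically stable
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs), probs

logits = np.array([2.0, 1.0, 0.5, 0.1])     # fake next-token scores
for t in (0.2, 1.0, 2.0):
    _, probs = sample_with_temperature(logits, temperature=t)
    print(t, probs.round(3))                # low t -> near-greedy, high t -> flatter
```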


Summary

LSTM is not just a model — it’s a computational mimicry of how human memory works. Understanding gates helps you understand why certain information persists while other information fades away.