Date: 2025-11-04 (Tuesday)
Status: “Done”
The transformer model introduced in the paper “Attention is All You Need” revolutionized NLP. Let’s understand its complete structure.
INPUT SEQUENCE
↓
[Tokenization & Embedding]
↓
[Add Positional Encoding]
↓
┌─────────────────────────────────┐
│ ENCODER (N layers) │
│ ├─ Multi-Head Attention │
│ ├─ Layer Normalization │
│ ├─ Feed-Forward Network │
│ └─ Residual Connections │
└─────────────────────────────────┘
↓
[Context Vectors from Encoder]
↓
┌─────────────────────────────────┐
│ DECODER (N layers) │
│ ├─ Masked Multi-Head Attention │
│ ├─ Encoder-Decoder Attention │
│ ├─ Feed-Forward Network │
│ └─ Layer Normalization │
└─────────────────────────────────┘
↓
[Linear Layer + Softmax]
↓
OUTPUT PROBABILITIES
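For reference, this whole encoder-decoder stack ships in PyTorch as a single module. A minimal sketch with the paper's base-model sizes (assuming PyTorch; tokenization, embedding, and positional encoding still happen outside this module, and the random tensors below are placeholders for already-embedded inputs):

```python
import torch
import torch.nn as nn

# Minimal sketch: the full encoder-decoder stack with the paper's base sizes.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    batch_first=True,
)

src = torch.rand(2, 10, 512)  # (batch, source length, d_model) - already embedded
tgt = torch.rand(2, 7, 512)   # (batch, target length, d_model) - already embedded
out = model(src, tgt)         # (2, 7, 512); a Linear + softmax head turns this into token probabilities
print(out.shape)
```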
Token Embedding: Each word is converted to a dense vector (typically 512-1024 dimensions).
Example:
Word: "happy"
Embedding: [0.2, -0.5, 0.8, ..., 0.1] // 512 values
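In code this is just a lookup into an embedding table. A minimal PyTorch sketch (the vocabulary size and the token id for "happy" are made-up illustrations):

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary of 10,000 tokens, each mapped to a 512-dim vector.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=512)

happy_id = torch.tensor([42])   # assumed token id for "happy" (illustrative only)
vector = embedding(happy_id)    # shape: (1, 512) - the dense vector for "happy"
print(vector.shape)
```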
Positional Encoding:
Problem: Transformers don’t have sequential order built in (unlike RNNs), so we must add positional information explicitly.
Solution: Add positional encoding vectors to embeddings.
Formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where:
- pos = position in sequence (0, 1, 2, ...)
- i = dimension index
- d_model = embedding dimension (512, 1024, etc.)
Intuition: Each dimension is a sinusoid with a different frequency, so every position gets a unique pattern of values, and nearby positions get similar patterns.
Example:
Embedding("I") = [0.2, -0.5, 0.8, 0.0, ..., 0.1]
PE(pos=0) = [0.0, 1.0, 0.0, 1.0, ..., 1.0] // sin(0) = 0, cos(0) = 1
Final = [0.2, 0.5, 0.8, 1.0, ..., 1.1] // element-wise sum
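A direct NumPy translation of the formula (sin on even dimensions, cos on odd ones); it also reproduces the pos = 0 pattern used in the example:

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings, shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]        # positions 0..max_len-1, column vector
    i = np.arange(d_model // 2)[None, :]     # dimension pair index, row vector
    angles = pos / np.power(10000, (2 * i) / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cos
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe[0, :4])   # position 0 -> [0. 1. 0. 1.], matching the example above
```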
Multi-Head Attention: Instead of one attention mechanism, we run h different “heads” in parallel.
Concept:
Input: Query (Q), Key (K), Value (V) matrices
Head 1: ScaledDotProductAttention(Q₁, K₁, V₁)
Head 2: ScaledDotProductAttention(Q₂, K₂, V₂)
...
Head h: ScaledDotProductAttention(Qₕ, Kₕ, Vₕ)
Output = Concatenate(Head₁, Head₂, ..., Headₕ) · W^O (a final linear projection back to d_model)
Why Multiple Heads? Each head can learn a different kind of relationship (e.g., syntactic dependencies, coreference, positional patterns) in its own lower-dimensional subspace.
Typical Configuration: h = 8 heads with d_model = 512, so each head works with d_k = d_v = 512 / 8 = 64 dimensions.
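A minimal sketch using PyTorch's nn.MultiheadAttention with exactly this configuration (8 heads over d_model = 512); the random tensor stands in for an embedded sequence:

```python
import torch
import torch.nn as nn

# 8 heads over d_model = 512 -> each head attends in a 64-dim subspace.
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.rand(2, 10, 512)       # (batch, sequence length, d_model)
# Self-attention: the same tensor is used as query, key, and value.
out, attn_weights = mha(x, x, x)
print(out.shape)           # (2, 10, 512) - heads concatenated, then linearly projected
print(attn_weights.shape)  # (2, 10, 10)  - attention weights (averaged over heads by default)
```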
Residual Connections:
Output = Input + Attention(Input)
This helps gradients flow during training and lets networks go deeper.
Layer Normalization:
Normalized = (x - mean) / sqrt(variance + epsilon)
(followed by a learned scale and shift)
Stabilizes training and speeds up convergence.
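Together these two steps form the "Add & Norm" block that wraps every sub-layer. A minimal sketch of the post-norm variant used in the original paper (the random tensors are placeholders for an input and a sub-layer output):

```python
import torch
import torch.nn as nn

def add_and_norm(x: torch.Tensor, sublayer_out: torch.Tensor,
                 norm: nn.LayerNorm) -> torch.Tensor:
    """Post-norm residual, as in the original paper: LayerNorm(x + Sublayer(x))."""
    return norm(x + sublayer_out)

norm = nn.LayerNorm(512)               # normalizes over the feature dimension
x = torch.rand(2, 10, 512)             # sub-layer input
sublayer_out = torch.rand(2, 10, 512)  # stand-in for an attention or FFN output
y = add_and_norm(x, sublayer_out, norm)
print(y.shape)                         # (2, 10, 512)
```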
After attention, there’s a simple 2-layer feed-forward network:
Output = Linear₂(ReLU(Linear₁(x)))
Typical dimensions:
Input: [batch_size, seq_length, 512]
↓ Linear₁ (512 → 2048)
[batch_size, seq_length, 2048]
↓ ReLU (non-linear)
[batch_size, seq_length, 2048]
↓ Linear₂ (2048 → 512)
[batch_size, seq_length, 512]
This expands then contracts, allowing for non-linear transformations.
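As a module this is just two linear layers with a ReLU in between (512 → 2048 → 512), applied independently at every position; a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

# Position-wise feed-forward network: expand to 2048, ReLU, project back to 512.
ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)

x = torch.rand(2, 10, 512)   # (batch_size, seq_length, 512)
print(ffn(x).shape)          # (2, 10, 512) - same shape, applied per position
```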
Single Encoder Layer:
Input (x)
↓
[Multi-Head Self-Attention]
↓
[+ Residual Connection with input]
↓
[Layer Normalization]
↓
[Feed-Forward Network]
↓
[+ Residual Connection]
↓
[Layer Normalization]
↓
Output
Key Point: In the encoder, each word attends to ALL words in the same sentence (including itself).
The encoder gives: a contextual representation of each word that takes all the other words into account.
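PyTorch packages exactly this sub-layer stack as nn.TransformerEncoderLayer; a minimal sketch that stacks N = 6 of them into the full encoder (the random tensor stands in for embedded input plus positional encoding):

```python
import torch
import torch.nn as nn

# One encoder layer = self-attention + FFN, each wrapped in residual + LayerNorm.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                   dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # N = 6, as in the paper

x = torch.rand(2, 10, 512)   # embedded input + positional encoding
context = encoder(x)         # (2, 10, 512) - one contextual vector per token
print(context.shape)
```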
Decoder is similar but with masking:
Input (shifted right by 1)
↓
[Masked Multi-Head Self-Attention] ← Can only attend to previous positions
↓
[+ Residual + LayerNorm]
↓
[Encoder-Decoder Attention] ← Attends to encoder output
↓
[+ Residual + LayerNorm]
↓
[Feed-Forward Network]
↓
[+ Residual + LayerNorm]
↓
Output
Three Sub-layers in the Decoder:
Masked Self-Attention: each target position attends only to earlier target positions (no peeking at future tokens).
Encoder-Decoder Attention: queries come from the decoder; keys and values come from the encoder output.
Feed-Forward: the same position-wise two-layer network as in the encoder.
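The mask in the first sub-layer is just an upper-triangular matrix of -inf added to the attention scores before the softmax. A minimal PyTorch sketch that builds it and runs a stack of decoder layers against placeholder encoder output:

```python
import torch
import torch.nn as nn

# Causal mask: position i may only attend to positions <= i.
tgt_len = 4
causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt_len)
print(causal_mask)   # 0 on/below the diagonal, -inf above it

layer = nn.TransformerDecoderLayer(d_model=512, nhead=8,
                                   dim_feedforward=2048, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)

tgt = torch.rand(2, tgt_len, 512)   # embedded, shifted target sequence
memory = torch.rand(2, 10, 512)     # placeholder encoder output (context vectors C)
out = decoder(tgt, memory, tgt_mask=causal_mask)
print(out.shape)                    # (2, 4, 512)
```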
Input: "Je suis heureux" (French)
Target: "I am happy" (English)
Encoder Input:
- Tokenize: [Je, suis, heureux]
- Embed each token
- Add positional encoding
- Process through N encoder layers
→ Output: C (context vectors)
Decoder Input:
- Target shifted right: [<START>, I, am]
- Embed each token
- Add positional encoding
- Process through N decoder layers
- Using masked self-attention
- Using encoder-decoder attention on C
→ Output logits for each position
Loss: Compare the predicted tokens at each position ("I", "am", "happy") with the target tokens ("I", "am", "happy") using cross-entropy
Backprop: Update all weights
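A minimal sketch of one teacher-forcing training step with cross-entropy loss. The vocabulary size, token ids, and the tiny 2-layer model are all hypothetical, and positional encoding is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny illustrative seq2seq: embedding + nn.Transformer + output projection.
vocab_size, d_model = 100, 512
embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=8, num_encoder_layers=2,
                             num_decoder_layers=2, batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)

src_ids = torch.tensor([[11, 12, 13]])        # "Je suis heureux" (hypothetical ids)
tgt_ids = torch.tensor([[1, 21, 22, 23, 2]])  # <START> I am happy <END> (1 = <START>, 2 = <END>)

decoder_input = tgt_ids[:, :-1]   # [<START>, I, am, happy] - shifted right
labels = tgt_ids[:, 1:]           # [I, am, happy, <END>]   - what each position must predict

tgt_mask = nn.Transformer.generate_square_subsequent_mask(decoder_input.size(1))
hidden = transformer(embed(src_ids), embed(decoder_input), tgt_mask=tgt_mask)
logits = to_vocab(hidden)                     # (1, 4, vocab_size)

loss = F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1))
loss.backward()                               # backprop: gradients for every weight
```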
Inference (autoregressive generation):
Encoder Input: [Je, suis, heureux]
→ Output: C (context vectors)
Decoder:
Step 1: Start with [<START>]
Predict next word: "I"
Step 2: [<START>, I]
Predict next word: "am"
Step 3: [<START>, I, am]
Predict next word: "happy"
Step 4: [<START>, I, am, happy]
Predict next word: <END>
Output: "I am happy"
| Feature | Benefit |
|---|---|
| No RNN | Fully parallelizable - train on GPUs efficiently |
| Self-Attention in Encoder | Each word gets context from ALL other words |
| Masked Attention in Decoder | Can’t see future tokens - training matches autoregressive generation at inference |
| Positional Encoding | Preserves word order without RNN sequential processing |
| Multi-Head Attention | Learn multiple types of relationships simultaneously |
| Residual Connections | Gradient flow - enables training deep networks |
| Layer Normalization | Stability - faster convergence |