Date: 2025-11-05 (Wednesday)
Status: “Done”
This is the heart and soul of transformers. Understanding this deeply is critical.
Attention(Q, K, V) = softmax((Q × K^T) / √d_k) × V

Where:
- Q (queries), K (keys), V (values) are matrices built from the input embeddings via learned projections
- d_k is the dimension of the key vectors, used for scaling
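As a minimal sketch in NumPy (the function name and shapes are my own choices, not from any particular library), the formula maps almost line for line onto code:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d_k). Returns (context, attention weights)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of every query with every key
    # row-wise softmax (subtracting the max first for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                     # weighted combination of the values
```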
Let’s compute attention for a simple example:
Input Sentence: “I am happy”
Step 1: Create Word Embeddings
I: [0.1, 0.2, 0.3]
am: [0.4, 0.5, 0.6]
happy: [0.7, 0.8, 0.9]
Step 2: Convert to Q, K, V

In practice, we learn linear projections:
Q = Embedding × W_q
K = Embedding × W_k
V = Embedding × W_v
For simplicity, let’s say Q = K = V = the embedding matrix:

[0.1, 0.2, 0.3]
[0.4, 0.5, 0.6]
[0.7, 0.8, 0.9]
(In reality, Q, K, V would be different projections, but this shows the concept)
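Here is a small sketch of what Step 2 would look like with real projections; the random W_q, W_k, W_v below are stand-ins for weights that would actually be learned during training:

```python
import numpy as np

X = np.array([[0.1, 0.2, 0.3],   # "I"
              [0.4, 0.5, 0.6],   # "am"
              [0.7, 0.8, 0.9]])  # "happy"

rng = np.random.default_rng(0)
d_model, d_k = 3, 3
W_q = rng.normal(size=(d_model, d_k))   # learned in practice, random here
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v     # each row is one word's query / key / value
```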
Step 3: Compute Q × K^T (dot products)
Q × K^T = [0.1, 0.2, 0.3]   [0.1, 0.4, 0.7]
          [0.4, 0.5, 0.6] × [0.2, 0.5, 0.8]
          [0.7, 0.8, 0.9]   [0.3, 0.6, 0.9]
Q₁·K₁ = 0.1×0.1 + 0.2×0.2 + 0.3×0.3 = 0.01 + 0.04 + 0.09 = 0.14
Q₁·K₂ = 0.1×0.4 + 0.2×0.5 + 0.3×0.6 = 0.04 + 0.10 + 0.18 = 0.32
Q₁·K₃ = 0.1×0.7 + 0.2×0.8 + 0.3×0.9 = 0.07 + 0.16 + 0.27 = 0.50
Result matrix:
[0.14, 0.32, 0.50]
[0.32, 0.77, 1.22]
[0.50, 1.22, 1.94]
Interpretation: entry (i, j) is the dot-product similarity between word i's query and word j's key, so a larger value means word i should pay more attention to word j. The scores grow toward the bottom-right here simply because the later embeddings have larger components.
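Since Q = K = V in this simplification, the raw scores are just the embedding matrix times its own transpose; a quick NumPy check reproduces the matrix above:

```python
import numpy as np

E = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9]])
print(np.round(E @ E.T, 2))
# [[0.14 0.32 0.5 ]
#  [0.32 0.77 1.22]
#  [0.5  1.22 1.94]]
```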
Step 4: Scale by √d_k
d_k = 3 (embedding dimension), so √d_k = √3 ≈ 1.73
Scaled matrix = Q×K^T / √3:
[0.14/1.73, 0.32/1.73, 0.50/1.73]   [0.08, 0.18, 0.29]
[0.32/1.73, 0.77/1.73, 1.22/1.73] = [0.18, 0.44, 0.70]
[0.50/1.73, 1.22/1.73, 1.94/1.73]   [0.29, 0.70, 1.12]
Why scale? The variance of a dot product grows with d_k, so in high dimensions the raw scores become large, softmax saturates (one weight near 1, the rest near 0), and the gradients through it become tiny. Dividing by √d_k keeps the scores in a range where softmax stays soft and gradients stay healthy.
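A quick, illustrative demonstration of that saturation effect (the d_k = 512 and the random vectors are arbitrary choices, not part of the worked example):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=d_k)            # one query vector
keys = rng.normal(size=(10, d_k))   # ten key vectors

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

raw = keys @ q                                  # dot products; std grows like sqrt(d_k) ≈ 22.6
print(softmax(raw).round(3))                    # typically near one-hot: one weight ≈ 1, rest ≈ 0
print(softmax(raw / np.sqrt(d_k)).round(3))     # scaled: a much softer distribution
```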
Step 5: Apply Softmax
Softmax converts scores to probabilities (sum to 1):
softmax(x) = exp(x) / sum(exp(x))
For first row [0.08, 0.18, 0.29]:
exp(0.08) ≈ 1.083
exp(0.18) ≈ 1.197
exp(0.29) ≈ 1.336
Sum ≈ 3.616
Probabilities:
[1.083/3.616, 1.197/3.616, 1.336/3.616] ≈ [0.30, 0.33, 0.37]
Softmax weights (attention matrix), computed the same way for each row:

[0.30, 0.33, 0.37]
[0.25, 0.33, 0.42]
[0.21, 0.31, 0.48]
Interpretation: row i shows how word i spreads its attention over "I", "am", and "happy". Every word attends most to "happy", and the effect strengthens down the rows, because with Q = K = V the embedding with the largest values produces the largest dot products.
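As a sanity check on Step 5 (same Q = K = V simplification), a few lines of NumPy reproduce this attention matrix:

```python
import numpy as np

E = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9]])
scores = E @ E.T / np.sqrt(3)                    # Steps 3-4
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
print(weights.round(2))
# [[0.3  0.33 0.37]
#  [0.25 0.33 0.42]
#  [0.21 0.31 0.48]]
```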
Step 6: Multiply by Value Matrix (V)
Context = Softmax_weights × V
Context(for "I"): 0.30×[0.1,0.2,0.3] + 0.33×[0.4,0.5,0.6] + 0.37×[0.7,0.8,0.9]
= [0.03,0.06,0.09] + [0.13,0.17,0.20] + [0.26,0.30,0.33]
= [0.42, 0.53, 0.62]
Context(for "am"): 0.26×[0.1,0.2,0.3] + 0.37×[0.4,0.5,0.6] + 0.37×[0.7,0.8,0.9]
= [0.026,0.052,0.078] + [0.148,0.185,0.222] + [0.259,0.296,0.333]
= [0.433, 0.533, 0.633]
Context(for "happy"): 0.25×[0.1,0.2,0.3] + 0.36×[0.4,0.5,0.6] + 0.39×[0.7,0.8,0.9]
= [0.025,0.05,0.075] + [0.144,0.18,0.216] + [0.273,0.312,0.351]
= [0.442, 0.542, 0.642]
Output Context Matrix:
[0.42, 0.52, 0.62]
[0.45, 0.55, 0.65]
[0.48, 0.58, 0.68]
Each word now has a context vector that combines information from all words weighted by attention scores!
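Putting Steps 3–6 together, the whole worked example fits in a short NumPy snippet (same Q = K = V simplification as above):

```python
import numpy as np

E = np.array([[0.1, 0.2, 0.3],   # "I"
              [0.4, 0.5, 0.6],   # "am"
              [0.7, 0.8, 0.9]])  # "happy"

scores = E @ E.T / np.sqrt(E.shape[-1])           # Steps 3-4: scaled dot products
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)     # Step 5: row-wise softmax
context = weights @ E                             # Step 6: weighted sum of values (V = E here)
print(context.round(2))
# [[0.42 0.52 0.62]
#  [0.45 0.55 0.65]
#  [0.48 0.58 0.68]]
```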
| Aspect | Reason |
|---|---|
| Dot Product | Efficient similarity measure (just matrix multiplication) |
| Scaling | Prevents softmax saturation (keeps gradients healthy) |
| Softmax | Converts similarities to normalized weights [0,1] |
| Multiply by V | Gets actual information (weighted combination) |
Instead of one attention head, we use h = 8 (or more) attention heads:
MultiHeadAttention(Q, K, V) = Concat(Head₁, ..., Head₈) × W_o
Where:
Head_i = Attention(Q × W_q^i, K × W_k^i, V × W_v^i)
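A compact NumPy sketch of that formula; the dimensions, random weights, and absence of masking/batching are simplifications for illustration, not how a production implementation would look:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads=8, d_model=64, seed=0):
    """X: (seq_len, d_model). Returns (seq_len, d_model)."""
    rng = np.random.default_rng(seed)
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        W_q = rng.normal(size=(d_model, d_k))   # per-head projections W_q^i, W_k^i, W_v^i
        W_k = rng.normal(size=(d_model, d_k))   # (random stand-ins for learned weights)
        W_v = rng.normal(size=(d_model, d_k))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_k))   # scaled dot-product attention per head
        heads.append(weights @ V)                   # (seq_len, d_k)
    W_o = rng.normal(size=(d_model, d_model))       # output projection
    return np.concatenate(heads, axis=-1) @ W_o     # Concat(Head_1, ..., Head_h) × W_o

X = np.random.default_rng(1).normal(size=(9, 64))   # 9 tokens, d_model = 64
print(multi_head_attention(X).shape)                # (9, 64)
```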
Different heads learn different relationships:
Example:
Sentence: "The quick brown fox jumps over the lazy dog"
Head 1 (subject-verb):
- "fox" → "jumps": 0.9
- "dog" → has_property: 0.7
Head 2 (adjective-noun):
- "quick" → "fox": 0.85
- "brown" → "fox": 0.8
- "lazy" → "dog": 0.9
Head 3 (spatial):
- "over" → connects "fox" and "dog": 0.8
All these different perspectives combined give rich contextual understanding.
Why is scaled dot-product attention so efficient? Because the whole computation is a handful of dense matrix multiplications plus a softmax, operations GPUs are built to run fast, and every position is processed in parallel.
Comparison: an RNN must walk through the sequence one token at a time (n sequential steps), while self-attention processes all n tokens in a single parallel pass; the trade-off is the n × n score matrix, so the cost grows quadratically with sequence length.
✅ Attention is learned: W_q, W_k, W_v are trainable parameters
✅ Position-free: no sequential dependencies, so any token can attend to any other, regardless of distance
✅ Interpretable: attention weights can be visualized to see which words attend to which
✅ Efficient: uses only matrix operations (GPU-friendly)
Now that we understand scaled dot-product attention, we’ll explore: