Date: 2025-10-31 (Friday)
Status: “Done”
BLEU (Bilingual Evaluation Understudy) is an algorithm designed to evaluate machine translation quality.
Core Concept: Compare candidate translation to one or more reference translations (often human translations)
Score Range: 0 to 1
Example: Candidate repeats one common word four times, e.g., “the the the the”; Reference is an ordinary sentence containing that word twice, e.g., “the cat is on the mat”
Process: For each of the 4 candidate words, check whether it appears anywhere in the reference; every “the” matches
Result: 4/4 = 1.0 (perfect score!)
Problem: This translation is terrible but gets a perfect score! A model that just outputs common words will score well (a short calculation sketch follows).
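A minimal Python sketch of this naive unigram precision (the sentences are just the illustrative ones from the example above):

```python
def naive_unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that appear anywhere in the reference.
    No clipping, so a repeated word is counted every time it occurs."""
    cand_words = candidate.lower().split()
    ref_words = set(reference.lower().split())
    matches = sum(1 for w in cand_words if w in ref_words)
    return matches / len(cand_words)

# A degenerate candidate that just repeats a common word still scores perfectly
print(naive_unigram_precision("the the the the", "the cat is on the mat"))  # 1.0
```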
Key Change: After matching a word, exhaust it from the reference (each reference occurrence can only be matched once)
Same Example: the reference contains “the” only twice, so only 2 of the 4 candidate “the”s can be matched
Result: 2/4 = 0.5 (more realistic!)
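A sketch of the clipped (modified) precision using collections.Counter, so each word only counts up to the number of times it appears in a reference:

```python
from collections import Counter

def clipped_unigram_precision(candidate: str, references: list[str]) -> float:
    """Modified unigram precision: each candidate word's count is clipped to the
    largest number of times it appears in any single reference."""
    cand_counts = Counter(candidate.lower().split())
    max_ref_counts: Counter = Counter()
    for ref in references:
        for word, count in Counter(ref.lower().split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word]) for word, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

print(clipped_unigram_precision("the the the the", ["the cat is on the mat"]))  # 0.5
```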
❌ Doesn’t consider semantic meaning
❌ Doesn’t consider sentence structure
✅ Still most widely adopted metric despite limitations
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
| Metric | Focus | Calculation |
|---|---|---|
| BLEU | Precision | How many candidate words in reference? |
| ROUGE | Recall | How many reference words in candidate? |
Example: score the candidate against each available reference translation
Process for Reference 1: count how many of Reference 1’s words also appear in the candidate; 2 of its 5 words do
ROUGE score for Ref 1: 2/5 = 0.4
If multiple references: Calculate for each, take maximum
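A sketch of unigram ROUGE recall with the take-the-maximum rule for multiple references. Real ROUGE implementations also include n-gram (ROUGE-2) and longest-common-subsequence (ROUGE-L) variants, which are not shown here; the example sentences are invented for illustration:

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference words that also appear in the candidate
    (counts are clipped so repeated words don't inflate the score)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

def rouge1_multi(candidate: str, references: list[str]) -> float:
    """With multiple references, score against each one and keep the maximum."""
    return max(rouge1_recall(candidate, ref) for ref in references)

print(rouge1_multi("the cat sat on the mat",
                   ["the cat is on the mat", "a cat sat on a mat"]))  # ~0.83
```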
Since BLEU is precision-oriented and ROUGE is recall-oriented, we can combine them into an F1 score:
Formula:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F1 = 2 × (BLEU × ROUGE) / (BLEU + ROUGE)
Example: using the scores above, F1 = 2 × (0.5 × 0.4) / (0.5 + 0.4) ≈ 0.44
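The same combination in code; the 0.5 and 0.4 are the clipped-precision and ROUGE values computed earlier in these notes:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of a precision-style score (BLEU) and a recall-style score (ROUGE)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.5, 0.4), 3))  # 0.444
```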
Problem: Taking the highest-probability word at each step (greedy decoding) doesn’t guarantee the best overall sequence
Solution: Beam search keeps a fixed number of candidate sequences at each step and searches for the most likely overall sequence
Beam Width (B): Number of sequences to keep at each step
Get probabilities for the first word from the model
Keep top B=2: “I” and “am”
For “I”: score every possible second word and multiply by P(“I”) to get joint probabilities
For “am”: do the same with P(“am”)
Keep top B=2 two-word sequences: “am I” (0.28) and “I am” (0.25)
Continue until all B sequences reach EOS token
Choose sequence with highest overall probability
Advantages: Better output quality than greedy decoding
Disadvantages: Extra memory and compute (B sequences are expanded in parallel); raw probabilities favor shorter sequences
Solution for Long Sequences: Normalize by length: divide the sum of log probabilities by the number of words, so longer sequences aren’t unfairly penalized
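A minimal beam-search sketch over a toy next-word table. The table and its probabilities are invented, but the two-word joint probabilities are chosen to reproduce the “am I” (0.28) / “I am” (0.25) step from the example, and the final ranking uses length-normalized log probability as just described. This is a sketch of the idea, not how production decoders are implemented:

```python
import math

def beam_search(next_word_probs, beam_width=2, max_len=10, eos="<eos>"):
    """next_word_probs(prefix) -> {word: probability} for the next word.
    Returns the finished sequence with the best length-normalized log probability."""
    beams = [([], 0.0)]            # (sequence_of_words, sum_of_log_probs)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for word, p in next_word_probs(seq).items():
                candidates.append((seq + [word], logp + math.log(p)))
        # Keep only the top-B partial sequences by joint log probability
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam_width]:
            (finished if seq[-1] == eos else beams).append((seq, logp))
        if not beams:
            break
    if not finished:               # fall back if nothing reached EOS in time
        finished = beams
    # Length normalization: rank by average log probability per word
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]

# Toy next-word distributions (hypothetical numbers, not from a real model)
TABLE = {
    (): {"I": 0.4, "am": 0.35, "you": 0.25},
    ("I",): {"am": 0.625, "<eos>": 0.375},   # 0.4 * 0.625 = 0.25 -> "I am"
    ("am",): {"I": 0.8, "<eos>": 0.2},       # 0.35 * 0.8  = 0.28 -> "am I"
    ("I", "am"): {"here": 0.7, "<eos>": 0.3},
    ("am", "I"): {"here": 0.5, "<eos>": 0.5},
}

def toy_model(prefix):
    return TABLE.get(tuple(prefix), {"<eos>": 1.0})

print(beam_search(toy_model))  # ['I', 'am', 'here', '<eos>']
```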
Concept: Generate multiple samples and find consensus
Create ~30 random samples from the model
For each sample, compare it against all the others using a similarity metric (e.g., ROUGE)
For each candidate, compute its average similarity with all the other candidates
Choose the sample with highest average similarity (lowest risk)
E* = argmax_E [ average of ROUGE(E, E') over all other samples E' ]
Where: E ranges over the generated samples and E' over the remaining samples; E* is the consensus sample with the highest average similarity
Step 1: Calculate the ROUGE score between candidate C1 and every other candidate, then average them
Step 2: Repeat for C2, C3, C4
Step 3: Select the candidate with the highest average (see the sketch below)
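A sketch of the MBR selection step, using unigram ROUGE recall as the similarity metric; the four candidate strings are hypothetical, invented to show that an off-topic outlier never wins the consensus:

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Unigram overlap, measured as recall of the reference's words."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(n, cand[w]) for w, n in ref.items())
    return overlap / sum(ref.values())

def mbr_select(samples: list[str]) -> str:
    """Return the sample with the highest average similarity to all other samples."""
    best, best_score = samples[0], -1.0
    for i, candidate in enumerate(samples):
        others = [s for j, s in enumerate(samples) if j != i]
        score = sum(rouge1_recall(candidate, other) for other in others) / len(others)
        if score > best_score:
            best, best_score = candidate, score
    return best

# Hypothetical candidates C1..C4 (invented for illustration)
candidates = [
    "the cat sat on the mat",        # C1
    "a cat is sitting on the mat",   # C2
    "the cat is on the mat",         # C3
    "dogs love to play fetch",       # C4: an off-topic outlier
]
print(mbr_select(candidates))  # "the cat is on the mat": the consensus wins, not the outlier
```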
Advantages: High output quality; the consensus choice is robust to individual bad samples
Disadvantages: Very expensive (many samples plus roughly O(N²) pairwise metric computations)
When to Use: when output quality matters more than latency or compute cost
| Method | Description | Pros | Cons |
|---|---|---|---|
| Greedy | Pick highest prob at each step | Fast, simple | Suboptimal sequences |
| Beam Search | Keep top-B sequences | Better quality | Memory + compute cost |
| Random Sampling | Sample from distribution | Diverse outputs | Inconsistent quality |
| MBR | Consensus from samples | High quality | Very expensive |
| Metric | Type | Focus | Best For |
|---|---|---|---|
| BLEU | Precision | Candidate → Reference | General MT |
| ROUGE | Recall | Reference → Candidate | Summarization |
| F1 | Harmonic Mean | Both precision & recall | Balanced view |
Critical Note: All these metrics rely on surface word overlap; they don’t capture semantic meaning or sentence structure
Modern Alternative: Use neural metrics or human evaluation for critical applications!