Day 50 - Fine-Tuning Practice
Date: 2025-11-14 (Friday)
Status: "Done"
Fine-Tuning Recipes
Today focuses on practical knobs: which layers to freeze, how to schedule learning rates, and how to evaluate transfer setups.
Freezing vs. Training
- Freeze the lower layers when the dataset is small or the domain is close to the pre-training corpus.
- Unfreeze progressively (top -> bottom) if accuracy plateaus.
- Add a task-specific head (classification, span QA, seq2seq) and train it first; see the sketch after this list.
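A minimal PyTorch sketch of this freezing setup, assuming a BERT-style encoder loaded with Hugging Face Transformers; the attribute path `encoder.encoder.layer`, the number of unfrozen layers, and the 3-label head are illustrative assumptions, not a fixed recipe.

```python
import torch.nn as nn
from transformers import AutoModel

# Assumes a BERT-style model; the layer attribute path differs across architectures.
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Freeze everything first, then selectively unfreeze the top N transformer layers.
for param in encoder.parameters():
    param.requires_grad = False

N_UNFROZEN = 2  # widen this (top -> bottom) if accuracy plateaus
for layer in encoder.encoder.layer[-N_UNFROZEN:]:
    for param in layer.parameters():
        param.requires_grad = True

class Classifier(nn.Module):
    """Task-specific head on top of the (partially frozen) encoder."""

    def __init__(self, encoder, num_labels):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.config.hidden_size, num_labels)  # always trainable

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.head(cls)

model = Classifier(encoder, num_labels=3)  # hypothetical label count
```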
Hyperparameter Basics
- Learning rate: 1e-5 to 3e-5 when the full encoder is trainable; slightly higher when most layers are frozen.
- Warmup steps: ~5% to 10% of total steps to stabilize early training (scheduler sketch below).
- Max sequence length: match the task; chunk long documents for QA.
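A sketch of the optimizer and warmup schedule under these settings, reusing `model` from the freezing sketch above; the step counts, the 2e-5 learning rate, and the 6% warmup fraction are placeholder values inside the ranges listed.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Hypothetical training-size numbers; replace with the real dataset's values.
num_epochs = 3
steps_per_epoch = 500
total_steps = num_epochs * steps_per_epoch
warmup_steps = int(0.06 * total_steps)  # ~5-10% of total steps

# Only optimize the parameters left trainable by the freezing strategy.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5, weight_decay=0.01)

# Linear decay after a short warmup; step the scheduler once per optimizer step.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)
```

Inside the training loop the usual order is optimizer.step(), then scheduler.step(), then optimizer.zero_grad().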
Evaluation Loop
- Track loss plus task metrics (EM/F1 for QA, ROUGE for summarization, accuracy/F1 for classification).
- Early stop on the dev set; keep the best checkpoint, not just the last (loop sketched below).
- Compare a feature-based baseline against a full fine-tune on a small subset.
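One possible shape for that loop on a classification task; `train_one_epoch`, `train_loader`, and `dev_loader` are hypothetical helpers, and accuracy stands in for whichever task metric (EM/F1, ROUGE) actually applies.

```python
import copy
import torch

def evaluate_accuracy(model, dev_loader, device="cpu"):
    """Dev-set accuracy for a classification head; swap in EM/F1 or ROUGE per task."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for input_ids, attention_mask, labels in dev_loader:
            logits = model(input_ids.to(device), attention_mask.to(device))
            correct += (logits.argmax(dim=-1) == labels.to(device)).sum().item()
            total += labels.size(0)
    return correct / max(total, 1)

best_score, best_state = float("-inf"), None
patience, bad_epochs = 2, 0

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer, scheduler)  # hypothetical helper
    score = evaluate_accuracy(model, dev_loader)
    if score > best_score:
        best_score, bad_epochs = score, 0
        best_state = copy.deepcopy(model.state_dict())  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # early stop on the dev set

model.load_state_dict(best_state)  # restore the best checkpoint, not the last
```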
Deployment Considerations
- Distill or quantize for latency if accuracy holds (quantization sketch below).
- Cache the tokenizer and truncation rules to avoid drift between training and serving.
- Log prompts/inputs to debug closed-book vs. context-grounded behavior.
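A rough sketch of post-training dynamic quantization plus saving the tokenizer alongside the model so training and serving share the same preprocessing; the artifact path, MAX_LENGTH, and padding choice are assumptions, and quantized accuracy still has to be re-checked on the task metrics.

```python
import torch
from transformers import AutoTokenizer

# Dynamic int8 quantization of Linear layers (CPU inference); verify metrics before shipping.
model.eval()
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Save the exact tokenizer used at training time and reload it at serving time.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("artifacts/day50-tokenizer")  # hypothetical path
serve_tokenizer = AutoTokenizer.from_pretrained("artifacts/day50-tokenizer")

MAX_LENGTH = 256  # must match the training-time truncation rule
batch = serve_tokenizer(
    ["example request text"],
    truncation=True,
    max_length=MAX_LENGTH,
    padding="max_length",
    return_tensors="pt",
)
with torch.no_grad():
    logits = quantized_model(batch["input_ids"], batch["attention_mask"])
```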
Practice Targets for Today
- Run (or plan) a small grid over learning rate, freezing strategy, and max length (sweep sketch below).
- Evaluate on a held-out set and record EM/F1 or ROUGE.
- Decide post-training steps: distillation, quantization, or caching.
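The grid could be organized roughly like this; `run_experiment` is a hypothetical wrapper around the training and evaluation sketches above, and the grid values are placeholders to adjust.

```python
import itertools

# Placeholder grid; keep it small enough to finish today.
learning_rates = [1e-5, 2e-5, 3e-5]
freeze_strategies = ["head_only", "unfreeze_top_2", "full_fine_tune"]
max_lengths = [128, 256]

results = []
for lr, freeze, max_len in itertools.product(learning_rates, freeze_strategies, max_lengths):
    config = {"lr": lr, "freeze": freeze, "max_length": max_len}
    score = run_experiment(config)  # hypothetical: trains, then returns dev EM/F1 or ROUGE
    results.append((score, config))

# Rank configurations by held-out score and note the top few in the log.
results.sort(key=lambda r: r[0], reverse=True)
for score, config in results[:3]:
    print(f"{score:.3f}  {config}")
```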