Week 10 - Transfer Learning, BERT & T5
Week: 2025-11-10 to 2025-11-14
Status: In progress
Week 10 Overview
This week dives into transfer learning for NLP and how modern QA systems benefit from pre-trained transformers. We will contrast classical training with feature-based reuse and fine-tuning, then study two flagship models: BERT for bidirectional context and T5 for text-to-text multitask learning, along with practical QA setups (context-based vs. closed-book).
Key Topics
Transfer Learning Fundamentals
- Classical vs. transfer learning pipelines
- Reusing pre-trained weights to speed convergence
- Feature-based representations vs. fine-tuning (see the sketch after this list)
- Benefits: faster training, better predictions, less labeled data
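A minimal sketch of the feature-based path, assuming the Hugging Face `transformers` library and the public `bert-base-uncased` checkpoint: the pre-trained encoder stays frozen and only a small task head on top of its [CLS] embeddings is trained. Swapping the `torch.no_grad()` block for a normal forward pass (and giving the optimizer both the encoder's and the head's parameters, usually at a much smaller learning rate such as 2e-5) turns the same sketch into full fine-tuning.

```python
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

# Load a pre-trained encoder and its tokenizer (checkpoint name is an example).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()  # feature-based reuse: the backbone stays frozen

# A tiny task head trained on top of the frozen features.
classifier = nn.Linear(encoder.config.hidden_size, 2)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)

texts = ["great movie", "terrible plot"]   # toy labeled examples
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, return_tensors="pt")
with torch.no_grad():                      # no gradients flow into the encoder
    features = encoder(**batch).last_hidden_state[:, 0]  # [CLS] embeddings

logits = classifier(features)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```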
Question Answering Modes
- Context-based QA (span extraction with provided context)
- Closed-book QA (generate answers without context; both modes are sketched after this list)
- How pre-training quality shapes QA performance
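To make the two modes concrete, here is a minimal sketch using the `transformers` pipelines; the checkpoint names and the example question are illustrative. The extractive model points at a span inside the supplied context, while the text-to-text model is asked the same question with no context and can only rely on whatever it absorbed during pre-training.

```python
from transformers import pipeline

question = "Where was Marie Curie born?"
context = "Marie Curie was born in Warsaw and later moved to Paris to study physics."

# Context-based (extractive) QA: the model selects a span inside the context.
extractive_qa = pipeline("question-answering",
                         model="distilbert-base-cased-distilled-squad")
span = extractive_qa(question=question, context=context)
print(span["answer"], span["score"])  # e.g. "Warsaw" plus a confidence score

# Closed-book QA: a text-to-text model must answer from pre-trained knowledge alone.
closed_book = pipeline("text2text-generation", model="t5-small")
print(closed_book(f"question: {question}")[0]["generated_text"])
```

A small checkpoint like `t5-small` will often answer the closed-book question poorly, which is exactly the point of the last bullet above: closed-book quality tracks the scale and coverage of pre-training.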
BERT Bidirectional Context
- Masked language modeling for contextual embeddings (see the sketch after this list)
- Next sentence prediction for sentence-level coherence
- Using both left and right context to predict tokens
- Typical downstream uses: QA, sentiment, classification
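A short sketch of masked language modeling at inference time, assuming `transformers` and the `bert-base-uncased` checkpoint: the `fill-mask` pipeline predicts the hidden token, and because BERT attends to both sides of the mask, the words after it influence the prediction as much as the words before it.

```python
from transformers import pipeline

# fill-mask runs a masked-LM head on top of the bidirectional encoder.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The right-hand context ("to Paris") steers the prediction of the masked token.
for pred in unmasker("She flew [MASK] to Paris last summer.")[:3]:
    print(f'{pred["token_str"]:>10}  {pred["score"]:.3f}')
```

Dropping the trailing "to Paris last summer" clause and rerunning the call is a quick way to see how much the right context contributes.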
T5 Text-to-Text Multitask
- Unified text-to-text framing for multiple tasks
- Prompting the same model for rating, QA, summarization, translation (see the sketch after this list)
- Scaling with large corpora (e.g., C4 vs. Wikipedia)
- Multitask transfer to improve generalization
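The text-to-text framing means one model, one input string, one output string for every task; the task is selected purely by the text prefix. A minimal sketch, assuming `transformers` and the public `t5-small` checkpoint, using prefixes from the original T5 multitask setup:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Different tasks, same model: only the text prefix changes.
prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: Transfer learning reuses weights from large pre-trained models "
    "so that downstream tasks need far less labeled data and training time.",
    "cola sentence: The car drived fast.",   # grammatical-acceptability task
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

A prefix the checkpoint was never trained on would need its own fine-tuning pass before the outputs become meaningful.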
Training & Data Strategy
- Labeled vs. unlabeled data mix; self-supervised masking (see the sketch after this list)
- Freezing backbone vs. adding task heads
- Fine-tuning recipes for downstream tasks (QA, summarization, translation)
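The self-supervised side of this mix needs no labels at all: training pairs are manufactured by hiding random tokens and asking the model to restore them. A hedged sketch of how such masked batches are typically built, assuming `transformers`' `DataCollatorForLanguageModeling`, which masks roughly 15% of tokens by default and sets the labels of unmasked positions to -100 so the loss ignores them:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# mlm_probability controls what fraction of tokens gets masked or corrupted.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True,
                                           mlm_probability=0.15)

sentences = ["Transfer learning reuses pre-trained weights.",
             "Masked language modeling needs no labeled data."]
encoded = [tokenizer(s) for s in sentences]

batch = collator(encoded)
print(batch["input_ids"][0])   # some ids replaced by the [MASK] token id
print(batch["labels"][0])      # original ids at masked positions, -100 elsewhere
```

The other half of the recipe (freezing the backbone vs. adding a fresh task head) follows the same pattern as the feature-based sketch earlier: set `requires_grad = False` on the backbone parameters you want frozen and give the optimizer only the parameters you intend to update.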
Learning Objectives
- ✅ Explain transfer learning and when to prefer it over training from scratch
- ✅ Distinguish feature-based reuse from full fine-tuning
- ✅ Compare context-based QA and closed-book QA setups
- ✅ Summarize how BERT and T5 pre-train and transfer across tasks
- ✅ Identify why transfer learning reduces data needs and training time
Daily Breakdown
| Day | Focus | Topics |
| --- | --- | --- |
| 46 | Transfer Learning Intro | Classical vs. transfer pipeline, reuse weights, feature-based vs. fine-tuning, benefits |
| 47 | Question Answering | Context-based span QA vs. closed-book QA, data needs, evaluation cues |
| 48 | BERT Bidirectionality | Masked LM, next sentence prediction, leveraging both contexts for token prediction |
| 49 | T5 Multitask Model | Text-to-text prompts, multitask sharing, scaling data (C4 vs. Wikipedia) |
| 50 | Fine-tuning Practice | Freezing layers vs. adding heads, downstream tasks: QA, summarization, translation |
Prerequisites
- Solid grasp of transformer architecture from Week 9
- Comfortable with attention mechanisms and encoder-decoder flow
- Basic familiarity with PyTorch or TensorFlow for fine-tuning
Next Steps
- Read the BERT and T5 papers to internalize pre-training objectives
- Fine-tune a pre-trained BERT QA model (e.g., SQuAD-style span extraction)
- Experiment with T5 prompts for QA, summarization, and sentiment tasks
- Compare feature-based vs. fine-tuned performance on your own dataset