Day 48 - BERT Bidirectional Context
Date: 2025-11-12 (Wednesday)
Status: “Done”
How BERT Learns
BERT is pre-trained with bidirectional context, so each token's representation attends to both its left and right neighbors.
Masked Language Modeling (MLM)
- Randomly select ~15% of tokens for prediction (80% replaced with [MASK], 10% with a random token, 10% left unchanged); train the model to recover the originals.
- Cross-entropy loss over the masked positions pushes the model toward contextual embeddings that use the surrounding words in both directions.
Input: learning from deep learning is like watching the sunset with my best [MASK]
Target: friend
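To see MLM filling in the example above, here is a minimal inference sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available; it queries an already pre-trained model rather than reproducing the pre-training loop itself.

```python
# Minimal sketch: MLM inference with Hugging Face transformers (assumed installed).
# This fills the masked position in the example above using a pre-trained BERT.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

text = "learning from deep learning is like watching the sunset with my best [MASK]"
for pred in fill_mask(text, top_k=3):
    # Each prediction carries the filled-in token and its softmax score.
    print(f"{pred['token_str']:>10s}  {pred['score']:.3f}")
```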
Next Sentence Prediction (NSP)
- Task: predict if sentence B follows sentence A.
- Encourages sentence-pair understanding, which helps downstream tasks like QA and classification.
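The pre-trained NSP head can be queried directly; a minimal sketch, assuming transformers' BertForNextSentencePrediction class and PyTorch (in this head, logit index 0 means "B follows A").

```python
# Minimal sketch of the NSP head, assuming Hugging Face transformers + PyTorch.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sent_a = "I watched the sunset with my best friend."
sent_b = "We stayed until the sky turned dark."

# Encode as a sentence pair: [CLS] A [SEP] B [SEP]
inputs = tokenizer(sent_a, sent_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2)

# Index 0 = "B follows A", index 1 = "B is a random sentence".
probs = torch.softmax(logits, dim=-1)
print(f"P(B follows A) = {probs[0, 0]:.3f}")
```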
Downstream Use
- Start from pre-trained weights.
- Option A: freeze encoder, train a lightweight head (feature-based).
- Option B: fine-tune encoder + head with a small learning rate.
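A minimal sketch of the two options with a classification head, assuming transformers + PyTorch; num_labels=2 is a placeholder and the learning rates are only common starting points.

```python
# Sketch of Option A (feature-based) vs Option B (full fine-tuning).
import torch
from transformers import BertForSequenceClassification


def build_feature_based(model):
    # Option A: freeze the encoder, train only the lightweight classifier head.
    for param in model.bert.parameters():
        param.requires_grad = False
    return torch.optim.AdamW(model.classifier.parameters(), lr=1e-3)


def build_fine_tuned(model):
    # Option B: train encoder + head together with a small learning rate.
    return torch.optim.AdamW(model.parameters(), lr=2e-5)


model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = build_feature_based(model)  # or: build_fine_tuned(model)
```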
Tips
- Keep max_seq_length aligned to task data; long docs may need chunking.
- Watch for catastrophic forgetting; gradual unfreezing can help.
- Small batch? Use gradient accumulation to stabilize updates.
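A self-contained toy sketch of gradient accumulation in plain PyTorch; the linear model and random micro-batches are stand-ins, not BERT, but the update pattern is the same.

```python
# Gradient accumulation: several small backward passes per optimizer step.
import torch
from torch import nn

model = nn.Linear(16, 2)                      # stand-in for a BERT encoder + head
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

micro_batches = [(torch.randn(4, 16), torch.randint(0, 2, (4,))) for _ in range(32)]
accum_steps = 8                               # effective batch = 4 * 8 = 32

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average out
    loss.backward()                            # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one update per accumulated "big" batch
        optimizer.zero_grad()
```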
Practice Targets for Today
- Prepare a BERT QA fine-tune plan (dataset, max length, lr, epochs); a starting-point sketch follows this list.
- Decide whether to freeze lower layers for your data size.
- Add evaluation checkpoints (dev set EM/F1) to catch overfitting early.
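To make the plan concrete, a hypothetical config sketch; the dataset label and hyperparameter values are common starting points for BERT-base extractive QA, not tuned numbers.

```python
# Hypothetical fine-tune plan for BERT-base extractive QA; values are
# typical starting points (assumption) to be adjusted against dev EM/F1.
qa_finetune_plan = {
    "dataset": "SQuAD-style extractive QA",
    "model": "bert-base-uncased",
    "max_seq_length": 384,          # question + context window
    "doc_stride": 128,              # sliding-window overlap for long contexts
    "learning_rate": 3e-5,          # small lr to limit catastrophic forgetting
    "num_epochs": 2,
    "train_batch_size": 16,         # use gradient accumulation if memory-bound
    "freeze_lower_layers": False,   # revisit if the dataset is small
    "eval_every_steps": 500,        # dev-set EM/F1 checkpoints to catch overfitting
}
```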