Description: MIT's introductory course on AI safety, focusing on empirical ML techniques that help mitigate catastrophic risks from AI. Topics include reinforcement learning (including RL from human feedback), jailbreaking large language models, transformer circuits, superposition in neural networks, and detecting deception in ML models. The course covers foundational results as well as cutting-edge work from this emerging field. The class will have two labs, where instructors will guide students through implementations of techniques taught in the lectures.
Prerequisites: 6.3900 (6.036) or equivalent.
Instructors: Eric Gan, Eleni Shor, Julian Yocum
- Dates: Weeks of 1-15-24 and 1-22-24
- Classes: Monday, Tuesday, Wednesday from 3 - 4:30 PM
- Labs: Thursday from 2 - 5 PM
- Room: 36-112 (both lectures and labs)
- Google Calendar Link
| Date | Time | Topic | Material |
|---|---|---|---|
| Mon 1-15 | 3 - 4:30 PM | Lecture 1: Reinforcement Learning | Slides |
| Tue 1-16 | 3 - 4:30 PM | Lecture 2: CANCELLED | |
| Wed 1-17 | 3 - 4:30 PM | Lecture 3: Language Model Alignment | Slides |
| Thu 1-18 | 2 - 5 PM | Lab 1 | • PyTorch basics • Multi-armed bandits • Deep Q-Learning |
| Mon 1-22 | 3 - 4:30 PM | Lecture 4: Transformers | • Transformer architecture • Induction heads • Transformer circuits |
| Tue 1-23 | 3 - 4:30 PM | Lecture 5: Model Internals | • Feature visualization • Superposition • Sparse autoencoders |
| Wed 1-24 | 3 - 4:30 PM | Lecture 6: Scalable Safety | • Scaling laws and emergence • Model evaluations • Detecting deception |
| Thu 1-25 | 2 - 5 PM | Lab 2 | • Build a transformer • Sparse autoencoders • Interpretability of modular addition |
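As a taste of what Lab 1 covers, here is a minimal epsilon-greedy multi-armed bandit sketch. It is illustrative only and not the lab's actual starter code: the arm means, step count, and epsilon are made-up values, and it uses plain Python rather than PyTorch.

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a Gaussian multi-armed bandit.

    true_means: hypothetical per-arm reward means (illustrative values).
    Returns the action-value estimates Q and per-arm pull counts.
    """
    rng = random.Random(seed)
    k = len(true_means)
    Q = [0.0] * k      # running estimate of each arm's value
    counts = [0] * k   # number of times each arm was pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                       # explore: random arm
        else:
            arm = max(range(k), key=lambda a: Q[a])      # exploit: best estimate
        reward = rng.gauss(true_means[arm], 1.0)         # noisy reward sample
        counts[arm] += 1
        Q[arm] += (reward - Q[arm]) / counts[arm]        # incremental mean update

    return Q, counts

Q, counts = epsilon_greedy_bandit([0.1, 0.5, 0.9])
```

With enough steps, the agent concentrates its pulls on the highest-mean arm while the occasional random pull keeps its estimates of the other arms from going stale; the same explore-exploit tension reappears in Deep Q-Learning later in the lab.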