MIT IAP AI Safety Class

Course Overview

Description: MIT's introductory course on AI safety, focusing on empirical ML techniques that help mitigate catastrophic risks from AI. Topics include reinforcement learning (including RL from human feedback), jailbreaking large language models, transformer circuits, superposition in neural networks, and detecting deception in ML models. The class gives exposure to both foundational and cutting-edge results from this emerging field, and includes two labs in which instructors guide students through implementations of the techniques taught in lecture.

Prerequisites: 6.3900 (formerly 6.036) or equivalent.

Instructors: Eric Gan, Eleni Shor, Julian Yocum

Sign Up Here!

Logistics

  • Dates: Weeks of 1-15-24 and 1-22-24
  • Classes: Monday, Tuesday, Wednesday from 3 - 4:30 PM
  • Labs: Thursday from 2 - 5 PM
  • Room: 36-112 (both lectures and labs)
  • Google Calendar Link

Schedule

| Date | Time | Topic | Material |
| --- | --- | --- | --- |
| Mon 1-15 | 3 - 4:30 PM | Lecture 1: Reinforcement Learning | Slides |
| Tue 1-16 | 3 - 4:30 PM | Lecture 2: CANCELLED | |
| Wed 1-17 | 3 - 4:30 PM | Lecture 3: Language Model Alignment | Slides |
| Thu 1-18 | 2 - 5 PM | Lab 1 | PyTorch basics • Multi-armed bandits (see sketch below) • Deep Q-Learning |
| Mon 1-22 | 3 - 4:30 PM | Lecture 4: Transformers | Transformer architecture • Induction heads • Transformer circuits |
| Tue 1-23 | 3 - 4:30 PM | Lecture 5: Model Internals | Feature visualization • Superposition • Sparse autoencoders |
| Wed 1-24 | 3 - 4:30 PM | Lecture 6: Scalable Safety | Scaling laws and emergence • Model evaluations • Detecting deception |
| Thu 1-25 | 2 - 5 PM | Lab 2 | Build a transformer • Sparse autoencoders (see sketch below) • Interpretability of modular addition |
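
For a taste of Lab 1's multi-armed bandit material, here is a minimal epsilon-greedy sketch. It is not the lab's starter code; the 10-armed Gaussian environment, the epsilon value, and the step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 10-armed Gaussian bandit: arm i pays out N(true_means[i], 1).
true_means = rng.normal(0.0, 1.0, size=10)
n_arms = len(true_means)

q_estimates = np.zeros(n_arms)  # running estimate of each arm's value
pull_counts = np.zeros(n_arms)  # how many times each arm was pulled
epsilon = 0.1                   # exploration rate (illustrative)

for step in range(1000):
    # Explore with probability epsilon, otherwise exploit the best estimate.
    if rng.random() < epsilon:
        arm = int(rng.integers(n_arms))
    else:
        arm = int(np.argmax(q_estimates))

    reward = rng.normal(true_means[arm], 1.0)

    # Incremental sample-average update: Q <- Q + (r - Q) / n.
    pull_counts[arm] += 1
    q_estimates[arm] += (reward - q_estimates[arm]) / pull_counts[arm]

print("best arm:", int(np.argmax(true_means)),
      "| most pulled:", int(np.argmax(pull_counts)))
```

The sample-average update keeps a running mean of each arm's observed reward, so the agent gradually concentrates on the best arm while epsilon maintains a floor of exploration.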
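Similarly, for Lab 2, here is a minimal PyTorch sketch of the sparse-autoencoder idea from Lecture 5. The dimensions, sparsity coefficient, and random stand-in activations are placeholders, not the lab's actual setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Maps d_model-dim activations to an overcomplete d_hidden-dim code,
    with an L1 penalty on the code to encourage sparsity."""

    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        code = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(code)          # reconstruction of the input
        return recon, code

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity strength (illustrative)

# Stand-in for a batch of residual-stream activations from a transformer.
acts = torch.randn(64, 512)

recon, code = sae(acts)
loss = torch.mean((recon - acts) ** 2) + l1_coeff * code.abs().mean()

opt.zero_grad()
loss.backward()
opt.step()
```

Trained on real transformer activations rather than random tensors, the reconstruction term preserves the information in the activations while the L1 penalty pushes the overcomplete code toward sparse, and hopefully interpretable, feature directions.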
