
# interpretability-toolkit

Practical tools for mechanistic interpretability of neural networks. Built for AI safety researchers who need to understand what's happening inside language models.

## Motivation

Mechanistic interpretability is one of the most promising approaches to AI safety. If we can understand the internal computations of neural networks — the actual algorithms they implement — we can identify potential failure modes before deployment and build more trustworthy systems.

This toolkit provides composable, well-tested building blocks for interpretability research, focused on transformer-based language models.

## Capabilities

### Activation Analysis

Extract, cache, and analyze intermediate activations from any layer of a transformer. Supports residual stream, attention patterns, and MLP activations.

```python
from interp_toolkit import ActivationCache
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
cache = ActivationCache(model)

activations = cache.run("The capital of France is")
residual = activations.residual_stream(layer=6)  # (seq_len, d_model)
attn_pattern = activations.attention_pattern(layer=6, head=3)  # (seq_len, seq_len)
```
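
Under the hood, hook-based capture like this can be sketched with plain PyTorch forward hooks. The snippet below uses a toy `nn.Sequential` stand-in rather than a real transformer, and is an illustration of the general mechanism, not the toolkit's actual implementation:

```python
import torch
import torch.nn as nn

# Toy stand-in for a stack of transformer sublayers (illustrative only).
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

cache = {}

def make_hook(name):
    def hook(module, inputs, output):
        cache[name] = output.detach()  # detach so the cache holds plain tensors
    return hook

handles = [
    layer.register_forward_hook(make_hook(f"layer_{i}"))
    for i, layer in enumerate(model)
]

x = torch.randn(4, 8)  # (seq_len, d_model) for the toy model
model(x)               # one forward pass populates the cache

for h in handles:      # remove hooks so later passes are unaffected
    h.remove()

print(sorted(cache))   # every sublayer's output was captured
```

Removing the hook handles after the pass matters in practice: leaked hooks silently slow down and re-capture every subsequent forward call.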

### Linear Probes

Train linear probes on model internals to detect learned features — sentiment, factuality, named entities, syntactic structure, or custom concepts.

```python
from interp_toolkit.probes import LinearProbe

probe = LinearProbe(input_dim=768, num_classes=2)
probe.train(
    activations=residual_stream_data,
    labels=factuality_labels,
    epochs=50,
)
accuracy = probe.evaluate(test_activations, test_labels)
# accuracy: 0.94 — the model represents factuality at layer 6
```
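
Conceptually, a linear probe is just a linear classifier fit on activation vectors. A minimal self-contained sketch with scikit-learn, using synthetic data in place of real residual-stream activations (the "concept" here is artificially encoded along dimension 0):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for residual-stream activations: a binary concept
# is linearly encoded along dimension 0 of each vector.
d_model = 64
X = rng.normal(size=(500, d_model))
y = (X[:, 0] > 0).astype(int)

# A linear probe is a linear classifier trained on the activations.
probe = LogisticRegression(max_iter=1000).fit(X[:400], y[:400])
accuracy = probe.score(X[400:], y[400:])
print(f"probe accuracy: {accuracy:.2f}")
```

High held-out accuracy is evidence the representation is linearly decodable; it does not by itself show the model *uses* that direction, which is what the causal tools below are for.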

### Activation Patching

Causally intervene on model internals by patching activations from one forward pass into another. Essential for understanding which components are necessary and sufficient for a behavior.

```python
from interp_toolkit.circuits import activation_patch

result = activation_patch(
    model=model,
    clean_input="The Eiffel Tower is in",
    corrupt_input="The Colosseum is in",
    target_token="Paris",
    patch_layer=8,
    patch_component="mlp",
)
print(f"Logit difference change: {result.logit_diff_change:.3f}")
```
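
The patching operation itself can be sketched with plain PyTorch hooks on a toy model (not the toolkit's implementation): cache an activation from the clean run, then re-run on the corrupt input while overwriting that component's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer network standing in for a transformer (illustrative only).
model = nn.Sequential(nn.Linear(4, 4), nn.Tanh(), nn.Linear(4, 2))
layer = model[0]                 # the component we patch

clean_x = torch.randn(1, 4)
corrupt_x = torch.randn(1, 4)

# 1. Cache the component's activation on the clean input.
saved = {}
h = layer.register_forward_hook(lambda m, i, o: saved.update(act=o.detach()))
clean_out = model(clean_x)
h.remove()

# 2. Re-run on the corrupt input, overwriting the component's output with
#    the cached clean activation (returning a tensor from a forward hook
#    replaces the module's output).
h = layer.register_forward_hook(lambda m, i, o: saved["act"])
patched_out = model(corrupt_x)
h.remove()

corrupt_out = model(corrupt_x)   # unpatched corrupt run, for comparison
```

Because the entire first layer's output is patched here, the patched run reproduces the clean output exactly; patching a narrower slice (one head, one position) is what localizes the effect to a component.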

### Circuit Discovery

Automatically identify minimal circuits responsible for specific behaviors using iterative ablation and path patching.

```python
from interp_toolkit.circuits import CircuitFinder

finder = CircuitFinder(model, threshold=0.01)
circuit = finder.find_circuit(
    clean_inputs=["The doctor said she", "The nurse said he"],
    corrupt_inputs=["The doctor said he", "The nurse said she"],
    target_metric="logit_diff",
)
print(circuit.summary())
# Circuit: 5 attention heads, 2 MLP layers
# Key heads: L5H1 (name mover), L7H3 (subject tracker)
```

### Visualization

Interactive visualizations for attention patterns, activation distributions, and circuit diagrams.

```python
from interp_toolkit.visualization import plot_attention, plot_circuit

plot_attention(attn_pattern, tokens=["The", "capital", "of", "France", "is"])
plot_circuit(circuit, output="circuit_diagram.html")
```
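
An attention-pattern heatmap of the kind `plot_attention` presumably produces is easy to sketch with matplotlib. The pattern below is synthetic (lower-triangular and row-normalized, mimicking causal attention), purely for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # headless backend, safe for scripts/CI
import matplotlib.pyplot as plt

tokens = ["The", "capital", "of", "France", "is"]
n = len(tokens)

# Synthetic causal attention pattern: lower-triangular, rows sum to 1.
rng = np.random.default_rng(0)
attn = np.tril(rng.random((n, n)))
attn /= attn.sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
ax.imshow(attn, cmap="viridis")
ax.set_xticks(range(n))
ax.set_xticklabels(tokens, rotation=45)
ax.set_yticks(range(n))
ax.set_yticklabels(tokens)
ax.set_xlabel("key position")
ax.set_ylabel("query position")
fig.savefig("attention.png", bbox_inches="tight")
```

Rows are query positions, columns are key positions; for a causal model everything above the diagonal should be zero.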

## Architecture

```text
interp_toolkit/
├── activations/       # Activation extraction and caching
│   ├── cache.py       # Hook-based activation capture
│   └── store.py       # Disk-backed activation storage
├── probes/            # Linear and nonlinear probing
│   ├── linear.py      # Linear probe implementation
│   └── trainer.py     # Probe training utilities
├── circuits/          # Circuit analysis tools
│   ├── patching.py    # Activation and path patching
│   ├── ablation.py    # Ablation studies
│   └── finder.py      # Automatic circuit discovery
└── visualization/     # Plotting and interactive vis
    ├── attention.py   # Attention pattern plots
    └── circuits.py    # Circuit diagram generation
```

## Installation

```bash
pip install interpretability-toolkit
```

For GPU support (the extra is quoted so the brackets survive shell globbing):

```bash
pip install "interpretability-toolkit[gpu]"
```

## Research Applications

This toolkit has been used for:

- Factuality circuits: identifying which attention heads track factual associations
- Sycophancy mechanisms: locating components that cause models to agree with users
- Refusal circuits: understanding how safety training modifies model internals
- Capability elicitation: finding latent capabilities through activation steering
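
The activation steering mentioned in the last item can be sketched with a forward hook that adds a steering vector to a layer's output. The model and vector here are toy stand-ins; real steering vectors are usually computed as differences of mean activations between contrasting prompts:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

# Synthetic steering vector, purely for illustration.
steering_vector = torch.randn(8)
alpha = 4.0                       # steering strength

hook = model[0].register_forward_hook(
    lambda m, i, out: out + alpha * steering_vector
)
x = torch.randn(1, 8)
steered = model(x)
hook.remove()

unsteered = model(x)
# Adding the vector at an intermediate layer shifts all downstream
# computation, which is the mechanism behind steering behavior.
print((steered - unsteered).abs().max())
```

The strength `alpha` trades off effect size against coherence: too small and the behavior doesn't change, too large and the activations leave the distribution the later layers were trained on.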

## Citation

```bibtex
@software{calkin2026interptoolkit,
  title={interpretability-toolkit: Practical Tools for Mechanistic Interpretability},
  author={Calkin, Maxwell},
  year={2026},
  url={https://github.com/MaxwellCalkin/interpretability-toolkit}
}
```

## License

MIT License. See LICENSE.
