# interpretability-toolkit

Practical tools for mechanistic interpretability of neural networks. Built for AI safety researchers who need to understand what's happening inside language models.
Mechanistic interpretability is one of the most promising approaches to AI safety. If we can understand the internal computations of neural networks — the actual algorithms they implement — we can identify potential failure modes before deployment and build more trustworthy systems.
This toolkit provides composable, well-tested building blocks for interpretability research, focused on transformer-based language models.
## Activation Caching

Extract, cache, and analyze intermediate activations from any layer of a transformer. Supports residual stream, attention patterns, and MLP activations.

```python
from interp_toolkit import ActivationCache
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
cache = ActivationCache(model)

activations = cache.run("The capital of France is")
residual = activations.residual_stream(layer=6)                # (seq_len, d_model)
attn_pattern = activations.attention_pattern(layer=6, head=3)  # (seq_len, seq_len)
```

## Linear Probes

Train linear probes on model internals to detect learned features — sentiment, factuality, named entities, syntactic structure, or custom concepts.
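Under the hood, a linear probe of this kind is essentially logistic regression on frozen activations. A minimal NumPy sketch of that idea on synthetic data (everything here is illustrative, not the toolkit's internals):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations": two classes shifted along one direction.
d_model, n = 16, 400
direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + np.outer(2 * labels - 1, direction)

# Logistic-regression probe trained by plain gradient descent.
w, b = np.zeros(d_model), 0.0
for _ in range(200):
    probs = 1 / (1 + np.exp(-(acts @ w + b)))
    grad = probs - labels                  # d(cross-entropy)/d(logits)
    w -= 0.1 * acts.T @ grad / n
    b -= 0.1 * grad.mean()

accuracy = ((acts @ w + b > 0) == labels).mean()
```

High accuracy on held-out activations is evidence the concept is linearly decodable at that layer, though not that the model causally uses it.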
```python
from interp_toolkit.probes import LinearProbe

probe = LinearProbe(input_dim=768, num_classes=2)
probe.train(
    activations=residual_stream_data,
    labels=factuality_labels,
    epochs=50,
)
accuracy = probe.evaluate(test_activations, test_labels)
# accuracy: 0.94 — the model represents factuality at layer 6
```

## Activation Patching

Causally intervene on model internals by patching activations from one forward pass into another. Essential for understanding which components are necessary and sufficient for a behavior.
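The mechanics are easy to see on a toy network: cache an activation from a clean run, then splice it into a corrupt run at the same point. A NumPy sketch (the two-layer network and the names here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-layer network standing in for a transformer component.
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

def forward(x, patch_hidden=None):
    hidden = np.maximum(x @ W1, 0)     # intermediate activation
    if patch_hidden is not None:
        hidden = patch_hidden          # splice in a cached activation
    return hidden @ W2

clean_x, corrupt_x = rng.normal(size=8), rng.normal(size=8)

# 1. Cache the hidden activation from the clean run.
clean_hidden = np.maximum(clean_x @ W1, 0)

# 2. Run the corrupt input, patching in the clean activation.
patched_out = forward(corrupt_x, patch_hidden=clean_hidden)

# Downstream of the patch the output matches the clean run exactly,
# because this toy layer is the only path from input to output.
print(np.allclose(patched_out, forward(clean_x)))  # True
```

In a real transformer the residual stream carries other paths around the patched component, so a patch moves the logits only partially; that partial shift is what a logit-difference metric quantifies.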
```python
from interp_toolkit.circuits import activation_patch

result = activation_patch(
    model=model,
    clean_input="The Eiffel Tower is in",
    corrupt_input="The Colosseum is in",
    target_token="Paris",
    patch_layer=8,
    patch_component="mlp",
)
print(f"Logit difference change: {result.logit_diff_change:.3f}")
```

## Circuit Discovery

Automatically identify minimal circuits responsible for specific behaviors using iterative ablation and path patching.
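The ablation half of this procedure reduces to: knock out one component at a time and keep those whose removal moves the target metric by more than the threshold. A toy NumPy sketch (the component weights are invented so that only components 1 and 4 matter):

```python
import numpy as np

rng = np.random.default_rng(2)

# Per-example outputs of 6 hypothetical components; the "model" score
# is their weighted sum. Only components 1 and 4 carry real weight.
weights = np.array([0.0, 2.0, 0.001, 0.0, -1.5, 0.002])
contribs = rng.normal(loc=1.0, size=(100, 6))

def metric(mask):
    # Behavior score with ablated (masked-out) components zeroed.
    return (contribs * (weights * mask)).sum(axis=1).mean()

baseline = metric(np.ones(6))
threshold = 0.01

circuit = []
for i in range(6):
    mask = np.ones(6)
    mask[i] = 0.0                      # ablate component i
    if abs(metric(mask) - baseline) > threshold:
        circuit.append(i)

print(circuit)  # [1, 4]
```

Real circuit discovery additionally uses path patching, which tests edges between components rather than whole components, to prune nodes that matter only through irrelevant paths.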
```python
from interp_toolkit.circuits import CircuitFinder

finder = CircuitFinder(model, threshold=0.01)
circuit = finder.find_circuit(
    clean_inputs=["The doctor said she", "The nurse said he"],
    corrupt_inputs=["The doctor said he", "The nurse said she"],
    target_metric="logit_diff",
)
print(circuit.summary())
# Circuit: 5 attention heads, 2 MLP layers
# Key heads: L5H1 (name mover), L7H3 (subject tracker)
```

## Visualization

Interactive visualizations for attention patterns, activation distributions, and circuit diagrams.
```python
from interp_toolkit.visualization import plot_attention, plot_circuit

plot_attention(attn_pattern, tokens=["The", "capital", "of", "France", "is"])
plot_circuit(circuit, output="circuit_diagram.html")
```

## Project Structure

```
interp_toolkit/
├── activations/       # Activation extraction and caching
│   ├── cache.py       # Hook-based activation capture
│   └── store.py       # Disk-backed activation storage
├── probes/            # Linear and nonlinear probing
│   ├── linear.py      # Linear probe implementation
│   └── trainer.py     # Probe training utilities
├── circuits/          # Circuit analysis tools
│   ├── patching.py    # Activation and path patching
│   ├── ablation.py    # Ablation studies
│   └── finder.py      # Automatic circuit discovery
└── visualization/     # Plotting and interactive vis
    ├── attention.py   # Attention pattern plots
    └── circuits.py    # Circuit diagram generation
```
## Installation

```
pip install interpretability-toolkit
```

For GPU support:

```
pip install interpretability-toolkit[gpu]
```

## Use Cases

This toolkit has been used for:
- Factuality circuits: Identifying which attention heads track factual associations
- Sycophancy mechanisms: Locating components that cause models to agree with users
- Refusal circuits: Understanding how safety training modifies model internals
- Capability elicitation: Finding latent capabilities through activation steering
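Activation steering, mentioned in the last item, means adding a fixed direction to an intermediate activation at inference time. A minimal PyTorch sketch of the hook mechanics on a toy model (the steering vector and layer choice are arbitrary here; real work derives the direction from contrasting activations):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer model standing in for a transformer.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 4))

steer = torch.zeros(8)
steer[0] = 5.0        # hypothetical "concept" direction

def steering_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    return output + steer

handle = model[0].register_forward_hook(steering_hook)

x = torch.randn(3, 8)
with torch.no_grad():
    steered = model(x)
    handle.remove()
    unsteered = model(x)

# The shift in the final layer is the steering direction
# mapped through the second layer's weights.
expected = steer @ model[1].weight.T
print(torch.allclose(steered - unsteered, expected.expand(3, -1), atol=1e-4))  # True
```

Because the intervention is a pure add on one layer's output, it composes with any model that exposes hooks; no weights are modified.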
## Citation

```bibtex
@software{calkin2026interptoolkit,
  title={interpretability-toolkit: Practical Tools for Mechanistic Interpretability},
  author={Calkin, Maxwell},
  year={2026},
  url={https://github.com/MaxwellCalkin/interpretability-toolkit}
}
```

## License

MIT License. See LICENSE.