Detecting High-Stakes Interactions using Activation Probes

This repo contains the code for all the experiments in the paper "Detecting High-Stakes Interactions using Activation Probes" (arXiv), presented at the ICML 2025 Workshop on Actionable Interpretability and submitted to NeurIPS 2025.

There's a lot of good stuff in here but it's in a bit of a rough state. For a cleaner codebase that's easier to use, check out TuberLens.

Setup

In order to run this code:

  1. Install uv and run uv sync
  2. Create a Cloudflare account and an R2 bucket to store datasets & activations
  3. Add a .env file to the project root with the following environment variables:
OPENAI_API_KEY=
OPEN_ROUTER_API_KEY=
HF_TOKEN=
R2_ACCESS_KEY_ID=
R2_SECRET_ACCESS_KEY=
R2_DATASETS_BUCKET=
R2_ACTIVATIONS_BUCKET=
R2_ACCOUNT_ID=
ACTIVATIONS_DIR=
HF_HOME=
WANDB_API_KEY=

Activations

In order to train or run inference with probes, you'll need to compute and store activations. You can do that using the mup acts store command. Here is an example:

uv run mup acts store --model 'meta-llama/Llama-3.2-1B-Instruct' --layer 11 --dataset data/training/prompts_4x/train.jsonl

This will compute activations, save them locally to ACTIVATIONS_DIR, and upload them to R2_ACTIVATIONS_BUCKET.

Datasets

We contribute a new synthetic dataset that we use for training, as well as slightly modified external datasets, labelled for stakes, that we use for evaluation.

Our datasets can be found here:

Dataset Name      Balanced       Raw
Training          train; test    -
Anthropic HH      dev; test      dev; test
MT                dev; test      dev; test
MTS               dev; test      dev; test
ToolACE           dev; test      dev; test
Mental Health     test           test
Aya Redteaming    test           test

Before running any experiments, make sure to download the datasets. Put the dev and test eval datasets into data/evals/dev/ and data/evals/test/ respectively, and the training data into data/training/prompts_4x/. (You can also use other paths if you adjust the configuration in config/eval_datasets/ and SYNTHETIC_DATASET_PATH in src/models_under_pressure/config.py.)
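If the default directories don't exist yet, you can create them to match the paths above (a minimal shell sketch; skip or adapt it if you use custom paths):

mkdir -p data/evals/dev data/evals/test data/training/prompts_4x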

Below you can also find information on how these datasets were generated.

Generating Synthetic Training Data

The code for generating the synthetic dataset can be found in models_under_pressure/dataset_generation: in particular, to generate a synthetic dataset, run situation_generation.py, then prompt_generation.py.
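A possible invocation, assuming the scripts live under src/models_under_pressure/dataset_generation/ and can be run directly from the repository root:

uv run python src/models_under_pressure/dataset_generation/situation_generation.py
uv run python src/models_under_pressure/dataset_generation/prompt_generation.py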

The code used for filtering confounding samples is in the filter_dataset function in models_under_pressure/scripts/analyse_confounders.py.

Generating Dev Datasets for Evaluation

Run the files anthropic_dataset.py, mt_dataset.py, mts_dataset.py and toolace_dataset.py from src/models_under_pressure/eval_datasets/. That will create the corresponding dataset files (raw and balanced) in the dev evals directory (see config.py).
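For example, assuming the scripts are run directly from the repository root (paths per the description above):

uv run python src/models_under_pressure/eval_datasets/anthropic_dataset.py
uv run python src/models_under_pressure/eval_datasets/mt_dataset.py
uv run python src/models_under_pressure/eval_datasets/mts_dataset.py
uv run python src/models_under_pressure/eval_datasets/toolace_dataset.py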

Run the script label_distribution.py from src/models_under_pressure/scripts/ to see the number of high-stakes, low-stakes and ambiguous samples for each eval dataset.

Warning: Creating further dev samples when test datasets are already present can lead to overlap between dev and test. Ideally, first generate the full dev datasets and then create the test datasets to avoid overlap.

Generating Test Datasets for Evaluation

Run the files anthropic_dataset.py, mt_dataset.py, mts_dataset.py and toolace_dataset.py from src/models_under_pressure/eval_datasets/, using the --split=test argument. That will create the corresponding dataset files (raw and balanced) in the test evals directory (see config.py).
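For example (same assumptions as for the dev datasets, adding the --split=test argument mentioned above):

uv run python src/models_under_pressure/eval_datasets/anthropic_dataset.py --split=test
uv run python src/models_under_pressure/eval_datasets/mt_dataset.py --split=test
uv run python src/models_under_pressure/eval_datasets/mts_dataset.py --split=test
uv run python src/models_under_pressure/eval_datasets/toolace_dataset.py --split=test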

After that, run the script eval_dataset_split_check.py to ensure that there is no overlap between dev and test datasets. (Note that MT does have duplicates, so for that dataset you can expect some overlap by default.)

Notes on Dataset Versions

Apr 15 versions (all based on calling parts of modify_dataset.py):

  • ToolACE dev and test datasets: Based on the previous dev dataset (raw version); modified the system prompt and relabelled after adding it.
  • Anthropic dev and test datasets: Added a system prompt to each sample; otherwise no changes.
  • MT dev and test datasets: Added a system prompt to each sample, removed cases where the transcription length is less than the description length, and added more info to the input.

Apr 16 versions (all based on calling parts of modify_dataset.py):

  • MTS dev and test datasets: Parsed conversations (using strict mode), added a system prompt and relabelled.
  • MT dev and test datasets: Added a system prompt to each sample, added more info to the input and relabelled.
  • Anthropic dev and test datasets: Added a system prompt to each sample and relabelled.

Apr 22 versions (all based on calling parts of modify_dataset.py):

  • MTS dev and test datasets: Same as before with a slightly modified system prompt (fixed a typo and dropped the "guest families" part).
  • ToolACE dev and test datasets: Based on the original dataset; modified the system prompt (only changing the first sentence and removing a later confusing sentence) and relabelled after adding it.
  • Aya Redteaming dataset (only test): Added a system prompt and relabelled.
  • Mental health dataset (only test): Added a system prompt and relabelled.

Apr 23 version of Anthropic (dev and test): Removed the duplicate system prompt and relabelled again just in case.

Apr 30 version of MT (dev and test): Moved 350 samples from test raw to dev raw and resampled the balanced versions.

Deployment Context Datasets

Medical deployment dataset:

  • Pair IDs up to 60 were generated using Gemini 2.5 Pro
  • Additional pairs were created with GPT-4.5, giving the pairs from Gemini as examples
  • The script create_deployment_datasets.py was used to convert the pairs into a proper Dataset and relabel them (which led to the removal of many items)

Software deployment dataset:

  • All items were generated with GPT-4.5
  • The script create_deployment_datasets.py was used to convert the items into a proper Dataset and relabel them (which led to the removal of many items)

Chatbot deployment dataset:

  • All items were generated with GPT-4.5
  • The script create_deployment_datasets.py was used to convert the items into a proper Dataset and relabel them (which led to the removal of many items)

Combined deployment dataset: Created by concatenating all previous datasets.

Running Experiments

Probe Comparison

  • Run scripts/compare_probes.sh to train probes and run prediction against all evaluation datasets
  • Run compare_probes_plot.py to generate the bar chart (see the command sketch below)
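A possible sequence of commands, assuming compare_probes_plot.py lives under src/models_under_pressure/figures/ (adjust the path if your layout differs):

bash scripts/compare_probes.sh
uv run python src/models_under_pressure/figures/compare_probes_plot.py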

Probe Calibration

Run experiments/calibration.py, modifying the paths in the if __name__ == '__main__' block to point to the relevant files generated by compare_probes.sh.

Probe Visualisation

To generate the per-token probe scores for visualisation, run:

uv run mup exp +experiment=evaluate_probe probe=attention +id=attn_viz_dev

Then to visualise the probe scores in the dashboard, run:

uv run mup dashboard data/results/evaluate_probes/results_attn_viz_dev.jsonl

Compare Probes to Baselines Plot

To generate the bar charts comparing probes and baselines, run models_under_pressure/figures/probes_vs_baselines_plot.py. This should only be run after the "Probe Comparison" step and after all baselines have been run. Adjust the paths to the results files accordingly.
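A possible invocation, assuming the file sits under src/ and is run from the repository root (which results files it picks up depends on your earlier runs):

uv run python src/models_under_pressure/figures/probes_vs_baselines_plot.py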

Comparing Probes to Word Statistics Baselines

To generate the plot comparing the attention probe with a TF-IDF-based classifier, run models_under_pressure/scripts/analyse_confounders.py.

Cross Validation

To run cross-validation and discover which layer has the best accuracy, run uv run mup exp +experiment=cv.

Generalisation Heatmaps

To generate the data for the generalisation heatmaps, run uv run mup exp +experiment=generalisation_heatmap.

Then to create the plots, run models_under_pressure/figures/generalisation_heatmap_plot.py, changing the paths to the data files if necessary.

Training on Dev Split of Evaluation Datasets

  • Run experiments/dev_split_training.py for the best probe with different settings of dev_sample_usage. By default, the script computes results 5 times with the same settings.
    • Important: Set gradient_accumulation_steps to 1 in the config of the corresponding probe, since the training data for this experiment can consist of only a few samples, and no learning occurs if the number of batches is less than the number of gradient accumulation steps (see the config sketch after this list).
  • Run figures/dev_split_training_plot.py to generate the corresponding plot. Adjust the file paths at the end of the file beforehand.
    • If you want to include the line for the baseline, you can obtain the corresponding file from the cascade experiment.
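A hypothetical config excerpt illustrating the setting mentioned above (only the field name gradient_accumulation_steps comes from the text; where exactly it sits in the probe's config file is an assumption):

# in the config of the probe you are training (under config/)
gradient_accumulation_steps: 1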

Data Efficiency Experiment

Code for running the data efficiency experiment is included in experiments/data_efficiency.py. Run that file directly after adjusting the configurations at the end of it and choosing which of the functions defined there to call:

  • Use the function run_data_efficiency_experiment to get results for different types of probes.
  • Use the function run_data_efficiency_finetune_baseline_with_activations to compute results for the finetuned baselines. (Adjust config accordingly and run one baseline model at a time.)

Generate the plot by putting all results files into a single directory and calling the script figures/plot_data_efficiency.py.
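A rough shell sketch of the plotting step (the collection directory, the result-file pattern and the src/ prefix of the script path are assumptions; adjust them to your setup):

mkdir -p data/results/data_efficiency_combined
cp path/to/your/data_efficiency_results/*.jsonl data/results/data_efficiency_combined/
uv run python src/models_under_pressure/figures/plot_data_efficiency.py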

Cascade Plot

  • To generate finetuning results, run notebooks/finetuning_for_cascade.py (adjust the settings in that script depending on the model you want to finetune)
  • To generate the other results, run experiments/monitoring_cascade.py. The corresponding configuration files can be found under config/experiments/monitoring_cascade.yaml and config/experiments/monitoring_cascade/. It has one part for computing the results and a second part to generate the plot based on the results.
    • Result generation: The script generates result files for the selected probe and the continuation baselines.
    • Plot generation: Make sure that all the relevant files are included in one directory. This typically involves moving the finetuned baseline results into the directory with the other results. Then run the analysis step of the script.
    • For generating the full cascade plot (appendix), make sure that in analyze.yaml the baseline_models and finetuned_baseline_models selections are both set to null, so that all results are displayed (see the sketch below). You might also want to tweak a few arguments of the plotting function, such as reducing y_lim.
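For the full cascade plot, the relevant part of analyze.yaml would then look roughly like this (field names taken from the description above; the exact location of the file under config/experiments/monitoring_cascade/ and the rest of its contents are assumptions):

# excerpt of analyze.yaml for the full cascade plot
baseline_models: null
finetuned_baseline_models: null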

Figure 1 Plot

  • The script to generate this plot uses outputs from the cascade experiment. Run the cascade experiment to compute the full results (e.g. using null for model selections in analyze.yaml)
  • Then run figures/plot_method_comparison.py

Computing Baselines

Prompted Baselines

Run uv run mup exp +experiment=run_baselines model=<MODEL> (replacing <MODEL> with "llama-1b", "llama-70b", "gemma-1b", etc.) to generate the results of the respective prompted model on all dev datasets and all prompt templates (make sure the default for eval_datasets in config/config.yaml is set to dev_balanced). All results are written in JSONL format to a single results file.
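For example, to evaluate the prompted Llama 1B baseline on the dev datasets:

uv run mup exp +experiment=run_baselines model=llama-1b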

Finetuned Baselines

Run models_under_pressure/baselines/finetune.py (replacing the finetune_model variable in run_finetune_baselines() with "llama-1b", "llama-8b", "gemma-1b", etc.) to finetune the respective model on the synthetic dataset and generate results on all dev datasets. To instead get results on the test datasets, change EVAL_DATASETS to TEST_DATASETS in the call to get_finetuned_baseline_results.
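A possible invocation, assuming the file sits under src/ and is run from the repository root (edit run_finetune_baselines() as described above before running):

uv run python src/models_under_pressure/baselines/finetune.py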
