This is the official code repository for the paper BackdoorAlign: Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment.
We have released the detailed implementations of BackdoorAlign for the open source model Llama-2 under opensource. A demo example for GPT-3.5 experiments through OpenAI API is shown under openai_api.
To implement the opensource version of BackdoorAlign, please refer to the following instructions to build the conda environment:
conda create -n backdooralign python==3.9
conda activate backdooralign
conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
Please use the provided scripts in opensource/scripts to replicate the experiments in various settings with the Attack Success Rate computed. The attacked models are saved in opensource/finetuned_models and you can find the model generations in opensource/safety_evaluation/question_output.
bash run_fjattack.sh # Fine-tuning based Jailbreak Attack
bash run_baseline.sh # Baseline Defense
bash run_backdooralign.sh # BackdoorAlign
Compute the Harmfulness Score with GPT-4 on the generation results with the Python script opensource/safety_evaluation/gpt4_eval.py. Remember to add your OpenAI API key in the script.
python gpt4_eval.py --input_file question_output/YOUR_RESULTS
We evaluate the model accuracy of ARC-Challenge and MMLU with the Language Model Evaluation Harness repository. MT-Bench Score is evaluated with the LLM Judge under the FastChat repository. Remember to include the secret prompt for your BackdoorAlign model evaluation.
We provide a tutorial for implementing BackdoorAlign on GPT-3.5 with OpenAI API in openai_api/BackdoorAlign_demo.ipynb.
Please cite the following preprint when referencing our paper:
@inproceedings{wang2024backdooralign,
title={BackdoorAlign: Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment},
author={Wang, Jiongxiao and Li, Jiazhao and Li, Yiquan and Qi, Xiangyu and Hu, Junjie and Li, Yixuan and McDaniel, Patrick and Chen, Muhao and Li, Bo and Xiao, Chaowei},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024}
}