# Exploring transferability of multimodal adversarial samples for vision-language pre-training models with contrastive learning

This is the official PyTorch implementation of the paper "Exploring transferability of multimodal adversarial samples for vision-language pre-training models with contrastive learning", IEEE Transactions on Multimedia (TMM), 2025.
```shell
pip install -r requirements.txt
```
- Dataset
  - Download the MSCOCO or Flickr30k datasets from their original websites.
  - Dataset json files for downstream tasks: ALBEF github
- Victim Models
- For text augmentation, round-trip translation is employed to generate diverse textual variations. For image transformation, a combination of techniques, including rotation, solarization, translation, shear, color jittering, and cropping, is applied to enhance data diversity.
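The image transformations listed above can be sketched with Pillow. This is a minimal illustration, not the repository's actual pipeline: all parameter ranges, the fixed 224-pixel crop, and the function name are assumptions, and shear is omitted for brevity.

```python
import random
from PIL import Image, ImageChops, ImageEnhance, ImageOps

def augment_image(img, seed=None):
    """Chain simple augmentations: rotation, solarization, translation,
    color jitter, and a random crop. All ranges are illustrative
    assumptions, not the paper's settings."""
    rng = random.Random(seed)
    img = img.rotate(rng.uniform(-15, 15))                        # rotation
    img = ImageOps.solarize(img, threshold=rng.randint(128, 255)) # solarize
    img = ImageChops.offset(img,                                  # translate
                            int(rng.uniform(-10, 10)),
                            int(rng.uniform(-10, 10)))
    img = ImageEnhance.Brightness(img).enhance(rng.uniform(0.8, 1.2))
    img = ImageEnhance.Contrast(img).enhance(rng.uniform(0.8, 1.2))
    w, h = img.size                                               # random crop
    left, top = rng.randint(0, w - 224), rng.randint(0, h - 224)
    return img.crop((left, top, left + 224, top + 224))

sample = Image.new("RGB", (256, 256), (120, 60, 30))
augmented = augment_image(sample, seed=0)
```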
- Run
For instance, to use multimodal adversarial examples generated on CLIP to evaluate the target models, ALBEF and TCL, in a transfer-based setting on the Flickr30k dataset:

```shell
python ./Retrieval/CLIP/eval_clip2albef_flickr.py
```
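The evaluation script consumes adversarial images crafted on a surrogate model by perturbing them against the image-text contrastive objective. The core transfer-attack idea can be sketched as an L-inf PGD loop; here random linear projections stand in for CLIP's encoders and a finite-difference gradient replaces autograd, so every name and parameter below is an illustrative assumption rather than the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the surrogate's image/text encoders (NOT real CLIP).
W_IMG = rng.standard_normal((16, 8))
W_TXT = rng.standard_normal((16, 8))

def embed(x, W):
    """Project into the shared space and L2-normalize, CLIP-style."""
    z = x @ W
    return z / np.linalg.norm(z)

def similarity(image, text):
    """Cosine similarity between the matched image/text embeddings."""
    return float(embed(image, W_IMG) @ embed(text, W_TXT))

def pgd_attack(image, text, eps=0.1, alpha=0.02, steps=10):
    """L-inf PGD that pushes the image embedding away from its matched
    text embedding (finite differences replace autograd for brevity)."""
    adv = image.copy()
    for _ in range(steps):
        base = similarity(adv, text)
        grad = np.zeros_like(adv)
        for i in range(adv.size):
            probe = adv.copy()
            probe[i] += 1e-4
            grad[i] = (similarity(probe, text) - base) / 1e-4
        adv = adv - alpha * np.sign(grad)             # descend the similarity
        adv = np.clip(adv, image - eps, image + eps)  # stay in the budget
    return adv

image = rng.standard_normal(16)
text = rng.standard_normal(16)
adv = pgd_attack(image, text)
```

In the actual attack the perturbed image is then fed to the target models (e.g. ALBEF, TCL) that never saw the surrogate's gradients, which is what the transfer-based evaluation above measures.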
For the visual entailment task, swap in the corresponding victim models and the SNLI-VE dataset to evaluate the adversarial robustness of VLP models, following the same procedure as for image-text retrieval.
If you find this code useful for your research, please consider citing:
```bibtex
@article{wang2023exploring,
  title={Exploring transferability of multimodal adversarial samples for vision-language pre-training models with contrastive learning},
  author={Wang, Youze and Hu, Wenbo and Dong, Yinpeng and Zhang, Hanwang and Su, Hang and Hong, Richang},
  journal={IEEE Transactions on Multimedia},
  year={2025}
}
```