
Exploring transferability of multimodal adversarial samples for vision-language pre-training models with contrastive learning

This is the official PyTorch implementation of the paper "Exploring transferability of multimodal adversarial samples for vision-language pre-training models with contrastive learning", published in IEEE Transactions on Multimedia (TMM), 2025.

Dependencies

pip install -r requirment.txt

Usage

Evaluation

Image-Text Retrieval Task

  1. Download the MSCOCO or Flickr30k dataset from its original website.
  2. For text augmentation, round-trip translation is used to generate diverse textual variations. For image transformation, a combination of techniques, including rotation, polarization, translation, shear, color jittering, and cropping, is applied to enhance data diversity.
  3. Run
# For instance, to evaluate multimodal adversarial examples generated on CLIP against the target models ALBEF and TCL in a transfer-based setting on the Flickr30k dataset:

python ./Retrieval/CLIP/eval_clip2albef_flickr.py
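The image transformations in step 2 can be sketched as a small augmentation routine. This is a minimal, PIL-only illustration; the parameter ranges (rotation angle, jitter strength, crop ratio, output size) are assumptions for demonstration, not the exact settings used in the paper or this repository.

```python
import random
from PIL import Image, ImageEnhance

def augment(img, size=224, seed=None):
    """Apply a random rotation, brightness jitter, and crop, then resize.

    A hedged sketch of the image transformations described above
    (rotation, color jittering, cropping); parameter ranges are
    illustrative assumptions only.
    """
    rng = random.Random(seed)
    # Random rotation within +/- 15 degrees
    img = img.rotate(rng.uniform(-15, 15))
    # Brightness jitter in [0.8, 1.2]
    img = ImageEnhance.Brightness(img).enhance(rng.uniform(0.8, 1.2))
    # Random 90% crop, then resize to the model input size
    w, h = img.size
    cw, ch = int(w * 0.9), int(h * 0.9)
    left = rng.randint(0, w - cw)
    top = rng.randint(0, h - ch)
    img = img.crop((left, top, left + cw, top + ch))
    return img.resize((size, size))

out = augment(Image.new("RGB", (256, 256), "gray"), seed=0)
print(out.size)  # (224, 224)
```

In practice, applying several independently sampled transformations per image (as the paper's list suggests) yields a set of augmented views rather than a single one.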

Visual Entailment Task

For the visual entailment task, replace the victim models and dataset with the corresponding ones for SNLI-VE; the evaluation procedure is otherwise the same as for image-text retrieval.

Citation

If you find this code useful for your research, please consider citing:

@article{wang2023exploring,
  title={Exploring transferability of multimodal adversarial samples for vision-language pre-training models with contrastive learning},
  author={Wang, Youze and Hu, Wenbo and Dong, Yinpeng and Zhang, Hanwang and Su, Hang and Hong, Richang},
  journal={IEEE Transactions on Multimedia},
  year={2025}
}
