# Exploring transferability of multimodal adversarial samples for vision-language pre-training models with contrastive learning

This is the official PyTorch implementation of the paper "Exploring transferability of multimodal adversarial samples for vision-language pre-training models with contrastive learning", IEEE Transactions on Multimedia (TMM), 2025.
```shell
pip install -r requirements.txt
```
- Dataset
  - Download the MSCOCO or Flickr30k datasets from their original websites.
  - Dataset json files for downstream tasks: ALBEF github
- Victim Models
- For text augmentation, round-trip translation is employed to generate diverse textual variations. For image transformation, a combination of techniques, including rotation, solarization, translation, shear, color jittering, and cropping, is applied to enhance data diversity.
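The image transformations listed above can be sketched with Pillow. This is a minimal illustration, not the repository's actual pipeline: all parameter ranges, the fixed 224-pixel crop, and the function name are assumptions, and shear is omitted for brevity.

```python
import random
from PIL import Image, ImageChops, ImageEnhance, ImageOps

def augment_image(img, seed=None):
    """Chain simple augmentations: rotation, solarization, translation,
    color jitter, and a random crop. All ranges are illustrative
    assumptions, not the paper's settings."""
    rng = random.Random(seed)
    img = img.rotate(rng.uniform(-15, 15))                        # rotation
    img = ImageOps.solarize(img, threshold=rng.randint(128, 255)) # solarize
    img = ImageChops.offset(img,                                  # translate
                            int(rng.uniform(-10, 10)),
                            int(rng.uniform(-10, 10)))
    img = ImageEnhance.Brightness(img).enhance(rng.uniform(0.8, 1.2))
    img = ImageEnhance.Contrast(img).enhance(rng.uniform(0.8, 1.2))
    w, h = img.size                                               # random crop
    left, top = rng.randint(0, w - 224), rng.randint(0, h - 224)
    return img.crop((left, top, left + 224, top + 224))

sample = Image.new("RGB", (256, 256), (120, 60, 30))
augmented = augment_image(sample, seed=0)
```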
- Run
For instance, to use multimodal adversarial examples generated on CLIP to evaluate the target models, ALBEF and TCL, in a transfer-based setting on the Flickr30k dataset:

```shell
python ./Retrieval/CLIP/eval_clip2albef_flickr.py
```
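The evaluation script consumes adversarial images crafted on a surrogate model by perturbing them against the image-text contrastive objective. The core transfer-attack idea can be sketched as an L-inf PGD loop; here random linear projections stand in for CLIP's encoders and a finite-difference gradient replaces autograd, so every name and parameter below is an illustrative assumption rather than the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the surrogate's image/text encoders (NOT real CLIP).
W_IMG = rng.standard_normal((16, 8))
W_TXT = rng.standard_normal((16, 8))

def embed(x, W):
    """Project into the shared space and L2-normalize, CLIP-style."""
    z = x @ W
    return z / np.linalg.norm(z)

def similarity(image, text):
    """Cosine similarity between the matched image/text embeddings."""
    return float(embed(image, W_IMG) @ embed(text, W_TXT))

def pgd_attack(image, text, eps=0.1, alpha=0.02, steps=10):
    """L-inf PGD that pushes the image embedding away from its matched
    text embedding (finite differences replace autograd for brevity)."""
    adv = image.copy()
    for _ in range(steps):
        base = similarity(adv, text)
        grad = np.zeros_like(adv)
        for i in range(adv.size):
            probe = adv.copy()
            probe[i] += 1e-4
            grad[i] = (similarity(probe, text) - base) / 1e-4
        adv = adv - alpha * np.sign(grad)             # descend the similarity
        adv = np.clip(adv, image - eps, image + eps)  # stay in the budget
    return adv

image = rng.standard_normal(16)
text = rng.standard_normal(16)
adv = pgd_attack(image, text)
```

In the actual attack the perturbed image is then fed to the target models (e.g. ALBEF, TCL) that never saw the surrogate's gradients, which is what the transfer-based evaluation above measures.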
For the visual entailment task, swap in the corresponding victim models and the SNLI-VE dataset to evaluate the adversarial robustness of VLP models, following the same procedure as for image-text retrieval.
If you find this code useful for your research, please consider citing:
```bibtex
@article{wang2023exploring,
  title={Exploring transferability of multimodal adversarial samples for vision-language pre-training models with contrastive learning},
  author={Wang, Youze and Hu, Wenbo and Dong, Yinpeng and Zhang, Hanwang and Su, Hang and Hong, Richang},
  journal={IEEE Transactions on Multimedia},
  year={2025}
}
```