¹University of Southern Queensland · ²University of Queensland · ³The Hong Kong Polytechnic University · ⁴University of Electronic Science and Technology of China

SVIP is a transformer-based framework designed to enhance visual-semantic alignment for zero-shot learning. Specifically, we propose a self-supervised patch selection mechanism that preemptively learns to identify semantically unrelated patches in the input space. The mechanism is trained under supervision from attention scores aggregated across all transformer layers, which estimate each patch’s semantic relevance. Since removing semantically unrelated patches from the input sequence may disrupt object structure, we replace them with learnable patch embeddings. Initialized from word embeddings, these embeddings remain semantically meaningful throughout feature extraction. Extensive experiments on ZSL benchmarks demonstrate that SVIP achieves state-of-the-art performance while providing more interpretable and semantically rich feature representations.
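To make the idea above concrete, the sketch below shows one way the patch scoring and replacement step could look in PyTorch. It is a minimal illustration under our own assumptions, not the released implementation: the module name `SemanticPatchSelector`, the linear scoring head, the top-k keep rule, the `attention_target` aggregation, and the assumption that the word-embedding initializer is already projected to the patch dimension are all placeholders.

```python
import torch
import torch.nn as nn


class SemanticPatchSelector(nn.Module):
    """Illustrative sketch: score patches and replace the least semantic ones
    with learnable embeddings initialized from word embeddings."""

    def __init__(self, embed_dim: int, num_keep: int, word_init: torch.Tensor):
        super().__init__()
        # word_init: (num_patches, embed_dim), assumed to be word embeddings
        # already projected to the patch dimension.
        assert word_init.shape[-1] == embed_dim
        self.num_keep = num_keep
        # Lightweight head predicting a semantic score per patch; trained in a
        # self-supervised way against attention aggregated across layers.
        self.score_head = nn.Linear(embed_dim, 1)
        # Learnable replacement embeddings, initialized from word embeddings so
        # they stay semantically meaningful during feature extraction.
        self.replacement = nn.Parameter(word_init.clone())

    def forward(self, patches: torch.Tensor):
        # patches: (B, N, D) patch token embeddings
        scores = self.score_head(patches).squeeze(-1)            # (B, N)
        keep_idx = scores.topk(self.num_keep, dim=1).indices      # most semantic patches
        keep_mask = torch.zeros_like(scores, dtype=torch.bool)
        keep_mask.scatter_(1, keep_idx, True)
        # Keep semantic patches; swap the rest for learnable embeddings so the
        # sequence length (and object structure) is preserved.
        out = torch.where(
            keep_mask.unsqueeze(-1),
            patches,
            self.replacement.unsqueeze(0).expand_as(patches),
        )
        return out, scores

    @staticmethod
    def attention_target(attn_maps: torch.Tensor) -> torch.Tensor:
        # attn_maps: (L, B, H, N, N) attention from all L layers and H heads.
        # Average the attention each patch receives across layers, heads, and
        # query positions as a proxy semantic score (the self-supervised target
        # for score_head); the exact aggregation here is a simplification.
        return attn_maps.mean(dim=(0, 2)).mean(dim=1)             # (B, N)
```

In the full model, the score head would be trained so that its outputs match `attention_target(...)` computed from the backbone's attention maps, and its predictions would then decide which patches are kept and which are replaced.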
- Python 3.12
- PyTorch 2.5.1

All experiments were run on a single NVIDIA RTX 3090 GPU.
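A quick way to confirm the environment matches these requirements (the expected values in the comments follow the versions and GPU listed above):

```python
import torch

print(torch.__version__)                  # expected: 2.5.1 (possibly with a CUDA suffix such as +cu124)
print(torch.cuda.is_available())          # expected: True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 3090"
```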
- Datasets: please download the datasets (CUB, AWA2, SUN) and place them in the ./data/ folder.
- Data splits and metadata: please download the info-files folder and place it in ./info-files/.
- Attribute w2v: use the scripts in ./tools to generate attribute word2vec embeddings and place them in the ./attribute/w2v folder.
- Pre-trained models: please download the pre-trained models and place them in ./pretrained_models/ (see the expected layout below).
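After these steps, the project root should look roughly like the following (dataset and checkpoint file names are placeholders):

```
SVIP/
├── data/               # CUB, AWA2, SUN datasets
├── info-files/         # data splits and metadata
├── attribute/
│   └── w2v/            # attribute word2vec files generated with ./tools
├── tools/              # scripts for generating attribute w2v
└── pretrained_models/  # downloaded pre-trained models
```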
We provide trained ZSL and GZSL model checkpoints for the three datasets:
| Dataset | ZSL Accuracy (%) | Download link | GZSL Accuracy (%) | Download link |
|---|---|---|---|---|
| CUB | 79.8 | Download | 75.0 | Download |
| AWA2 | 69.8 | Download | 74.9 | Download |
| SUN | 71.6 | Download | 50.7 | Download |
This work is released under the Apache License, Version 2.0, while some specific implementations in this codebase may be covered by other licenses.
Please refer to LICENSE.md for details, especially if you intend to use our code for commercial purposes.
If you find this work helpful for your research, please consider citing our paper:
@inproceedings{chen2025svip,
  title     = {SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning},
  author    = {Chen, Zhi and Zhao, Zecheng and Guo, Jingcai and Li, Jingjing and Huang, Zi},
  booktitle = {IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}
