¹University of Southern Queensland · ²University of Queensland · ³The Hong Kong Polytechnic University · ⁴University of Electronic Science and Technology of China

SVIP is a transformer-based framework designed to enhance visual-semantic alignment for zero-shot learning. Specifically, we propose a self-supervised patch selection mechanism that preemptively learns to identify semantically unrelated patches in the input space. The mechanism is trained under supervision from attention scores aggregated across all transformer layers, which estimate each patch’s semantic relevance. Since removing semantically unrelated patches from the input sequence may disrupt object structure, we replace them with learnable patch embeddings. Initialized from word embeddings, these embeddings remain semantically meaningful throughout feature extraction. Extensive experiments on ZSL benchmarks demonstrate that SVIP achieves state-of-the-art performance while providing more interpretable and semantically rich feature representations.
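To make the idea above concrete, the sketch below shows one way the patch scoring and replacement step could look in PyTorch. It is a minimal illustration under our own assumptions, not the released implementation: the module name `SemanticPatchSelector`, the linear scoring head, the top-k keep rule, the `attention_target` aggregation, and the assumption that the word-embedding initializer is already projected to the patch dimension are all placeholders.

```python
import torch
import torch.nn as nn


class SemanticPatchSelector(nn.Module):
    """Illustrative sketch: score patches and replace the least semantic ones
    with learnable embeddings initialized from word embeddings."""

    def __init__(self, embed_dim: int, num_keep: int, word_init: torch.Tensor):
        super().__init__()
        # word_init: (num_patches, embed_dim), assumed to be word embeddings
        # already projected to the patch dimension.
        assert word_init.shape[-1] == embed_dim
        self.num_keep = num_keep
        # Lightweight head predicting a semantic score per patch; trained in a
        # self-supervised way against attention aggregated across layers.
        self.score_head = nn.Linear(embed_dim, 1)
        # Learnable replacement embeddings, initialized from word embeddings so
        # they stay semantically meaningful during feature extraction.
        self.replacement = nn.Parameter(word_init.clone())

    def forward(self, patches: torch.Tensor):
        # patches: (B, N, D) patch token embeddings
        scores = self.score_head(patches).squeeze(-1)            # (B, N)
        keep_idx = scores.topk(self.num_keep, dim=1).indices      # most semantic patches
        keep_mask = torch.zeros_like(scores, dtype=torch.bool)
        keep_mask.scatter_(1, keep_idx, True)
        # Keep semantic patches; swap the rest for learnable embeddings so the
        # sequence length (and object structure) is preserved.
        out = torch.where(
            keep_mask.unsqueeze(-1),
            patches,
            self.replacement.unsqueeze(0).expand_as(patches),
        )
        return out, scores

    @staticmethod
    def attention_target(attn_maps: torch.Tensor) -> torch.Tensor:
        # attn_maps: (L, B, H, N, N) attention from all L layers and H heads.
        # Average the attention each patch receives across layers, heads, and
        # query positions as a proxy semantic score (the self-supervised target
        # for score_head); the exact aggregation here is a simplification.
        return attn_maps.mean(dim=(0, 2)).mean(dim=1)             # (B, N)
```

In the full model, the score head would be trained so that its outputs match `attention_target(...)` computed from the backbone's attention maps, and its predictions would then decide which patches are kept and which are replaced.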
- Python 3.12
- PyTorch 2.5.1

All experiments were run on a single NVIDIA RTX 3090 GPU.
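A quick way to confirm the environment matches these requirements (the expected values in the comments follow the versions and GPU listed above):

```python
import torch

print(torch.__version__)                  # expected: 2.5.1 (possibly with a CUDA suffix such as +cu124)
print(torch.cuda.is_available())          # expected: True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 3090"
```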
- Datasets: please download the datasets (CUB, AWA2, SUN) and place them in the ./data/ folder.
- Data splits and metadata: please download the info-files folder and place it in ./info-files/.
- Attribute w2v: use the scripts in ./tools to generate attribute word2vec embeddings and place them in the ./attribute/w2v folder.
- Pre-trained models: please download the pre-trained models and place them in ./pretrained_models/ (see the expected layout below).
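After these steps, the project root should look roughly like the following (dataset and checkpoint file names are placeholders):

```
SVIP/
├── data/               # CUB, AWA2, SUN datasets
├── info-files/         # data splits and metadata
├── attribute/
│   └── w2v/            # attribute word2vec files generated with ./tools
├── tools/              # scripts for generating attribute w2v
└── pretrained_models/  # downloaded pre-trained models
```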
We provide trained ZSL and GZSL model checkpoints for the three datasets:
| Dataset | ZSL Accuracy (%) | Download link | GZSL Accuracy (%) | Download link |
|---|---|---|---|---|
| CUB | 79.8 | Download | 75.0 | Download |
| AWA2 | 69.8 | Download | 74.9 | Download |
| SUN | 71.6 | Download | 50.7 | Download |
This work is released under the Apache License, Version 2.0, while some specific implementations in this codebase may be covered by other licenses.
Please refer to LICENSE.md for details, especially if you intend to use our code for commercial purposes.
If you find this work helpful for your research, please consider citing our paper:
@inproceedings{chen2025svip,
  title     = {SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning},
  author    = {Chen, Zhi and Zhao, Zecheng and Guo, Jingcai and Li, Jingjing and Huang, Zi},
  booktitle = {IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}
