📄 Paper | 🤗 VisionSelector-Qwen2.5-VL-3B 🤗 VisionSelector-Qwen2.5-VL-7B 🤗 VisionSelector-LLaVA-OV-1.5-8B
Jiaying Zhu1, Yurui Zhu*2, Xin Lu1, Wenrui Yan2, Dong Li1, Kunlin Liu2, Xueyang Fu*1, Zheng-Jun Zha1
1University of Science and Technology of China, 2ZTE Corporation
*Equal Advising
- Release training code
- Release evaluation code
- Release comparison method code
- Release model weights
- Release inference code
We introduce VisionSelector, a novel, end-to-end learnable framework that recasts visual token compression as an optimization-driven decision process. VisionSelector integrates seamlessly into existing MLLMs without modifying the backbone, achieving adaptive compression and superior efficiency.
Our key technical innovations include:
- A Differentiable Top-K Selection Mechanism that ensures end-to-end gradient flow while maintaining full compatibility with high-performance acceleration kernels like FlashAttention.
- A Curriculum Annealing Strategy with a composite loss, which effectively bridges the performance gap between soft training selection and hard inference selection.
- A backbone-decoupled Learnable Importance Scorer (LIS) that enables models, trained at a single compression rate, to robustly generalize to various compression budgets during inference.
VisionSelector is highly efficient, requiring only 12.85M trainable parameters. It achieves substantial performance-efficiency advancements: a 12.14% performance gain at 10% token retention, and a 1.73× prefill acceleration (with 86.08% memory reduction) at 20% retention. VisionSelector consistently outperforms state-of-the-art baselines across 13 image and video understanding benchmarks.
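To make the components above concrete, here is a minimal PyTorch sketch of how a backbone-decoupled importance scorer and a straight-through (differentiable) Top-K mask can fit together. It is an illustration under simplifying assumptions, not the released implementation: the class name, the single-linear scorer, the `keep_ratio` argument, and the mask-then-multiply behavior (rather than actually pruning tokens) are all assumptions.

```python
# Illustrative sketch only: a tiny learnable scorer over visual tokens with a
# straight-through Top-K mask, so the hard selection is used in the forward
# pass while gradients reach the scorer through a soft relaxation.
import torch
import torch.nn as nn


class LearnableImportanceScorer(nn.Module):
    def __init__(self, hidden_dim: int, keep_ratio: float = 0.1):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)  # lightweight, backbone-decoupled head
        self.keep_ratio = keep_ratio            # compression budget; can differ at inference

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, hidden_dim)
        scores = self.scorer(visual_tokens).squeeze(-1)            # (B, N) importance scores
        k = max(1, int(self.keep_ratio * scores.shape[-1]))
        soft = torch.softmax(scores, dim=-1)                       # differentiable relaxation
        topk_idx = scores.topk(k, dim=-1).indices                  # hard Top-K selection
        hard = torch.zeros_like(soft).scatter(-1, topk_idx, 1.0)   # 0/1 keep mask
        # Straight-through: forward uses the hard mask, backward follows the soft scores.
        mask = hard + soft - soft.detach()
        # A real compressor would gather only the kept tokens; masking keeps the sketch short.
        return visual_tokens * mask.unsqueeze(-1)


# Toy usage: keep ~10% of 1,024 visual tokens with hidden size 2,048.
tokens = torch.randn(2, 1024, 2048)
selector = LearnableImportanceScorer(hidden_dim=2048, keep_ratio=0.1)
compressed = selector(tokens)
```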
To reproduce our results and train the VisionSelector module, you need to download and configure the following datasets from the nyu-visionx/Cambrian-10M repository.
Please download the following datasets and annotations and place them under the datasets/ folder.
| Dataset | Download Link (Hugging Face) |
|---|---|
| OCR_VQA | ocr_vqa.tar.gz |
| ChartQA | chartqa.tar.gz |
| TextVQA | textvqa.tar.gz |
| COCO | coco.tar.gz |
| Annotation | Download Link (Hugging Face) |
|---|---|
| Cambrian737K | Cambrian737k.jsonl |
The large annotation file (Cambrian737k.jsonl) needs to be split into individual JSONL files for each target dataset (OCR_VQA, ChartQA, TextVQA, COCO) to match the required directory structure.
Execute the following commands from the project root to perform this filtering and splitting using the filter_json.py script:
```bash
cd VisionSelector
python datasets/filter_json.py
python datasets/sample_merge_json_llavaov.py  # for llava-ov-1.5
```
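For intuition, the split is conceptually similar to the sketch below, which routes each annotation record into a per-dataset JSONL based on its image path. The `image` field name, the prefix matching, and the output locations are assumptions made for illustration; `datasets/filter_json.py` in the repository is the authoritative version.

```python
# Hypothetical sketch of the per-dataset split (see datasets/filter_json.py for the real logic).
import json

OUTPUTS = {
    "ocr_vqa": "datasets/ocr_vqa/ocr_vqa_cambrian.jsonl",
    "chartqa": "datasets/chartqa/chartqa_cambrian.jsonl",
    "textvqa": "datasets/textvqa/textvqa_cambrian.jsonl",
    "coco": "datasets/coco/coco_cambrian.jsonl",
}

writers = {name: open(path, "w") for name, path in OUTPUTS.items()}
with open("datasets/Cambrian737k.jsonl") as f:
    for line in f:
        record = json.loads(line)
        image = record.get("image", "") or ""    # assumed field holding a relative image path
        for name, writer in writers.items():
            if image.startswith(name):           # e.g. "ocr_vqa/xxxx.jpg"
                writer.write(line)
                break
for writer in writers.values():
    writer.close()
```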
After downloading and extracting, your project directory should contain the following structure:

```
VisionSelector/
└── datasets/
    ├── ocr_vqa/
    │   └── ocr_vqa_cambrian.jsonl
    ├── chartqa/
    │   └── chartqa_cambrian.jsonl
    ├── textvqa/
    │   └── textvqa_cambrian.jsonl
    ├── coco/
    │   └── coco_cambrian.jsonl
    └── textvqa_ocrvqa_cambrian.jsonl
```

The data paths and annotation paths for training are defined in `VisionSelector/qwen-vl-finetune/qwenvl/data/__init__.py`:
```python
DATASET_NAME = {
    "annotation_path": "/path/to/annotations.json",
    "data_path": "/path/to/image/data",
}
```
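For instance, an entry for the OCR_VQA split prepared above might look like the following; the dictionary name and paths are placeholders to be adapted to your local layout.

```python
# Hypothetical entry in qwen-vl-finetune/qwenvl/data/__init__.py
OCR_VQA_CAMBRIAN = {
    "annotation_path": "datasets/ocr_vqa/ocr_vqa_cambrian.jsonl",
    "data_path": "datasets/ocr_vqa",
}
```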
We recommend setting up a dedicated Conda environment.

```bash
conda create -n visionselector python=3.10
conda activate visionselector
```

Navigate to your project root directory (VisionSelector) and install the required packages.
```bash
cd VisionSelector
pip install qwen-vl-utils[decord]
pip install -r requirements.txt
pip install transformers==4.50.0
```

For optimal compatibility, we recommend the following package versions:
| Package | Recommended Version |
|---|---|
| `torch` | 2.6.0 |
| `torchvision` | 0.21.0 |
| `transformers` | 4.50.0 |
| `deepspeed` | 0.16.4 |
| `flash_attn` | 2.7.4.post1 |
| `triton` | 3.2.0 |
| `accelerate` | 1.4.0 |
| `torchcodec` | 0.2 |
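If you prefer to pin these versions explicitly, an install sequence along the following lines should work; this is a sketch rather than a tested recipe, since CUDA wheel selection varies by system and `flash_attn` generally needs `torch` installed first.

```bash
# Optional: pin the recommended versions (adjust CUDA wheels to your system)
pip install torch==2.6.0 torchvision==0.21.0
pip install transformers==4.50.0 deepspeed==0.16.4 accelerate==1.4.0 triton==3.2.0 torchcodec==0.2
pip install flash_attn==2.7.4.post1 --no-build-isolation
```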
To train the VisionSelector (e.g., integrated with Qwen2.5-VL-7B) to learn crucial token selection, execute the script in the qwen-vl-finetune directory.
```bash
cd VisionSelector/qwen-vl-finetune
bash scripts/sft_7b.sh       # for VisionSelector-Qwen2.5-VL-7B
bash scripts/sft_3b.sh       # for VisionSelector-Qwen2.5-VL-3B
bash scripts/sft_dynamic.sh  # reimplementation of Dynamic-LLaVA's image predictor on Qwen2.5-VL (Dynamic-Qwen)
```

We utilize lmms-eval for comprehensive benchmarking.
First, set up the evaluation environment:
```bash
cd VisionSelector/lmms-eval
pip install -e .
cd ../qwen-evaluation
```

Use the provided scripts to evaluate VisionSelector against comparison methods and generate visualizations.
| Command | Purpose |
|---|---|
| `bash run_token_compression.sh` | Evaluation for the original model and comparison token compression methods |
| `bash run_selector.sh` | Evaluation for the VisionSelector method |
| `bash run_dynamic_qwen.sh` | Evaluation for the Dynamic-Qwen method |
To capture Max GPU Memory, Prefill Time, Latency Time and Number of visual tokens, set the following environment variable before running the evaluation script:
```bash
EVAL_TIME=True
```
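For example, to profile a VisionSelector evaluation run:

```bash
# Enable efficiency profiling, then launch the evaluation
export EVAL_TIME=True
bash run_selector.sh
```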
You can test different token pruning methods (VisionZip, DivPrune) and VisionSelector inference with this script:

```bash
bash run_inference.sh
```

To generate activation heatmaps and token pruning visualizations:
```bash
bash run_visual.sh  # for VisionZip, DivPrune and VisionSelector
```

To ensure compatibility with the LLaVA-OneVision-1.5 framework, activate the pre-created visionselector environment first and then adjust the transformers package version as follows:
```bash
conda activate visionselector
pip uninstall transformers
pip install transformers==4.53.1
```

To train the VisionSelector (e.g., integrated with LLaVA-OneVision-1.5) to learn crucial token selection, execute the script in the llava-ov-15 directory.
```bash
cd VisionSelector/llava-ov-15
bash scripts/finetune_selector_8b.sh  # for LLaVA-OneVision-1.5
```
For evaluation:

```bash
cd llava-ov-15
bash run_ov_token_compression.sh  # for the original model and comparison methods
bash run_ov_selector.sh           # for VisionSelector
```

For inference:

```bash
bash run_ov_inference.sh
```

If this work contributes to your research, please cite:
```bibtex
@misc{zhu2025visionselectorendtoendlearnablevisual,
      title={VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs},
      author={Jiaying Zhu and Yurui Zhu and Xin Lu and Wenrui Yan and Dong Li and Kunlin Liu and Xueyang Fu and Zheng-Jun Zha},
      year={2025},
      eprint={2510.16598},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.16598},
}
```
This work is built upon the foundational contributions of several excellent open-source projects. We express our sincere gratitude to the developers of the following resources, which were instrumental in the development and evaluation of VisionSelector:
- Foundational Platforms: Qwen2.5-VL, LLaVA-OneVision-1.5, EffiVLM-Bench, and Lmms-Eval.
- Inspirational Methods: We also gratefully acknowledge the valuable insights and prior work provided by FastV, PruMerge+, VisionZip, DART, DivPrune, Dynamic-LLaVA and Differentiable Top-K.
