VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs

📄 Paper | 🤗 VisionSelector-Qwen2.5-VL-3B | 🤗 VisionSelector-Qwen2.5-VL-7B | 🤗 VisionSelector-LLaVA-OV-1.5-8B

Jiaying Zhu¹, Yurui Zhu*², Xin Lu¹, Wenrui Yan², Dong Li¹, Kunlin Liu², Xueyang Fu*¹, Zheng-Jun Zha¹

¹University of Science and Technology of China, ²ZTE Corporation

*Equal Advising

πŸ“ To do list

  • Release training code
  • Release evaluation code
  • Release comparison method code
  • Release model weights
  • Release inference code

👀 Overview

We introduce VisionSelector, an end-to-end learnable framework that recasts visual token compression as an optimization-driven decision process. VisionSelector integrates seamlessly into existing MLLMs without modifying the backbone, enabling adaptive compression with superior efficiency.

Our key technical innovations include:

  • A Differentiable Top-K Selection Mechanism that preserves end-to-end gradient flow while remaining fully compatible with high-performance acceleration kernels such as FlashAttention (a minimal sketch follows this list).
  • A Curriculum Annealing Strategy with a composite loss, which effectively bridges the performance gap between soft training selection and hard inference selection.
  • A backbone-decoupled Learnable Importance Scorer (LIS) that enables models, trained at a single compression rate, to robustly generalize to various compression budgets during inference.
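
For intuition, here is a minimal, hypothetical sketch of how a learnable scorer and a soft top-k gate could fit together. It is not the authors' implementation; the names (LearnableScorer, soft_topk_mask, tau), shapes, and the sigmoid-threshold relaxation are illustrative assumptions.

# Hypothetical sketch only; not the repository's actual LIS or training loss.
import torch
import torch.nn as nn

class LearnableScorer(nn.Module):
    """Backbone-decoupled head that assigns an importance score per visual token."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (num_tokens, hidden_dim) -> scores: (num_tokens,)
        return self.head(tokens).squeeze(-1)

def soft_topk_mask(scores: torch.Tensor, k: int, tau: float) -> torch.Tensor:
    # The k-th largest score acts as a soft threshold; annealing tau -> 0
    # during training sharpens the sigmoid gate toward the hard 0/1 mask
    # used at inference (plain token indexing, hence FlashAttention-friendly).
    threshold = scores.topk(k).values[-1].detach()
    return torch.sigmoid((scores - threshold) / tau)

tokens = torch.randn(576, 1024)                # one image's visual tokens
scorer = LearnableScorer(1024)
scores = scorer(tokens)
mask = soft_topk_mask(scores, k=115, tau=0.5)  # ~20% budget, soft (training)
compressed = tokens * mask.unsqueeze(-1)       # gradients reach the scorer
keep = scores.topk(57).indices                 # ~10% budget, hard (inference)

Because only the scorer is trained, the same scores can be thresholded at any budget k at inference, which is the intuition behind training at a single compression rate yet generalizing to other budgets.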

VisionSelector is highly efficient, requiring only 12.85M trainable parameters. It achieves substantial performance-efficiency advancements: a 12.14% performance gain at 10% token retention, and a 1.73× prefill acceleration (with 86.08% memory reduction) at 20% retention. VisionSelector consistently outperforms state-of-the-art baselines across 13 image and video understanding benchmarks.

[Figure: overview of our approach]

💾 Dataset Preparation and Configuration

To reproduce our results and train the VisionSelector module, you need to download and configure the following datasets from the nyu-visionx/Cambrian-10M repository.

1. Dataset Downloading

Please download the following datasets and annotations and place them under the datasets/ folder.

| Dataset | Size  | Download Link (Hugging Face) |
| ------- | ----- | ---------------------------- |
| OCR_VQA | ~80K  | ocr_vqa.tar.gz               |
| ChartQA | ~28K  | chartqa.tar.gz               |
| TextVQA | ~21K  | textvqa.tar.gz               |
| COCO    | ~364K | coco.tar.gz                  |

| Annotation   | Download Link (Hugging Face) |
| ------------ | ---------------------------- |
| Cambrian737K | Cambrian737k.jsonl           |

Generating Dataset Annotations:

The merged annotation file (Cambrian737k.jsonl) needs to be split into individual JSONL files for each target dataset (OCR_VQA, ChartQA, TextVQA, COCO) to match the required directory structure.

Execute the following commands from the project root to perform this filtering and splitting using the filter_json.py script:

cd VisionSelector
python datasets/filter_json.py
python datasets/sample_merge_json_llavaov.py # for llava-ov-1.5
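
For reference, the split amounts to filtering the merged annotation file by dataset. Below is a minimal, hypothetical equivalent; the repository's filter_json.py is authoritative, and the assumption that each record's image field begins with its dataset folder is ours.

# Hypothetical sketch of the annotation split; prefer datasets/filter_json.py.
import json

# Map each dataset's image-path prefix to its output annotation file.
SPLITS = {
    "ocr_vqa": "datasets/ocr_vqa_cambrian.jsonl",
    "chartqa": "datasets/chartqa_cambrian.jsonl",
    "textvqa": "datasets/textvqa_cambrian.jsonl",
    "coco": "datasets/coco_cambrian.jsonl",
}

with open("datasets/Cambrian737k.jsonl") as f:
    records = [json.loads(line) for line in f]

for prefix, out_path in SPLITS.items():
    # Assumption: each record's "image" field begins with its dataset folder.
    subset = [r for r in records if r.get("image", "").startswith(prefix)]
    with open(out_path, "w") as out:
        for r in subset:
            out.write(json.dumps(r) + "\n")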

Required Directory Structure:

After downloading and extracting, your project directory should contain the following structure:

VisionSelector/
└── datasets/
    β”œβ”€β”€ ocr_vqa/
    β”œβ”€β”€ ocr_vqa_cambrian.jsonl
    β”œβ”€β”€ chartqa/
    β”œβ”€β”€ chartqa_cambrian.jsonl
    β”œβ”€β”€ textvqa/
    β”œβ”€β”€ textvqa_cambrian.jsonl
    β”œβ”€β”€ coco/
    β”œβ”€β”€ coco_cambrian.jsonl
    └── textvqa_ocrvqa_cambrian.jsonl

2. Dataset Configuration for Training

The data paths and annotation paths for training are defined in VisionSelector/qwen-vl-finetune/qwenvl/data/__init__.py.

DATASET_NAME = {
    "annotation_path": "/path/to/annotations.json",
    "data_path": "/path/to/image/data",
}
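
For example, given the directory layout above and assuming paths relative to the project root, a concrete OCR_VQA entry would plausibly look like this (the dictionary name is illustrative):

# Illustrative entry; check qwenvl/data/__init__.py for the exact names.
OCR_VQA = {
    "annotation_path": "datasets/ocr_vqa_cambrian.jsonl",
    "data_path": "datasets/ocr_vqa/",
}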

🔧 Installation - Qwen2.5-VL

1. Environment and Basic Dependencies

We recommend setting up a dedicated Conda environment.

conda create -n visionselector python=3.10
conda activate visionselector

2. Package Installation

Navigate to your project root directory (VisionSelector) and install the required packages.

cd VisionSelector
pip install "qwen-vl-utils[decord]"
pip install -r requirements.txt
pip install transformers==4.50.0

Recommended Package Versions:

For optimal compatibility, we recommend the following package versions:

| Package      | Recommended Version |
| ------------ | ------------------- |
| torch        | 2.6.0               |
| torchvision  | 0.21.0               |
| transformers | 4.50.0              |
| deepspeed    | 0.16.4              |
| flash_attn   | 2.7.4.post1         |
| triton       | 3.2.0               |
| accelerate   | 1.4.0               |
| torchcodec   | 0.2                 |

🚀 Quick Start - Qwen2.5-VL

1. Training

To train VisionSelector (e.g., integrated with Qwen2.5-VL-7B) to select crucial tokens, execute the corresponding script in the qwen-vl-finetune directory.

cd VisionSelector/qwen-vl-finetune
bash scripts/sft_7b.sh # for VisionSelector-Qwen2.5-VL-7B
bash scripts/sft_3b.sh # for VisionSelector-Qwen2.5-VL-3B
bash scripts/sft_dynamic.sh # Reimplementation of Dynamic-LLaVA's image predictor on Qwen2.5-VL (Dynamic-Qwen)

2. Evaluation, Inference and Visualizations

We utilize lmms-eval for comprehensive benchmarking.

Setup lmms-eval

First, set up the evaluation environment:

cd VisionSelector/lmms-eval
pip install -e .
cd ../qwen-evaluation

Run Evaluations

Use the provided scripts to evaluate VisionSelector against comparison methods and generate visualizations.

| Command                       | Purpose                                                               |
| ----------------------------- | --------------------------------------------------------------------- |
| bash run_token_compression.sh | Evaluate the original model and comparison token compression methods   |
| bash run_selector.sh          | Evaluate the VisionSelector method                                    |
| bash run_dynamic_qwen.sh      | Evaluate the Dynamic-Qwen method                                      |

To capture max GPU memory, prefill time, latency, and the number of visual tokens, set the EVAL_TIME environment variable when launching an evaluation script:

EVAL_TIME=True bash run_selector.sh

Run Inference

You can run inference with VisionSelector and the comparison token pruning methods (VisionZip, DivPrune) using this script:

bash run_inference.sh

Visualizations

To generate activation heatmaps and token pruning visualizations:

bash run_visual.sh # for VisionZip, DivPrune and VisionSelector

🔧 Installation - LLaVA-OV-1.5

1. Environment, Basic Dependencies and Package Installation

To ensure compatibility with the LLaVA-OneVision-1.5 framework, first activate the visionselector environment created earlier, then switch the transformers version as follows:

conda activate visionselector
pip uninstall transformers
pip install transformers==4.53.1

🚀 Quick Start - LLaVA-OV-1.5

1. Training

To train VisionSelector integrated with LLaVA-OneVision-1.5 to select crucial tokens, execute the script in the llava-ov-15 directory.

cd VisionSelector/llava-ov-15
bash scripts/finetune_selector_8b.sh # for LLaVA-OneVision-1.5

2. Evaluation

cd llava-ov-15
bash run_ov_token_compression.sh # for the original model and comparison methods
bash run_ov_selector.sh # for VisionSelector

3. Inference

bash run_ov_inference.sh

Citation

If this work contributes to your research, please cite:

@misc{zhu2025visionselectorendtoendlearnablevisual,
      title={VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs}, 
      author={Jiaying Zhu and Yurui Zhu and Xin Lu and Wenrui Yan and Dong Li and Kunlin Liu and Xueyang Fu and Zheng-Jun Zha},
      year={2025},
      eprint={2510.16598},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.16598}, 
}

Acknowledgement

This work builds upon the foundational contributions of several excellent open-source projects, including Qwen2.5-VL, LLaVA-OneVision-1.5, lmms-eval, and Cambrian-10M. We express our sincere gratitude to their developers; these resources were instrumental in the development and evaluation of VisionSelector.
