[CVPR 2026 Highlight] AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

Release

  • [2026.04.09] AdaptVision has been selected as a CVPR 2026 Highlight paper.
  • [2026.02.21] AdaptVision is accepted by CVPR 2026.
  • [2025.11.18] 🔥 AdaptVision is coming! We release the project page, paper, code and models.

Demo

The full runnable notebook is available in cookbooks/adaptvision.ipynb.

Question: Is there a stop sign facing us?

Figure: the downsampled global view, and the high-resolution local crop that AdaptVision requests from it.

Global view -> local zoom -> final answer: Yes, there is a stop sign facing us.

# Equivalent to the example in cookbooks/adaptvision.ipynb
bot = AdaptVision(model_path="AdaptVision/AdaptVision-7B")
result = bot.run("assets/test_img2.png", "Is there a stop sign facing us?")
show_result(result)
--- Round 1 ---
<think>...I need to zoom in on that area.</think>
<tool_call>{"name": "request_local_region", "arguments": {"bbox_2d": [418, 189, 440, 214]}}</tool_call>

--- Round 2 ---
<answer>Yes, there is a stop sign facing us.</answer>

AdaptVision first reasons over a downsampled global image, then requests a high-resolution local crop before producing the final answer. This active-vision loop helps preserve efficiency while recovering small but decisive details.
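The tool-call format shown in the demo output can be handled with a few lines of standard-library code. The sketch below is illustrative only (the helper names `parse_local_region` and `to_full_resolution` are our own; the actual implementation lives in cookbooks/adaptvision.ipynb): it extracts the requested bbox from a model turn and rescales it from the downsampled global view back to full-resolution pixel coordinates.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse_local_region(turn: str):
    """Return the requested [x1, y1, x2, y2] bbox, or None if the turn is a final answer."""
    m = TOOL_CALL_RE.search(turn)
    if m is None:
        return None
    call = json.loads(m.group(1))
    if call.get("name") != "request_local_region":
        return None
    return call["arguments"]["bbox_2d"]

def to_full_resolution(bbox, scale: float):
    """Map a bbox from the downsampled global view to full-resolution pixels."""
    return [round(v * scale) for v in bbox]

turn = ('<think>...I need to zoom in on that area.</think>'
        '<tool_call>{"name": "request_local_region", '
        '"arguments": {"bbox_2d": [418, 189, 440, 214]}}</tool_call>')
bbox = parse_local_region(turn)           # [418, 189, 440, 214]
crop_box = to_full_resolution(bbox, 4.0)  # [1672, 756, 1760, 856]
```

The crop defined by `crop_box` is then taken from the original high-resolution image and fed back to the model for the next round.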

Installation

The environment setup follows veRL.

git clone https://github.com/AdaptVision/AdaptVision.git
cd AdaptVision
conda create -n adaptvision python=3.11 -y
conda activate adaptvision
# veRL
pip3 install -e .
# flash-attn
pip3 install flash-attn==2.7.3 --no-build-isolation

pip install transformers==4.51.0
pip install math_verify
pip install "ray[default]"
pip install tensordict==0.6.2
pip install qwen_vl_utils
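After installation, a quick sanity check can confirm the pinned packages are present before launching training. This is a generic snippet, not part of the repo:

```python
from importlib.metadata import PackageNotFoundError, version

def check_packages(required):
    """Return {name: installed_version_or_None} for each required distribution."""
    report = {}
    for name in required:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report

report = check_packages(["transformers", "flash-attn", "tensordict", "qwen-vl-utils"])
for name, ver in report.items():
    print(f"{name}: {ver or 'MISSING'}")
```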

Train

Data Preparation

# train file
huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-Smart-Train --local-dir datasets/VisionThink-Smart-Train

# val file
huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-Smart-Val --local-dir datasets/VisionThink-Smart-Val

Train AdaptVision via Reinforcement Learning

To use GPT as the reward model, first set the following environment variables:

export AZURE_API_KEY=
export AZURE_ENDPOINT=
export AZURE_API_VERSION=
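With those variables set, the GPT reward model is reached through the Azure OpenAI API. The sketch below is a minimal illustration of what such a judge call could look like; the prompt wording, the `judge_answer` helper, and the `gpt-4o` deployment name are assumptions, not the repo's actual reward code:

```python
import os

def build_judge_prompt(question: str, prediction: str, reference: str) -> str:
    """Ask the judge to grade a model answer against the reference answer."""
    return (
        "You are grading a visual question answering response.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Reply with 1 if the model answer is correct, otherwise 0."
    )

def judge_answer(question, prediction, reference):
    """Query the Azure-hosted GPT judge; requires the env vars above to be set."""
    from openai import AzureOpenAI  # deferred so the sketch imports without openai installed
    client = AzureOpenAI(
        api_key=os.environ["AZURE_API_KEY"],
        azure_endpoint=os.environ["AZURE_ENDPOINT"],
        api_version=os.environ["AZURE_API_VERSION"],
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: substitute your Azure deployment name
        messages=[{"role": "user",
                   "content": build_judge_prompt(question, prediction, reference)}],
    )
    return resp.choices[0].message.content.strip()
```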

Run AdaptVision Training:

bash scripts/run_adaptvision.sh

Evaluation

We use lmms-eval to evaluate our model. Set up the evaluation environment by following the instructions here.

We provide the evaluation code in scripts/vllm_adaptvision.py.

Citation

If you find this project useful in your research, please consider citing:

@article{lin2025adaptvision,
  title={AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition},
  author={Lin, Zichuan and Liu, Yicheng and Yang, Yang and Tao, Lvfang and Ye, Deheng},
  journal={arXiv preprint arXiv:2512.03794},
  year={2025}
}

Acknowledgement

We would like to thank veRL and lmms-eval for their great work.

License

  • AdaptVision is licensed under the Apache License 2.0.
