
[CVPR 2026 Highlight] AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

Release

  • [2026.04.09] AdaptVision has been selected as a CVPR 2026 Highlight paper.
  • [2026.02.21] AdaptVision is accepted by CVPR 2026.
  • [2025.11.18] 🔥 AdaptVision is coming! We release the project page, paper, code and models.

Demo

The full runnable notebook is available in cookbooks/adaptvision.ipynb.

Question: Is there a stop sign facing us?

[Figure: the downsampled global view, and the high-resolution local crop requested by AdaptVision.]

Global view -> local zoom -> final answer: Yes, there is a stop sign facing us.

# Equivalent to the example in cookbooks/adaptvision.ipynb
bot = AdaptVision(model_path="AdaptVision/AdaptVision-7B")
result = bot.run("assets/test_img2.png", "Is there a stop sign facing us?")
show_result(result)

Output:

--- Round 1 ---
<think>...I need to zoom in on that area.</think>
<tool_call>{"name": "request_local_region", "arguments": {"bbox_2d": [418, 189, 440, 214]}}</tool_call>

--- Round 2 ---
<answer>Yes, there is a stop sign facing us.</answer>

AdaptVision first reasons over a downsampled global image, then requests a high-resolution local crop before producing the final answer. This active-vision loop helps preserve efficiency while recovering small but decisive details.
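The loop described above can be sketched as follows. This is a minimal illustration based on the tool-call format shown in the demo output; `generate` is a hypothetical callable standing in for the model, and the image object is duck-typed (e.g. a `PIL.Image`), so none of these names come from the actual AdaptVision API.

```python
import json
import re

def run_active_vision(generate, image, question, max_rounds=3):
    """Sketch of the global-view -> local-zoom loop.

    generate: hypothetical callable (images, question) -> model text.
    image: any object with PIL-style .resize() and .crop() methods.
    """
    tool_re = re.compile(r"<tool_call>(.*?)</tool_call>", re.S)
    answer_re = re.compile(r"<answer>(.*?)</answer>", re.S)

    # The model first reasons over a downsampled global view.
    images = [image.resize((image.width // 2, image.height // 2))]
    for _ in range(max_rounds):
        text = generate(images, question)
        if (m := answer_re.search(text)):
            return m.group(1).strip()  # final answer produced
        if (m := tool_re.search(text)):
            call = json.loads(m.group(1))
            if call.get("name") == "request_local_region":
                # Crop the requested bbox from the full-resolution image
                # and append it for the next round.
                x1, y1, x2, y2 = call["arguments"]["bbox_2d"]
                images.append(image.crop((x1, y1, x2, y2)))
    return None
```

The key design point is that the expensive high-resolution pixels are only acquired when the model explicitly asks for them, which is what keeps the average token budget low.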

Installation

The environment setup follows veRL.

git clone https://github.com/AdaptVision/AdaptVision.git
conda create -n adaptvision python=3.11 -y
conda activate adaptvision
# veRL
pip3 install -e . 
# flash-attn
pip3 install flash-attn==2.7.3 --no-build-isolation

pip install transformers==4.51.0
pip install math_verify
pip install ray[default]
pip install tensordict==0.6.2
pip install qwen_vl_utils

Train

Data Preparation

# train file
huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-Smart-Train --local-dir datasets/VisionThink-Smart-Train

# val file
huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-Smart-Val --local-dir datasets/VisionThink-Smart-Val

Train AdaptVision via Reinforcement Learning

To use GPT as the reward model, first set the following environment variables:

AZURE_API_KEY=
AZURE_ENDPOINT=
AZURE_API_VERSION=
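For example, in the shell that will launch training (the values below are placeholders, not real credentials; only the variable names come from the list above):

```shell
# Placeholder values -- substitute your own Azure OpenAI credentials.
export AZURE_API_KEY="<your-azure-api-key>"
export AZURE_ENDPOINT="https://<your-resource>.openai.azure.com/"
export AZURE_API_VERSION="<api-version>"
```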

Run AdaptVision Training:

bash scripts/run_adaptvision.sh

Evaluation

We use lmms-eval to evaluate our model. Set up the evaluation environment by following the instructions here.

The evaluation code is provided in scripts/vllm_adaptvision.py.

Citation

If you find this project useful in your research, please consider citing:

@article{lin2025adaptvision,
  title={AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition},
  author={Lin, Zichuan and Liu, Yicheng and Yang, Yang and Tao, Lvfang and Ye, Deheng},
  journal={arXiv preprint arXiv:2512.03794},
  year={2025}
}

Acknowledgement

We would like to thank the following repos for their great work:

License

  • AdaptVision is licensed under the Apache License 2.0.
