DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models
Official implementation of the CVPR 2026 paper
TL;DR. DeepScan is a training-free framework for visually grounded reasoning in LVLMs. It follows a bottom-up pipeline with Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning, enabling stronger grounded reasoning without any additional training.
- TODO. Batched Inference.
- 2026-04. Evaluation scripts are released.
- 2026-03. The core codebase is open-sourced.
- 2026-02. DeepScan was accepted to CVPR 2026 main track.
DeepScan improves visually grounded reasoning by decomposing inference into three stages:
- Hierarchical Scanning: discover local visual cues and recover candidate evidence regions.
- Refocusing: search for the smallest complete view that preserves the required context.
- Evidence-Enhanced Reasoning: answer the question using an ordered multi-image evidence memory.
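Conceptually, the three stages compose into a single loop over one image-question pair. The sketch below is purely illustrative; the function names `scan`, `refocus`, and `reason` are hypothetical stand-ins, not the official API:

```python
# Hypothetical sketch of the DeepScan three-stage pipeline.
# `scan`, `refocus`, and `reason` are illustrative callables, not the repo's API.

def deepscan(image, question, scan, refocus, reason, k=3):
    """Run the training-free, three-stage pipeline on one sample."""
    # 1. Hierarchical Scanning: discover candidate evidence regions.
    candidates = scan(image, question)
    # 2. Refocusing: shrink each retained candidate to its smallest complete view.
    views = [refocus(image, region) for region in candidates[:k]]
    # 3. Evidence-Enhanced Reasoning: answer from the ordered evidence memory.
    return reason(question, evidence=[image] + views)
```

Here `k` plays the role of the retained evidence count discussed in the efficiency section below.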
Unlike RL-based grounded reasoning approaches, DeepScan is plug-and-play and training-free, and can be integrated with different LVLM backbones directly at test time.
DeepScan requires three conda environments:

- `lavis` for the search expert
- `langsam` for the visual expert and SAM2
- `deepscan` for the main LVLM runtime
```bash
git clone https://github.com/YChenL/DeepScan
cd DeepScan

conda env create -f lavis.yml
conda activate lavis

conda env create -f langsam.yml
conda activate langsam

conda env create -f dyfo.yml
conda activate deepscan
```

After creating the `lavis` environment, modify:

```
your_env_path/lavis/models/blip_models/blip_image_text_matching.py
```
Replace the following line:

```python
# encoder_input_ids[:, 0] = self.tokenizer.enc_token_id
encoder_input_ids[:, 0] = self.tokenizer.convert_tokens_to_ids("[ENC]")
```

This patch is required by the BLIP-based search expert used in DeepScan.
Please download the following checkpoints from Hugging Face:

- google-bert/bert-base-uncased
- IDEA-Research/grounding-dino-base
- facebook/sam2.1-hiera-small
- facebook/sam2.1-hiera-base-plus
- Qwen/Qwen2.5-VL-7B-Instruct
- Qwen/Qwen3-VL-8B-Instruct
Example:

```bash
huggingface-cli download google-bert/bert-base-uncased \
    --local-dir bert-base-uncased \
    --local-dir-use-symlinks False \
    --resume-download
```

Download the other checkpoints in the same way.
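The same downloads can also be scripted with the `huggingface_hub` Python API. This is a sketch assuming `pip install huggingface_hub`; the repo list is taken from this README, while `local_dir_for`, `download_all`, and the `checkpoints/` root are our own conventions:

```python
# Fetch all required checkpoints via huggingface_hub (assumes it is installed).
# The helper names and the `checkpoints/` layout are ours, not the repo's.
from pathlib import Path

REPOS = [
    "google-bert/bert-base-uncased",
    "IDEA-Research/grounding-dino-base",
    "facebook/sam2.1-hiera-small",
    "facebook/sam2.1-hiera-base-plus",
    "Qwen/Qwen2.5-VL-7B-Instruct",
    "Qwen/Qwen3-VL-8B-Instruct",
]

def local_dir_for(repo_id: str, root: str = "checkpoints") -> str:
    """Mirror each repo into <root>/<owner>/<name>."""
    return str(Path(root) / repo_id)

def download_all(root: str = "checkpoints") -> None:
    from huggingface_hub import snapshot_download
    for repo in REPOS:
        snapshot_download(repo_id=repo, local_dir=local_dir_for(repo, root))
```

Keeping the owner/name hierarchy makes the placeholder paths in the later configuration steps easy to fill in.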
Please place the `facebook/sam2.1-hiera-base-plus` checkpoint into the `checkpoints` directory of the installed `sam2` package inside the `langsam` environment.
Before launching DeepScan, replace the local paths in the following files.

File: `code/scripts/blip_server/blip_service.py`. Set:

```python
LOCAL_TOKENIZER_PATH = "your/local/google-bert/bert-base-uncased/path"
```

File: `code/scripts/expert_server/model_service.py`. Example:

```python
self.model = LangSAM(
    sam_type="sam2.1_hiera_small",
    ckpt_path_sam="/your/local/facebook/sam2.1-hiera-small/sam2.1_hiera_small.pt",
    ckpt_path_gdino="/your/local/IDEA-Research/grounding-dino-base"
)
```

File: `code/scripts/sam2_server/sam2_service.py`. Set:

```python
SAM2_REPO_ROOT = Path("/your/envs/langsam/lib/python3.11/site-packages/sam2")
```

This should be the absolute path to the installed `sam2` package in your `langsam` environment.
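A stale path in any of these files typically only fails at request time, so it can help to verify them up front. A minimal sketch (the `check_paths` helper is ours, not part of the repo):

```python
# Verify configured checkpoint/package paths before starting the servers.
# `check_paths` is a hypothetical helper, not part of the DeepScan codebase.
from pathlib import Path

def check_paths(paths: dict) -> list:
    """Return the names of entries whose path does not exist on disk."""
    return [name for name, p in paths.items() if not Path(p).exists()]
```

For example, `check_paths({"tokenizer": LOCAL_TOKENIZER_PATH, "sam2_repo": str(SAM2_REPO_ROOT)})` should return an empty list once everything is configured.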
Please download the evaluation datasets from oking0197/Dyfo, then organize them as follows:
```
DeepScan/
├── code/
│   ├── scripts/
│   └── src/
└── playground/
    └── data/
        └── eval/
            ├── vstar/
            └── ...
```
DeepScan uses three components:

Search expert. We use BLIP-ITM to produce patch-wise Grad-CAM attention maps for cue discovery.

Visual expert. The visual expert provides:

- point-prompt segmentation
- text-conditioned detection

In our implementation, this is realized by:

- a LangSAM-based detection service
- a SAM2 point-prompt segmentation service

LVLM backbone. The pipeline supports compatible LVLM backbones such as:

- Qwen2.5-VL-7B-Instruct
- Qwen3-VL-8B-Instruct
DeepScan is a multi-service pipeline. You need to start:
- the visual expert server
- the search expert server
- the SAM2 segmentation server
Below is one example setup using two RTX 4090 GPUs:

- `cuda:0` for expert services
- `cuda:1` for the main LVLM runtime
```bash
conda activate langsam
bash code/scripts/expert_server/start_server.sh
```

Expected log:

```
Starting server on port 8000
INFO: Started server process [xxxxx]
INFO: Waiting for application startup.
INFO: Port: 8000, Uptime: 0.00s, Current queue size: 0
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
```bash
conda activate lavis
bash code/scripts/blip_server/start_server.sh
```

Expected log:

```
Starting server on port 8100
--- loading local tokenizer from: /your/local/bert-base-uncased/ ---
INFO: Started server process [xxxxx]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8100 (Press CTRL+C to quit)
```
```bash
conda activate langsam
bash code/scripts/sam2_server/start_server.sh
```

Expected log:

```
Starting server on port 8200
Working directory changed to: /your/envs/langsam/lib/python3.11/site-packages/sam2
INFO: Started server process [xxxxx]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8200 (Press CTRL+C to quit)
```
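Before starting the main runtime, you can confirm that all three services are reachable. The sketch below does a plain TCP connect to the ports shown in the logs above; it is a generic liveness probe of ours, not an official health endpoint:

```python
# Liveness probe for the three expert services (ports taken from the logs above).
# A bare TCP connect is used; this is our convenience check, not part of the repo.
import socket

SERVICES = {"expert": 8000, "blip": 8100, "sam2": 8200}

def wait_ready(host: str = "127.0.0.1", timeout: float = 1.0) -> dict:
    """Return a name -> reachable map for each expert service."""
    status = {}
    for name, port in SERVICES.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                status[name] = True
        except OSError:
            status[name] = False
    return status
```

If any entry comes back `False`, check the corresponding server's log and the port configured in its start script.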
Note: Please adjust the ports, CUDA devices, and checkpoint paths in the scripts before launching.
After the expert services are ready, switch to the main runtime environment and run:

```bash
conda activate deepscan
bash code/scripts/vstar/stream_vstar_qwen.sh oursmcts False
```

Before running, please update the checkpoint path inside the script, e.g.

```bash
CKPT="/your/local/path/for/Qwen-VL"
```

Expected log:

```
Creating samples: 100%|██████████| 191/191 [00:04<00:00, ...it/s]
Processing samples:   0%|          | 0/191 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, ...it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k'].
...
```
You may also invoke the main entry point directly through code/src/run.py, depending on your local setup.
A typical setup can run on 2 × RTX 4090.
Example `nvidia-smi` snapshot:

```
+-----------------------------------------------------------------------------------------+
| GPU  Name                            Memory-Usage                                       |
|   0  NVIDIA GeForce RTX 4090         11199MiB / 24576MiB                                |
|   1  NVIDIA GeForce RTX 4090         17307MiB / 24576MiB                                |
+-----------------------------------------------------------------------------------------+
```
A practical allocation is:
- GPU 0: visual expert + search expert + SAM2 server
- GPU 1: Qwen-VL main runtime
- On 2 × RTX 4090, evaluating V* takes about 3 hours in total, which is roughly 1 minute per sample.
- On 4 × RTX 4090, one GPU can be used to host the expert servers, while the remaining three GPUs run the main evaluation with DDP-based data splitting.
- Under this 4-GPU setup, the runtime is reduced to about 20 seconds per sample, which is close to the efficiency reported in the paper.
DeepScan is a test-time scaling framework. Its inference cost is higher than one-shot inference, but it provides a clear performance-efficiency trade-off through:

- patch size
- retained evidence count k
- batched engineering optimizations
The optimized implementation benefits from:
- batched attention-map computation
- batched top-k evidence judgment
- batched view justification
- vLLM-based serving
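As a rough mental model, the first two knobs multiply: halving the patch size roughly quadruples the number of candidate patches to scan, and each retained evidence view adds another pass through the model. The sketch below is purely illustrative; the field names and the cost formula are our own, not the codebase's:

```python
# Hypothetical config illustrating DeepScan's performance-efficiency knobs.
# Field names and the cost estimate are ours; the official codebase may differ.
from dataclasses import dataclass

@dataclass
class ScanConfig:
    patch_size: int = 224       # granularity of Hierarchical Scanning
    top_k_evidence: int = 3     # retained evidence count k
    batched: bool = True        # batched attention maps / top-k judgment / justification

    def est_relative_cost(self, image_side: int) -> int:
        """Rough relative cost: number of scan patches times retained evidence."""
        patches = (image_side // self.patch_size) ** 2
        return patches * self.top_k_evidence
```

For example, a 896-pixel image with 224-pixel patches yields a 4 × 4 scan grid, so retaining k = 3 evidence views gives an estimated relative cost of 48 units under this toy model.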
DeepScan builds on several excellent open-source projects and model ecosystems. We especially thank DyFo for its inspiring open-source release.
We also acknowledge:
- Qwen2-VL / Qwen2.5-VL / Qwen3-VL
- LAVIS
- LangSAM
- SAM2
- vLLM
If you find DeepScan useful, please cite:
```bibtex
@article{li2026deepscan,
  title={DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models},
  author={Li, Yangfu and Zhan, Hongjian and Chen, Jiawei and Gong, Yuning and Liu, Qi and Lu, Yue},
  journal={arXiv preprint arXiv:2603.03857},
  year={2026}
}
```