This is the code repository for the paper:
CoV: Chain-of-View Prompting for Spatial Reasoning
Haoyu Zhao*, Akide Liu*, Zeyu Zhang*, Weijie Wang*, Feng Chen, Ruihan Zhu, Gholamreza Haffari and Bohan Zhuang†
*Equal contribution. †Corresponding author.
ACL 2026 Findings
Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached.
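As a sketch, the coarse-to-fine loop described above can be summarized in a few lines of Python. The helpers passed in (select_anchor_views, vlm_reason, render_view) and the Decision fields are illustrative placeholders, not the repository's actual API:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Decision:
    answer: Optional[str] = None         # final answer, if context suffices
    camera_action: Optional[str] = None  # e.g. "turn_left", "move_forward"

def chain_of_view(question, frames, scene, select_anchor_views,
                  vlm_reason, render_view, max_steps=10):
    # Coarse stage: drop redundant frames, keep question-aligned anchor views.
    views = select_anchor_views(question, frames)
    # Fine stage: interleave reasoning with discrete camera actions.
    for _ in range(max_steps):
        decision = vlm_reason(question, views)
        if decision.answer is not None:  # sufficient context gathered
            return decision.answer
        # Render a new observation from the 3D scene for the chosen action.
        views.append(render_view(scene, decision.camera_action))
    # Step budget exhausted: answer from the views collected so far.
    return vlm_reason(question, views, force_answer=True).answer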
- 2026-01-09 We release the paper on arXiv.
.
├── cov/ # Main package
├── scripts/ # Utility scripts
├── tools/ # Data processing tools
├── main.py # Main entry point
├── pixi.toml # Pixi environment configuration
└── README.md
- Python 3.9+
- CUDA support (recommended for Habitat-Sim)
The project uses Pixi for dependency management:
# Install dependencies
pixi install
# Activate the environment
pixi shell

Create a .env file in the root directory with your API credentials:
# OpenAI
OPENAI_API_KEY=[your_key_here]
# OpenRouter
OPENROUTER_API_BASE=https://openrouter.ai/api/v1
OPENROUTER_API_KEY=[your_key_here]
# DashScope
DASHSCOPE_API_BASE=https://dashscope.aliyuncs.com/compatible-mode/v1
DASHSCOPE_API_KEY=[your_key_here]

- Download the OpenEQA dataset following the instructions in the original OpenEQA repository.
- Place question files in the data/ directory
- Place scene frames in the data/frames/ directory
Run the agent on OpenEQA questions:
# Specify models.
python main.py model=qwen
# Specify min_action_step
python main.py model=qwen min_action_step=7

You can set your own model backend in cov/config.py.
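If you add a backend, the entry might look roughly like the following. The dictionary layout and field names here are hypothetical; the actual schema is whatever cov/config.py defines:

# Hypothetical backend registry; mirror the structure used in cov/config.py.
MODEL_BACKENDS = {
    "qwen": {
        "api_base": "https://dashscope.aliyuncs.com/compatible-mode/v1",
        "model_name": "qwen-vl-max",      # assumed model id
        "api_key_env": "DASHSCOPE_API_KEY",
    },
    # Register your own backend under a new key, then select it with
    # `python main.py model=<key>`.
}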
Results are saved to the configured output directory with:
- JSON files containing answers and metadata (see the loading sketch after this list)
- HTML reports showing navigation history and visualizations
- Screenshots of selected views and bird's eye views
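As an illustration, the answer records can be inspected with a few lines of Python. The output path and field names below are assumptions for the sake of the example, not the repository's exact schema:

import json

# Path and keys are illustrative; check your configured output directory.
with open("output/results.json") as f:
    results = json.load(f)

for record in results:
    print(record.get("question"), "->", record.get("answer"))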
For evaluation, please follow the LLM-Match protocol from OpenEQA.
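For reference, LLM-Match asks an LLM judge to rate each predicted answer against the ground truth on a 1-5 scale, and OpenEQA aggregates the per-question ratings sigma_i as (1/N) * sum((sigma_i - 1) / 4) * 100. A minimal aggregation helper is sketched below; the judging itself should use OpenEQA's official prompts:

def llm_match_score(ratings):
    # ratings: per-question judge scores in {1, 2, 3, 4, 5}
    return 100.0 * sum((r - 1) / 4 for r in ratings) / len(ratings)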
If you use Chain-of-View (CoV) in your research, please cite:
@article{zhao2026cov,
title={CoV: Chain-of-View Prompting for Spatial Reasoning},
author={Zhao, Haoyu and Liu, Akide and Zhang, Zeyu and Wang, Weijie and Chen, Feng and Zhu, Ruihan and Haffari, Gholamreza and Zhuang, Bohan},
journal={arXiv preprint arXiv:2601.05172},
year={2026}
}
