Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during reasoning. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which raises annotation or computational cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding by using visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos, without visual prompting, at inference time. VisionCoach consists of two components: (1) a Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) a Spatio-Temporal Reasoner, optimized with RL under visual-prompt guidance and object-aware grounding rewards that enforce object-identity consistency and multi-region bounding-box IoU. Extensive experiments demonstrate that VisionCoach achieves state-of-the-art performance across diverse video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single, efficient inference pathway without external tools.
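The object-aware grounding reward described above (multi-region bounding-box IoU gated by object-identity consistency) can be sketched as follows. This is a minimal illustration, not the released implementation: the function names, the region format (`label` plus an `(x1, y1, x2, y2)` box), and the exact matching rule are assumptions for clarity.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred_regions, gt_regions):
    """Object-aware grounding reward (illustrative).

    Each region is a dict {"label": str, "box": (x1, y1, x2, y2)}.
    For every ground-truth region, take the best IoU among predicted
    regions whose object label matches (identity consistency); a
    prediction with the wrong identity contributes zero even if its
    box overlaps. The reward averages these per-region scores.
    """
    if not gt_regions:
        return 0.0
    total = 0.0
    for gt in gt_regions:
        best = 0.0
        for pred in pred_regions:
            if pred["label"] == gt["label"]:  # identity must match first
                best = max(best, box_iou(pred["box"], gt["box"]))
        total += best
    return total / len(gt_regions)
```

Under this sketch, a prediction that localizes the right box but names the wrong object receives no credit, which is the behavior the identity-consistency constraint is meant to enforce.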
Run the following script to set up the environment:
bash ./setup.sh

We use the same dataset as Open-o3-Video. The data structure is organized as follows:
data/videos/
├── stgr/ # STGR (Spatio-Temporal Grounded Reasoning)
│ ├── plm/ # PLM subset
│ │ ├── kfs/ # Keyframes
│ │ └── videos/ # Video files
│ └── temporal_grounding/ # Temporal grounding subset
│ ├── kfs/
│ └── videos/
├── tvg_r1/ # TVG-R1
│ └── videomind_data/
├── videor1/ # Video-R1
├── videoespresso/ # VideoEspresso
├── treevgr/ # TreeVGR
├── gqa/ # GQA
└── timerft/ # TimeRFT
Cold start initialization:
bash ./src/scripts/run_sft_video.sh

Reinforcement learning with GRPO:
bash ./src/scripts/run_grpo_video.sh

We provide evaluation scripts for benchmarks including V-STAR, VideoMME, VideoMMMU, World-Sense, and PerceptionTest.
cd eval
bash ./scripts/eval_all.sh

To be released:
- VP-Selector training code
- Visual prompting generation code
If you use our work or the implementation in this repo, or find them helpful, please consider citing:
@misc{lee2026visioncoachreinforcinggroundedvideo,
      title={VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting},
      author={Daeun Lee and Shoubin Yu and Yue Zhang and Mohit Bansal},
      year={2026},
      eprint={2603.14659},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.14659},
}

We sincerely thank the following projects for their contributions to this work:

