
Explore-Then-Select

Official implementation for Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering (EMNLP 2025 Main). [Paper]

Introduction

Video question answering benefits from the rich information available in videos, enabling a wide range of applications. However, the large number of tokens generated from longer videos poses significant challenges to memory efficiency and model performance. To alleviate this issue, existing works compress video inputs, but they usually overlook the varying importance of static and dynamic information across different queries, leading to inefficient token usage within limited budgets. To tackle this, we propose a novel token selection strategy, explore-then-select, which adaptively balances the static and dynamic information needed for each question. Our framework first explores different token allocations between key frames, which preserve spatial details, and delta frames, which capture temporal changes. It then employs a query-aware, attention-based metric to select the optimal token combination without any model updates. The framework is plug-and-play and can be seamlessly integrated into diverse video-language models. Extensive experiments show that our method achieves significant performance improvements (up to 5.8%) across various video question answering benchmarks.
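To make the idea concrete, below is a minimal, illustrative sketch of the explore-then-select loop. All names, tensor shapes, and the attention-sharpness scoring heuristic are hypothetical simplifications for exposition, not the repository's actual API; see the model/ and processor/ directories for the real implementation.

    # Illustrative sketch only -- hypothetical names and shapes, not the repo's API.
    import torch

    def select_tokens(frames, query, budget, num_candidates=5):
        """frames: [T, N, D] per-frame patch tokens; query: [D]; budget: tokens kept."""
        T, N, D = frames.shape
        key_tokens = frames.reshape(T * N, D)                    # spatial detail
        delta_tokens = (frames[1:] - frames[:-1]).reshape(-1, D) # temporal change
        best_score, best_sel = -float("inf"), None
        # Explore: try several splits of the budget between key and delta tokens.
        for i in range(num_candidates):
            k = budget * i // (num_candidates - 1)               # key-frame tokens
            d = budget - k                                       # delta-frame tokens
            sel = torch.cat([key_tokens[:k], delta_tokens[:d]])  # naive truncation for brevity
            # Select: a query-aware attention proxy scores each candidate set.
            attn = torch.softmax(sel @ query / D ** 0.5, dim=0)
            score = attn.max().item()                            # hypothetical sharpness metric
            if score > best_score:
                best_score, best_sel = score, sel
        return best_sel

    # e.g. tokens = select_tokens(torch.randn(8, 196, 64), torch.randn(64), budget=256)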

Repo Structure

.
├── model # modified model files
├── processor # modified processor files
├── scripts # running shell scripts
├── test # test python entries
└── utils # dataset-related files

Environment Setup

You should first install LLaVA following the instructions in the LLaVA-NeXT repository (https://github.com/LLaVA-VL/LLaVA-NeXT), then run pip install -r requirements.txt. If you want to run Qwen2.5-VL, refer to requirements-qwen2.5-vl.txt instead.
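A typical setup flow looks like this (a minimal sketch; follow the LLaVA-NeXT README for the exact install command and extras):

    git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
    cd LLaVA-NeXT
    pip install -e .   # see the LLaVA-NeXT README for the exact extras
    cd ..
    pip install -r requirements.txt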

Dataset Preparation

VideoMME & EgoSchema & ActivityNet-QA & MLVU-Test

MSVD-QA & MSRVTT-QA

Path Configuration

Before running the code, you need to update the paths in several files to match your local environment:

  1. All test scripts in the test/ directory contain default path arguments that need to be modified:

    parser.add_argument("--model", type=str, default="/path/to/dataset/Qwen2-VL-7B-Instruct")
    parser.add_argument("--dataset", type=str, default="/path/to/dataset/VideoMME")
    parser.add_argument("--video_path", type=str, default="/path/to/dataset/VideoMME/video")
  2. Replace these defaults with the actual paths where you have downloaded and stored:

    • The model files (e.g., Qwen2-VL-7B-Instruct, llava-onevision-qwen2-7b-ov)
    • The dataset folders (VideoMME, EgoSchema, MLVU, MSVD-QA, MSRVTT-QA, ActivityNet-QA)
    • The video folders within each dataset

Running the Code

Test

After configuring all paths, you can run the test using:

python test/test_mme_qwen.py \
    --model /your/actual/path/to/Qwen2-VL-7B-Instruct \
    --dataset /your/actual/path/to/VideoMME \
    --video_path /your/actual/path/to/VideoMME/video

Alternatively, modify and run the corresponding shell script in the scripts/ directory.

Evaluate

To evaluate the generated results, use the script matching your benchmark:

  • test/evaluate_ego.py
  • test/evaluate_mme.py
  • test/evaluate_msvd.py
  • test/evaluate_activitynet.py
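
For example, an evaluation run generally points the script at the generated results (the --result flag below is hypothetical; check each script's argparse definitions for the actual arguments):

    # hypothetical flag name -- see the script's argparse definitions
    python test/evaluate_mme.py --result /your/path/to/mme_results.json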

Acknowledgments

We would like to express our gratitude to the following excellent projects:

  • Qwen2.5-VL: we build our plugin on top of the Qwen2-VL and Qwen2.5-VL models.
  • LLaVA-NeXT: we build our plugin on top of the LLaVA-OneVision model.
  • MovieChat: we follow its evaluation protocol for open-ended questions.
  • ALPRO: we follow its preparation of the MSVD-QA and MSRVTT-QA datasets.

We also sincerely thank the providers and curators of the datasets utilized in our project.

Citation

@article{shi2025static,
  title={Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering},
  author={Shi, Yumeng and Long, Quanyu and Wang, Wenya},
  journal={arXiv preprint arXiv:2504.21403},
  year={2025}
}
