MUPA is a cooperative multi-path, multi-agentic framework designed for Grounded VideoQA, seamlessly unifying
video grounding, question answering, answer reflection, and evidence aggregation.
By running three distinct reasoning paths and a dedicated Reflection Agent that verifies and fuses their outputs, MUPA
achieves high grounding fidelity without sacrificing answer accuracy.
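To make the multi-path idea concrete, here is a minimal, illustrative sketch of fusing the outputs of several reasoning paths. It is not the repository's actual implementation: it assumes each path emits an answer plus a temporal span, fuses answers by majority vote, and averages the agreeing paths' spans into the grounded evidence; MUPA's Reflection Agent performs a richer verification step.

```python
# Hypothetical sketch of multi-path answer/evidence fusion (not MUPA's code).
# Assumes each reasoning path returns (answer, (start_sec, end_sec)).
from collections import Counter

def fuse_paths(predictions):
    """predictions: list of (answer, (start, end)) tuples, one per path.

    Returns the majority answer and the averaged span of the paths
    that agree with it.
    """
    # Majority vote over the textual answers.
    winner, _ = Counter(ans for ans, _ in predictions).most_common(1)[0]
    # Keep only the spans predicted by agreeing paths.
    spans = [span for ans, span in predictions if ans == winner]
    # Average the agreeing spans to form a single grounded interval.
    start = sum(s for s, _ in spans) / len(spans)
    end = sum(e for _, e in spans) / len(spans)
    return winner, (start, end)

paths = [("a cat", (3.0, 8.0)), ("a cat", (4.0, 9.0)), ("a dog", (10.0, 12.0))]
answer, span = fuse_paths(paths)  # "a cat", (3.5, 8.5)
```

A real reflection step would additionally score each path's span against the question before fusing, rather than trusting the raw vote.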
| Benchmark | Evaluation Results (2B / 7B) |
|---|---|
| ZS NExT-GQA (test) | Acc@GQA: 28.7 / 30.3, IoP@0.5: 38.7 / 39.4 |
| FT DeVE-QA (test) | Acc@GQA: 43.9 / 47.4, IoP@0.5: 53.3 / 55.2 |
| ZS ActivityNet-Captions (test) | IoU@0.5: 27.2 / 31.3, mIoU: 31.4 / 33.1 |
| ZS ActivityNet-RTL (test) | IoU@0.5: 20.4 / 28.9, mIoU: 23.2 / 32.0 |
| FT TACoS (test) | IoU@0.5: 37.1 / 40.9, mIoU: 34.2 / 37.8 |

*“ZS” means zero-shot, “FT” means fine-tuned.*
| Dataset | Directory | Source Link |
|---|---|---|
| QVHighlights-QA | qvhighlights_qa | https://github.com/jayleicn/moment_detr |
| TACoS-QA | tacos_qa | https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus |
| CosMo-Cap-QA | cosmo_cap_qa | https://github.com/showlab/cosmo |
| DeVE-QA | deve_qa | https://github.com/QHUni/DeVE-QA/tree/main |
| Dataset | Directory | Source Link |
|---|---|---|
| QVHighlights | qvhighlights | https://github.com/jayleicn/moment_detr |
| DiDeMo | didemo | https://github.com/LisaAnne/LocalizingMoments/ |
| TACoS | tacos | https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus |
| QuerYD | queryd | https://www.robots.ox.ac.uk/~vgg/data/queryd/ |
| HiREST (Grounding) | hirest | https://github.com/j-min/HiREST |
| HiREST (Step Captioning) | hirest | https://github.com/j-min/HiREST |
| CosMo-Cap | cosmo_cap | https://github.com/showlab/cosmo |
| InternVid-VTime | internvid_vtime | https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid |
| Dataset | Directory | Source Link |
|---|---|---|
| QVHighlights-Verify | verifying, qvhighlights | https://github.com/jayleicn/moment_detr |
| DiDeMo-Verify | verifying, didemo | https://github.com/LisaAnne/LocalizingMoments/ |
| TACoS-Verify | verifying, tacos | https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus |
| Dataset | Type | Directory | Source Link |
|---|---|---|---|
| NExT-GQA | Grounded VideoQA | nextgqa | https://github.com/doc-doc/NExT-GQA |
| DeVE-QA | Grounded VideoQA | deve_qa | https://github.com/QHUni/DeVE-QA/tree/main |
| ActivityNet-Captions | Moment Retrieval | activitynet_captions, activitynet | https://cs.stanford.edu/people/ranjaykrishna/densevid/ |
| TACoS | Moment Retrieval | tacos | https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus |
| ActivityNet-RTL | Moment Retrieval | activitynet_rtl, activitynet | https://github.com/NVlabs/LITA |
Notes:
- For some datasets (e.g., DeVE-QA), the annotations and videos are stored in different folders. All the directories listed in Directory need to be downloaded.
- Use the following commands to concatenate and extract video tar splits (e.g., videos.tar.gz.00, videos_3fps_480_noaudio.tar.gz.00).
```bash
# videos.tar.gz.00, videos.tar.gz.01
cat videos.tar.gz.* | tar -zxvf -

# videos_3fps_480_noaudio.tar.gz.00, videos_3fps_480_noaudio.tar.gz.01
cat videos_3fps_480_noaudio.tar.gz.* | tar -zxvf -
```
Our codebase supports training and evaluating on 15 video datasets and benchmarks with the following features.
- Flexible hardware settings: NVIDIA GPU / Ascend NPU, Single-Node / Multi-Node
- Efficient training techniques: DeepSpeed ZeRO, BF16, LoRA, SDPA, FlashAttention2, Liger-Kernel
- Customizing the base LLM and conversation templates
- Monitoring the training process via Tensorboard / Wandb
- Group sampling for mixed dataset training
- Multi-process / multi-device evaluation on public benchmarks
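The group-sampling feature can be sketched roughly as follows. This is a hypothetical illustration, not the repository's sampler: it assumes each training batch should be drawn from a single dataset group so that all samples in a batch share the same task format, with batches from different groups interleaved across steps.

```python
# Hypothetical sketch of group sampling for mixed-dataset training
# (the codebase's actual sampler may differ).
import random

def group_batches(dataset_sizes, batch_size, seed=0):
    """Return a shuffled list of (dataset_name, indices) batches.

    Each batch contains indices from exactly one dataset; trailing
    samples that do not fill a batch are dropped for simplicity.
    """
    rng = random.Random(seed)
    # Shuffle indices within each dataset independently.
    pools = {name: list(range(n)) for name, n in dataset_sizes.items()}
    for pool in pools.values():
        rng.shuffle(pool)
    # Cut each pool into fixed-size, single-dataset batches.
    batches = []
    for name, pool in pools.items():
        for i in range(0, len(pool) - batch_size + 1, batch_size):
            batches.append((name, pool[i:i + batch_size]))
    # Interleave groups across training steps.
    rng.shuffle(batches)
    return batches

batches = group_batches({"qvhighlights": 10, "tacos": 7}, batch_size=4)
```

Keeping each batch homogeneous avoids mixing incompatible prompt templates or label formats within a single forward pass.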
See train.md for a quick start guide.
See eval.md for details about evaluating MUPA on public benchmarks.
