MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering

MUPA is a cooperative multi-path, multi-agent framework for Grounded VideoQA that unifies video grounding, question answering, answer reflection, and evidence aggregation.
By running three distinct reasoning paths and a dedicated Reflection Agent that verifies and fuses their outputs, MUPA achieves high grounding fidelity without sacrificing answer accuracy.
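As a rough illustration, the cooperative flow described above can be sketched as follows. This is a sketch of the general control flow only; the function names, agent internals, and the fusion rule (here, simply keeping the highest-scoring candidate) are assumptions for illustration, not MUPA's actual implementation.

```python
# Minimal sketch of the multi-path idea: several reasoning paths each propose
# an (answer, evidence_segment) candidate, and a reflection agent verifies and
# fuses them. Illustrative only; not MUPA's actual code.
def grounded_videoqa(video, question, paths, reflection_agent):
    # 1. Each reasoning path proposes an (answer, (start_sec, end_sec)) pair.
    candidates = [path(video, question) for path in paths]
    # 2. The reflection agent scores each candidate's answer and evidence.
    scored = [(reflection_agent.verify(video, question, c), c) for c in candidates]
    # 3. Fuse: here we simply keep the highest-scoring candidate.
    _, best = max(scored, key=lambda pair: pair[0])
    return best
```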

MUPA Framework Overview

🏆 MUPA on Public Benchmarks

Benchmark evaluation results (2B / 7B):

| Setting | Benchmark | Results (2B / 7B) |
| --- | --- | --- |
| ZS | NExT-GQA (test) | Acc@GQA: 28.7 / 30.3, IoP@0.5: 38.7 / 39.4 |
| FT | DeVE-QA (test) | Acc@GQA: 43.9 / 47.4, IoP@0.5: 53.3 / 55.2 |
| ZS | ActivityNet-Captions (test) | IoU@0.5: 27.2 / 31.3, mIoU: 31.4 / 33.1 |
| ZS | ActivityNet-RTL (test) | IoU@0.5: 20.4 / 28.9, mIoU: 23.2 / 32.0 |
| FT | TACoS (test) | IoU@0.5: 37.1 / 40.9, mIoU: 34.2 / 37.8 |

*“ZS” means zero-shot, “FT” means fine-tuned.
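For reference, the temporal-overlap metrics in the table can be computed as below (assumed standard definitions: IoU is intersection over union of the predicted and ground-truth spans; IoP is intersection over the predicted span; Acc@GQA additionally requires the answer to be correct). A sketch, not the official evaluation code:

```python
def overlap_metrics(pred, gt):
    """Temporal IoU and IoP for (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    iou = inter / union if union > 0 else 0.0
    iop = inter / (pred[1] - pred[0]) if pred[1] > pred[0] else 0.0
    return iou, iop

# Example: pred overlaps gt by 2 s out of a 4 s prediction.
iou, iop = overlap_metrics((2.0, 6.0), (4.0, 8.0))  # iou = 1/3, iop = 0.5
```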

📦 Datasets

GQA (179K):

| Dataset | Directory | Source Link |
| --- | --- | --- |
| QVHighlights-QA | `qvhighlights_qa` | https://github.com/jayleicn/moment_detr |
| TACoS-QA | `tacos_qa` | https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus |
| CosMo-Cap-QA | `cosmo_cap_qa` | https://github.com/showlab/cosmo |
| DeVE-QA | `deve_qa` | https://github.com/QHUni/DeVE-QA/tree/main |

Grounder (210K):

| Dataset | Directory | Source Link |
| --- | --- | --- |
| QVHighlights | `qvhighlights` | https://github.com/jayleicn/moment_detr |
| DiDeMo | `didemo` | https://github.com/LisaAnne/LocalizingMoments/ |
| TACoS | `tacos` | https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus |
| QuerYD | `queryd` | https://www.robots.ox.ac.uk/~vgg/data/queryd/ |
| HiREST (Grounding) | `hirest` | https://github.com/j-min/HiREST |
| HiREST (Step Captioning) | `hirest` | https://github.com/j-min/HiREST |
| CosMo-Cap | `cosmo_cap` | https://github.com/showlab/cosmo |
| InternVid-VTime | `internvid_vtime` | https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid |

Verifier (232K):

| Dataset | Directory | Source Link |
| --- | --- | --- |
| QVHighlights-Verify | `verifying`, `qvhighlights` | https://github.com/jayleicn/moment_detr |
| DiDeMo-Verify | `verifying`, `didemo` | https://github.com/LisaAnne/LocalizingMoments/ |
| TACoS-Verify | `verifying`, `tacos` | https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus |

Benchmarks

| Dataset | Type | Directory | Source Link |
| --- | --- | --- | --- |
| NExT-GQA | Grounded VideoQA | `nextgqa` | https://github.com/doc-doc/NExT-GQA |
| DeVE-QA | Grounded VideoQA | `deve_qa` | https://github.com/QHUni/DeVE-QA/tree/main |
| ActivityNet-Captions | Moment Retrieval | `activitynet_captions`, `activitynet` | https://cs.stanford.edu/people/ranjaykrishna/densevid/ |
| TACoS | Moment Retrieval | `tacos` | https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus |
| ActivityNet-RTL | Moment Retrieval | `activitynet_rtl`, `activitynet` | https://github.com/NVlabs/LITA |

Notes:

  1. For some datasets (e.g., DeVE-QA), the annotations and videos are stored in different folders; every directory listed in the Directory column must be downloaded.
  2. Use the following commands to concatenate and extract split video archives (e.g., `videos.tar.gz.00`, `videos_3fps_480_noaudio.tar.gz.00`):

```shell
# videos.tar.gz.00, videos.tar.gz.01, ...
cat videos.tar.gz.* | tar -zxvf -

# videos_3fps_480_noaudio.tar.gz.00, videos_3fps_480_noaudio.tar.gz.01, ...
cat videos_3fps_480_noaudio.tar.gz.* | tar -zxvf -
```
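If several split archives are present, a small loop can handle them all at once. This is a convenience sketch, not a script shipped with the repo:

```shell
# Hypothetical helper: extract every multi-part archive set in the current
# directory (videos.tar.gz.00, videos.tar.gz.01, ...) in one pass.
for first in *.tar.gz.00; do
  [ -e "$first" ] || continue      # skip if no split archives are present
  base="${first%.00}"              # e.g. videos.tar.gz
  cat "$base".?? | tar -zxvf -     # concatenate parts in order, then extract
done
```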

🚀 Training

Our codebase supports training and evaluation on 15 video datasets and benchmarks, with the following features:

  • Flexible hardware settings: NVIDIA GPU / Ascend NPU, Single-Node / Multi-Node
  • Efficient training techniques: DeepSpeed ZeRO, BF16, LoRA, SDPA, FlashAttention2, Liger-Kernel
  • Customizing the base LLM and conversation templates
  • Monitoring the training process via Tensorboard / Wandb
  • Group sampling for mixed dataset training
  • Multi-process / multi-device evaluation on public benchmarks
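As an illustration of the group-sampling idea above (each batch is drawn from a single dataset, so samples within a batch share format), here is a minimal sketch; the names and behavior are assumptions, not the codebase's actual sampler:

```python
import random

def group_sampler(dataset_sizes, batch_size, seed=0):
    """Return (dataset_index, sample_indices) batches, one dataset per batch."""
    rng = random.Random(seed)
    batches = []
    for d, n in enumerate(dataset_sizes):
        pool = list(range(n))
        rng.shuffle(pool)
        for i in range(0, len(pool), batch_size):
            batches.append((d, pool[i:i + batch_size]))
    rng.shuffle(batches)  # interleave datasets across training steps
    return batches

batches = group_sampler([10, 7], batch_size=4)  # 3 + 2 homogeneous batches
```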

See train.md for a quick start guide.

🔮 Evaluation

See eval.md for details about evaluating MUPA on public benchmarks.

About

The official repository of our paper "MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering"
