MUPA is a cooperative multi-path, multi-agentic framework designed for Grounded VideoQA, seamlessly unifying
video grounding, question answering, answer reflection, and evidence aggregation.
By running three distinct reasoning paths and a dedicated Reflection Agent that verifies and fuses their outputs, MUPA
achieves high grounding fidelity without sacrificing answer accuracy.
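To make the multi-path idea concrete, here is a minimal, illustrative sketch of fusing the outputs of several reasoning paths. It is not the repository's actual implementation: it assumes each path emits an answer plus a temporal span, fuses answers by majority vote, and averages the agreeing paths' spans into the grounded evidence; MUPA's Reflection Agent performs a richer verification step.

```python
# Hypothetical sketch of multi-path answer/evidence fusion (not MUPA's code).
# Assumes each reasoning path returns (answer, (start_sec, end_sec)).
from collections import Counter

def fuse_paths(predictions):
    """predictions: list of (answer, (start, end)) tuples, one per path.

    Returns the majority answer and the averaged span of the paths
    that agree with it.
    """
    # Majority vote over the textual answers.
    winner, _ = Counter(ans for ans, _ in predictions).most_common(1)[0]
    # Keep only the spans predicted by agreeing paths.
    spans = [span for ans, span in predictions if ans == winner]
    # Average the agreeing spans to form a single grounded interval.
    start = sum(s for s, _ in spans) / len(spans)
    end = sum(e for _, e in spans) / len(spans)
    return winner, (start, end)

paths = [("a cat", (3.0, 8.0)), ("a cat", (4.0, 9.0)), ("a dog", (10.0, 12.0))]
answer, span = fuse_paths(paths)  # "a cat", (3.5, 8.5)
```

A real reflection step would additionally score each path's span against the question before fusing, rather than trusting the raw vote.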
| Benchmark | Evaluation Results (2B / 7B) |
|---|---|
| ZS NExT-GQA (test) | Acc@GQA: 28.7 / 30.3, IoP@0.5: 38.7 / 39.4 |
| FT DeVE-QA (test) | Acc@GQA: 43.9 / 47.4, IoP@0.5: 53.3 / 55.2 |
| ZS ActivityNet-Captions (test) | IoU@0.5: 27.2 / 31.3, mIoU: 31.4 / 33.1 |
| ZS ActivityNet-RTL (test) | IoU@0.5: 20.4 / 28.9, mIoU: 23.2 / 32.0 |
| FT TACoS (test) | IoU@0.5: 37.1 / 40.9, mIoU: 34.2 / 37.8 |

*“ZS” means zero-shot, “FT” means fine-tuned.*
| Dataset | Directory | Source Link |
|---|---|---|
| QVHighlights-QA | qvhighlights_qa | https://github.com/jayleicn/moment_detr |
| TACoS-QA | tacos_qa | https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus |
| CosMo-Cap-QA | cosmo_cap_qa | https://github.com/showlab/cosmo |
| DeVE-QA | deve_qa | https://github.com/QHUni/DeVE-QA/tree/main |
| Dataset | Directory | Source Link |
|---|---|---|
| QVHighlights | qvhighlights | https://github.com/jayleicn/moment_detr |
| DiDeMo | didemo | https://github.com/LisaAnne/LocalizingMoments/ |
| TACoS | tacos | https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus |
| QuerYD | queryd | https://www.robots.ox.ac.uk/~vgg/data/queryd/ |
| HiREST (Grounding) | hirest | https://github.com/j-min/HiREST |
| HiREST (Step Captioning) | hirest | https://github.com/j-min/HiREST |
| CosMo-Cap | cosmo_cap | https://github.com/showlab/cosmo |
| InternVid-VTime | internvid_vtime | https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid |
| Dataset | Directory | Source Link |
|---|---|---|
| QVHighlights-Verify | verifying, qvhighlights | https://github.com/jayleicn/moment_detr |
| DiDeMo-Verify | verifying, didemo | https://github.com/LisaAnne/LocalizingMoments/ |
| TACoS-Verify | verifying, tacos | https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus |
| Dataset | Type | Directory | Source Link |
|---|---|---|---|
| NExT-GQA | Grounded VideoQA | nextgqa | https://github.com/doc-doc/NExT-GQA |
| DeVE-QA | Grounded VideoQA | deve_qa | https://github.com/QHUni/DeVE-QA/tree/main |
| ActivityNet-Captions | Moment Retrieval | activitynet_captions, activitynet | https://cs.stanford.edu/people/ranjaykrishna/densevid/ |
| TACoS | Moment Retrieval | tacos | https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/tacos-multi-level-corpus |
| ActivityNet-RTL | Moment Retrieval | activitynet_rtl, activitynet | https://github.com/NVlabs/LITA |
Notes:
- For some datasets (e.g., DeVE-QA), the annotations and videos are stored in different folders. All the directories listed in Directory need to be downloaded.
- Use the following commands to concatenate and extract video tar splits (e.g., videos.tar.gz.00, videos_3fps_480_noaudio.tar.gz.00).
```bash
# videos.tar.gz.00, videos.tar.gz.01
cat videos.tar.gz.* | tar -zxvf -

# videos_3fps_480_noaudio.tar.gz.00, videos_3fps_480_noaudio.tar.gz.01
cat videos_3fps_480_noaudio.tar.gz.* | tar -zxvf -
```
Our codebase supports training and evaluating on 15 video datasets and benchmarks with the following features.
- Flexible hardware settings: NVIDIA GPU / Ascend NPU, Single-Node / Multi-Node
- Efficient training techniques: DeepSpeed ZeRO, BF16, LoRA, SDPA, FlashAttention2, Liger-Kernel
- Customizing the base LLM and conversation templates
- Monitoring the training process via Tensorboard / Wandb
- Group sampling for mixed dataset training
- Multi-process / multi-device evaluation on public benchmarks
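The group-sampling feature can be sketched roughly as follows. This is a hypothetical illustration, not the repository's sampler: it assumes each training batch should be drawn from a single dataset group so that all samples in a batch share the same task format, with batches from different groups interleaved across steps.

```python
# Hypothetical sketch of group sampling for mixed-dataset training
# (the codebase's actual sampler may differ).
import random

def group_batches(dataset_sizes, batch_size, seed=0):
    """Return a shuffled list of (dataset_name, indices) batches.

    Each batch contains indices from exactly one dataset; trailing
    samples that do not fill a batch are dropped for simplicity.
    """
    rng = random.Random(seed)
    # Shuffle indices within each dataset independently.
    pools = {name: list(range(n)) for name, n in dataset_sizes.items()}
    for pool in pools.values():
        rng.shuffle(pool)
    # Cut each pool into fixed-size, single-dataset batches.
    batches = []
    for name, pool in pools.items():
        for i in range(0, len(pool) - batch_size + 1, batch_size):
            batches.append((name, pool[i:i + batch_size]))
    # Interleave groups across training steps.
    rng.shuffle(batches)
    return batches

batches = group_batches({"qvhighlights": 10, "tacos": 7}, batch_size=4)
```

Keeping each batch homogeneous avoids mixing incompatible prompt templates or label formats within a single forward pass.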
See train.md for a quick start guide.
See eval.md for details about evaluating MUPA on public benchmarks.
