🤗 Hugging Face | 📑 Paper | 🌐 Website
- [7/2025] Paper accepted to ICCV 2025!
- [3/2025] VMBench evaluation code & prompt set released!
Video generation has advanced rapidly, driving improvements in evaluation methods, yet assessing the motion in generated videos remains a major challenge. Two key issues stand out: 1) current motion metrics do not fully align with human perception; 2) existing motion prompts are limited in diversity. Based on these findings, we introduce VMBench, a comprehensive Video Motion Benchmark with perception-aligned motion metrics and the most diverse set of motion types to date. VMBench has several appealing properties:
1. **Perception-Driven Motion Evaluation Metrics.** We identify five dimensions of human perception in motion video assessment and develop fine-grained evaluation metrics for them, providing deeper insight into models' strengths and weaknesses in motion quality.
2. **Meta-Guided Motion Prompt Generation.** A structured pipeline that extracts meta-information, generates diverse motion prompts with LLMs, and refines them through human-AI validation, yielding a multi-level prompt library covering six key dynamic-scene dimensions.
3. **Human-Aligned Validation Mechanism.** We provide human preference annotations to validate our benchmark; our metrics achieve an average 35.3% improvement in Spearman's correlation over baseline methods.
To our knowledge, this is the first time the quality of motion in videos has been evaluated from the perspective of alignment with human perception.
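As a rough illustration of the validation protocol, the sketch below shows how Spearman's rank correlation between automatic metric scores and human preference annotations can be computed. The score arrays are made-up placeholders, not data from the paper:

```python
# Illustrative sketch: Spearman's rank correlation between automatic metric
# outputs and mean human preference ratings (the alignment measure used for
# validation). Both arrays below are placeholder values, not VMBench data.
from scipy.stats import spearmanr

metric_scores = [51.6, 53.2, 58.9, 60.6, 63.4, 78.4]  # placeholder metric outputs
human_scores = [2.9, 3.1, 3.4, 3.6, 3.8, 4.5]         # placeholder human ratings

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3g})")
```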
Prompt: A tourist joyfully splashes water in an outdoor swimming pool, their arms and legs moving energetically as they playfully splash around.

cogvideo-1.mp4 | hunyuan-1.mp4 | mochi-1.mp4 | opensora-1.mp4 | opensoraplan-1.mp4 | wan-1.mp4

Prompt: Three books are thrown into the air, their pages fluttering as they soar over the soccer field, landing in a scattered pattern.

cogvideo-2.mp4 | hunyuan-2.mp4 | mochi-2.mp4 | opensora-2.mp4 | opensora-plan-2.mp4 | wan-2.mp4

Prompt: Four flickering candles cast shadows as they burn steadily on the balcony, their flames dancing with the gentle breeze.

cogvideo-3.mp4 | hunyuan-3.mp4 | mochi-3.mp4 | opensora-3.mp4 | opensora-plan-3.mp4 | wan-3.mp4

Prompt: Two penguins waddle along the beach, occasionally stopping to preen their feathers before continuing their journey across the ocean shore.

cogvideo-4.mp4 | hunyuan-4.mp4 | mochi-4.mp4 | opensora-4.mp4 | opensora-plan-4.mp4 | wan-4.mp4

Prompt: In the bustling street, two kids run towards a small dog, bending down to carefully comb its fur, their hands moving swiftly.

cogvideo-5.mp4 | hunyuan-5.mp4 | mochi-5.mp4 | opensora-5.mp4 | opensora-plan-5.mp4 | wan-5.mp4

Prompt: In the garage, a young girl twirls gracefully, her arms outstretched, perfectly matching the lively country line dance beat.

cogvideo-6.mp4 | hunyuan-6.mp4 | mochi-6.mp4 | opensora-6.mp4 | opensora-plan-6.mp4 | wan-6.mp4
Abbreviations: CAS = Commonsense Adherence Score, MSS = Motion Smoothness Score, OIS = Object Integrity Score, PAS = Perceptible Amplitude Score, TCS = Temporal Coherence Score; Avg is the overall VMBench score.

| Models | Avg | CAS | MSS | OIS | PAS | TCS |
|---|---|---|---|---|---|---|
| OpenSora-v1.2 | 51.6 | 31.2 | 61.9 | 73.0 | 3.4 | 88.5 |
| Mochi 1 | 53.2 | 37.7 | 62.0 | 68.6 | 14.4 | 83.6 |
| OpenSora-Plan-v1.3.0 | 58.9 | 39.3 | 76.0 | 78.6 | 6.0 | 94.7 |
| CogVideoX-5B | 60.6 | 50.6 | 61.6 | 75.4 | 24.6 | 91.0 |
| HunyuanVideo | 63.4 | 51.9 | 81.6 | 65.8 | 26.1 | 96.3 |
| Wan2.1 | 78.4 | 62.8 | 84.2 | 66.0 | 17.9 | 97.8 |
```bash
git clone https://github.com/Ran0618/VMBench.git
cd VMBench

# Create and activate a conda environment
conda create -n VMBench python=3.10
conda activate VMBench
pip install --upgrade setuptools
pip install torch==2.5.1 torchvision==0.20.1

# Install Grounded-Segment-Anything module
cd Grounded-Segment-Anything
python -m pip install -e segment_anything
pip install --no-build-isolation -e GroundingDINO
pip install -r requirements.txt

# Install Grounded-SAM-2 module
cd ../Grounded-SAM-2
pip install -e .

# Install MMPose toolkit
pip install -U openmim
mim install mmengine
mim install "mmcv==2.1.0"
mim install "mmdet==3.2.0"
cd ../mmpose
pip install -r requirements.txt
pip install -v -e .

# Install Q-Align module
cd ../Q-Align
pip install -e .

# Install VideoMAEv2 module
cd ../VideoMAEv2
pip install -r requirements.txt

cd ..
pip install -r requirements.txt
```

Place the pre-trained checkpoint files in the `.cache` directory.
You can download our model's checkpoints from our HuggingFace repository 🤗.
You also need to download the checkpoints for Q-Align 🤗 and BERT 🤗 from their respective HuggingFace repositories.
```bash
mkdir .cache
huggingface-cli download GD-ML/VMBench --local-dir .cache/
huggingface-cli download q-future/one-align --local-dir .cache/
huggingface-cli download google-bert/bert-base-uncased --local-dir .cache/
```
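If you prefer to script the downloads, the same repositories can be fetched with the `huggingface_hub` Python API, mirroring the CLI commands above:

```python
# Mirror the huggingface-cli commands above with the huggingface_hub API.
from huggingface_hub import snapshot_download

for repo_id in ("GD-ML/VMBench", "q-future/one-align", "google-bert/bert-base-uncased"):
    snapshot_download(repo_id=repo_id, local_dir=".cache/")
```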
Please organize the pretrained models in this structure:

```
VMBench/.cache
├── google-bert
│   └── bert-base-uncased
│       ├── LICENSE
│       ......
├── groundingdino_swinb_cogcoor.pth
├── q-future
│   └── one-align
│       ├── README.md
│       ......
├── sam2.1_hiera_large.pt
├── sam_vit_h_4b8939.pth
├── scaled_offline.pth
└── vit_g_vmbench.pt
```
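Before running the evaluation, a quick sanity check against the layout above can catch missing downloads. A minimal sketch:

```python
# Sanity check: verify the expected checkpoint files/directories from the
# layout above exist under .cache/ before running the evaluation.
from pathlib import Path

expected = [
    "google-bert/bert-base-uncased",
    "groundingdino_swinb_cogcoor.pth",
    "q-future/one-align",
    "sam2.1_hiera_large.pt",
    "sam_vit_h_4b8939.pth",
    "scaled_offline.pth",
    "vit_g_vmbench.pt",
]
missing = [p for p in expected if not (Path(".cache") / p).exists()]
print("All checkpoints found." if not missing else f"Missing: {missing}")
```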
Generate videos with your model using the 1,050 prompts provided in `prompts/prompts.txt` or `prompts/prompts.json`, and organize them in the following structure:

```
VMBench/eval_results/videos
├── 0001.mp4
├── 0002.mp4
...
└── 1050.mp4
```

Note: Ensure that you maintain the correspondence between prompts and video sequence numbers. The index for each prompt can be found in the `prompts/prompts.json` file.
You can follow `sample_video_demo.py` to generate videos, or place your own generated videos, named by prompt index, into the folder above.
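The naming scheme boils down to saving one video per prompt under its zero-padded index. A minimal sketch, where the JSON field names (`index`, `prompt`) are assumptions about the schema of `prompts/prompts.json`, and `generate_video()` is a hypothetical stand-in for your model's sampler:

```python
# Sketch of the required naming scheme: one video per prompt, saved as a
# zero-padded index under eval_results/videos/. The "index" and "prompt"
# field names are assumptions -- check prompts/prompts.json for the schema.
import json
import os

os.makedirs("eval_results/videos", exist_ok=True)
with open("prompts/prompts.json") as f:
    entries = json.load(f)

for entry in entries:
    out_path = f"eval_results/videos/{int(entry['index']):04d}.mp4"
    # generate_video() is a hypothetical stand-in for your model's sampler
    video = generate_video(entry["prompt"])
    video.save(out_path)
```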
To evaluate the generated videos with VMBench, run the following command:

```bash
bash evaluate.sh your_videos_folder
```

The evaluation results for each video will be saved in `./eval_results/${current_time}/results.json`. Scores for each dimension will be saved in `./eval_results/${current_time}/scores.csv`.
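To inspect the outputs programmatically, something like the following works; the internal structure of `results.json` is an assumption, so check your actual file for the exact schema:

```python
# Inspect the outputs of evaluate.sh. The directory name is the run
# timestamp; the structure of results.json is an assumption.
import csv
import json

run_dir = "eval_results/<current_time>"  # replace with the real timestamp
with open(f"{run_dir}/results.json") as f:
    results = json.load(f)
print(f"Loaded results for {len(results)} videos")

with open(f"{run_dir}/scores.csv") as f:
    for row in csv.DictReader(f):
        print(row)  # per-dimension scores
```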
We conducted a test using the following configuration:
- Model: CogVideoX-5B
- Number of Videos: 1,050
- Frames per Video: 49
- Frame Rate: 8 FPS
Here are the time measurements for each evaluation metric:
| Metric | Time Taken |
|---|---|
| PAS (Perceptible Amplitude Score) | 45 minutes |
| OIS (Object Integrity Score) | 30 minutes |
| TCS (Temporal Coherence Score) | 2 hours |
| MSS (Motion Smoothness Score) | 2.5 hours |
| CAS (Commonsense Adherence Score) | 1 hour |
Total Evaluation Time: 6 hours and 45 minutes
We would like to express our gratitude to the following open-source repositories that our work is based on: GroundedSAM, GroundedSAM2, Co-Tracker, MMPose, Q-Align, VideoMAEv2, VideoAlign. Their contributions have been invaluable to this project.
VMBench is licensed under the Apache-2.0 license. You are free to use our code for research purposes.
If you find our repo useful for your research, please consider citing our paper:
```bibtex
@article{ling2025vmbench,
  title={VMBench: A Benchmark for Perception-Aligned Video Motion Generation},
  author={Ling, Xinran and Zhu, Chen and Wu, Meiqi and Li, Hangyu and Feng, Xiaokun and Yang, Cundian and Hao, Aiming and Zhu, Jiashu and Wu, Jiahong and Chu, Xiangxiang},
  journal={arXiv preprint arXiv:2503.10076},
  year={2025}
}
```

