VECTOR: What Happens When (WACV 2026)

Official implementation of "What Happens When: Learning Temporal Orders of Events in Videos" (WACV 2026).

Daechul Ahn* | Yura Choi* | Hyeonbeom Choi* | Seongwon Cho | San Kim | Jonghyun Choi

* Equal contribution    † Corresponding author

[Paper] [Project Page]

Overview of VECTOR Benchmark

Overview of MECoT

Setup

Environment

docker pull hyeonbeomchoi/vector

Dataset (Kinetics-700)

cd kinetics-dataset
bash k700_2020_downloader.sh
bash k700_2020_extractor.sh

Note: We synthesize the VECTOR benchmark from Kinetics-700_2020 validation videos. You can find task-specific JSONL files under kinetics-dataset/kinetics_jsonl/.
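
Before running evaluation it can help to confirm that the task files are in place. The following is a minimal sketch, not part of the repository: `summarize_jsonl` is a hypothetical helper, and it assumes the files end in `.jsonl` and hold one JSON record per non-empty line (the usual JSONL convention).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the repo): print one line per JSONL
# file in a directory, as "<file name><TAB><number of non-empty lines>".
# Assumes a .jsonl extension and one record per non-empty line.
summarize_jsonl() {
  local dir="${1:-kinetics-dataset/kinetics_jsonl}"
  local f count
  for f in "$dir"/*.jsonl; do
    [ -e "$f" ] || continue               # no match: the glob stays literal, skip it
    count=$(grep -c . "$f" || true)       # grep -c prints 0 but exits 1 on empty files
    printf '%s\t%s\n' "$(basename "$f")" "$count"
  done
}
```

Calling `summarize_jsonl` with no argument inspects `kinetics-dataset/kinetics_jsonl/` relative to the current directory; pass another path to inspect a different location.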

Model Checkpoints

# LLaVA-OneVision (baseline)
huggingface-cli download lmms-lab/llava-onevision-qwen2-7b-ov \
  --local-dir checkpoints/llava-onevision-qwen2-7b-ov

# LLaVA-OneVision finetuned on detailed multi-event descriptions (ours)
huggingface-cli download SNUMPR/llava-onevision-qwen2-7b-ov-multi-event \
  --local-dir checkpoints/llava-onevision-qwen2-7b-ov-multi-event

Evaluation

A single GPU with more than 40 GB of VRAM is enough to run inference. To speed things up, pass several comma-separated GPU IDs and inference runs in parallel across them.
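
To choose GPU IDs that satisfy the memory requirement, one option is to filter the output of `nvidia-smi`. The sketch below is an illustration only: `gpus_with_enough_vram` is a hypothetical helper, and the default threshold of 40960 MiB is an assumed reading of ">40GB" that you may want to adjust for your hardware.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the repo): read "index, memory_MiB"
# lines on stdin and print a comma-separated list of the GPU indices whose
# total memory is strictly greater than the threshold.
# The 40960 MiB default is an assumption for ">40GB".
gpus_with_enough_vram() {
  local min_mib="${1:-40960}"
  awk -F', *' -v min="$min_mib" '
    $2 + 0 > min + 0 { ids = ids (ids == "" ? "" : ",") $1 }
    END { print ids }
  '
}

# Example invocation (requires an NVIDIA driver):
#   nvidia-smi --query-gpu=index,memory.total --format=csv,noheader,nounits \
#     | gpus_with_enough_vram
```

The printed list is already in the comma-separated form that the `--gpus` option of the commands in this section expects.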

LLaVA-OneVision (baseline, 1-step)

bash scripts/run_pipeline.sh --gpus [GPU_IDS] --task_id [1-5] --level [1|2]
# e.g., bash scripts/run_pipeline.sh --gpus 0 --task_id 1 --level 1

MECoT (ours, 2-step)

bash scripts/run_pipeline.sh --gpus [GPU_IDS] --task_id [1-5] --level [1|2] --mecot
# e.g., bash scripts/run_pipeline.sh --gpus 0,1,2,3 --task_id 2 --level 2 --mecot

All Tasks

# LLaVA-OneVision baseline (default mode, no --mecot)
for TASK in 1 2 3 4 5; do
  for LEVEL in 1 2; do
    bash scripts/run_pipeline.sh --gpus 0,1,2,3 --task_id $TASK --level $LEVEL
  done
done

# MECoT
for TASK in 1 2 3 4 5; do
  for LEVEL in 1 2; do
    bash scripts/run_pipeline.sh --gpus 0,1,2,3 --task_id $TASK --level $LEVEL --mecot
  done
done

| Task | Description | Levels |
|------|-------------|--------|
| 1 | Event Sequencing | L1: 4 clips, L2: 8 clips |
| 2 | Relative Event Sequencing | L1: 4 clips, L2: 8 clips |
| 3 | Event Position Identification (single/double/triple) | L1: 4 clips, L2: 8 clips |
| 4 | Discordant Semantic Group Position Identification | L1: 4 clips, L2: 8 clips |
| 5 | Discordant Event Position Identification | L1: ABABAB+ABCABC, L2: ABABABAB+ABCABCABC |

Results are saved under results/<checkpoint_name>/.
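
For long sweeps, it may be convenient to keep a log per run and continue past a failing task instead of stopping. The wrapper below is a sketch under stated assumptions, not the authors' tooling: `run_sweep`, `PIPELINE_CMD`, and `LOG_DIR` are hypothetical names introduced here, and the only repository interface it relies on is the `scripts/run_pipeline.sh` command line shown above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not shipped with the repo): run every task/level
# combination, write one log file per run, keep going after a failure,
# and report which runs failed.
# PIPELINE_CMD and LOG_DIR are knobs of this sketch, not repository options.
run_sweep() {
  local gpus="$1"; shift                  # remaining args (e.g. --mecot) pass through
  local cmd="${PIPELINE_CMD:-bash scripts/run_pipeline.sh}"
  local log_dir="${LOG_DIR:-logs}"
  local task level log failed=""
  mkdir -p "$log_dir"
  for task in 1 2 3 4 5; do
    for level in 1 2; do
      log="$log_dir/task${task}_level${level}.log"
      # $cmd is left unquoted on purpose so it splits into command and arguments
      if ! $cmd --gpus "$gpus" --task_id "$task" --level "$level" "$@" >"$log" 2>&1; then
        failed="$failed task${task}/level${level}"
      fi
    done
  done
  if [ -n "$failed" ]; then
    echo "failed:$failed"
    return 1
  fi
  echo "all runs finished"
}
```

A possible usage is `LOG_DIR=logs/baseline run_sweep 0,1,2,3` for the baseline and `LOG_DIR=logs/mecot run_sweep 0,1,2,3 --mecot` for MECoT. Using separate log directories keeps the two sweeps from overwriting each other's logs.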

Citation

@inproceedings{ahn2026whathappenswhen,
  title={What Happens When: Learning Temporal Orders of Events in Videos},
  author={Daechul Ahn and Yura Choi and Hyeonbeom Choi and Seongwon Cho and San Kim and Jonghyun Choi},
  booktitle = {WACV},
  year={2026}
}
