Official implementation of "What Happens When: Learning Temporal Orders of Events in Videos" (WACV 2026).
Daechul Ahn* | Yura Choi* | Hyeonbeom Choi* | Seongwon Cho | San Kim | Jonghyun Choi†
* Equal contribution † Corresponding author
## Installation

```bash
docker pull hyeonbeomchoi/vector
```

## Dataset

```bash
cd kinetics-dataset
bash k700_2020_downloader.sh
bash k700_2020_extractor.sh
```

Note: We synthesize the VECTOR benchmark from Kinetics-700_2020 validation videos. The task-specific JSONL files are under `kinetics-dataset/kinetics_jsonl/`.
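As a quick sanity check on a task file, you can count samples and validate that every line parses as JSON. This is a sketch only: the field names below (`video_id`, `question`, `answer`) are illustrative assumptions, not the benchmark's documented schema, and a tiny stand-in file is built here so the snippet is self-contained; in practice, point `FILE` at a real file under `kinetics-dataset/kinetics_jsonl/`.

```shell
# Stand-in JSONL file with assumed fields -- replace with a real task file.
FILE=$(mktemp)
printf '%s\n' \
  '{"video_id": "abc123", "question": "Which event happens first?", "answer": "A"}' \
  '{"video_id": "def456", "question": "Order the events.", "answer": "B"}' > "$FILE"

wc -l < "$FILE"   # one JSON object per line -> number of samples
# Validate that every line parses as JSON.
while IFS= read -r line; do
  printf '%s' "$line" | python3 -m json.tool > /dev/null
done < "$FILE" && echo "all lines are valid JSON"
```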
## Checkpoints

```bash
# LLaVA-OneVision (baseline)
huggingface-cli download lmms-lab/llava-onevision-qwen2-7b-ov \
  --local-dir checkpoints/llava-onevision-qwen2-7b-ov

# LLaVA-OneVision finetuned on detailed multi-event descriptions (ours)
huggingface-cli download SNUMPR/llava-onevision-qwen2-7b-ov-multi-event \
  --local-dir checkpoints/llava-onevision-qwen2-7b-ov-multi-event
```

## Inference

A single GPU (>40 GB VRAM) is enough to run inference. Multiple GPUs can be specified (comma-separated) for faster parallel inference.
```bash
bash scripts/run_pipeline.sh --gpus [GPU_IDS] --task_id [1-5] --level [1|2]
# e.g., bash scripts/run_pipeline.sh --gpus 0 --task_id 1 --level 1
```

With MECoT:

```bash
bash scripts/run_pipeline.sh --gpus [GPU_IDS] --task_id [1-5] --level [1|2] --mecot
# e.g., bash scripts/run_pipeline.sh --gpus 0,1,2,3 --task_id 2 --level 2 --mecot
```

To sweep all tasks and levels:

```bash
# LLaVA-OneVision (default)
for TASK in 1 2 3 4 5; do
  for LEVEL in 1 2; do
    bash scripts/run_pipeline.sh --gpus 0,1,2,3 --task_id $TASK --level $LEVEL
  done
done

# MECoT
for TASK in 1 2 3 4 5; do
  for LEVEL in 1 2; do
    bash scripts/run_pipeline.sh --gpus 0,1,2,3 --task_id $TASK --level $LEVEL --mecot
  done
done
```

Results are saved under `results/<checkpoint_name>/`.
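To skim the saved results afterwards, something like the following works. This is a sketch only: the per-run filenames and the `accuracy` field are assumptions, not the repo's documented output schema, and a stand-in directory is created here so the snippet is self-contained; in practice the pipeline populates `results/<checkpoint_name>/`.

```shell
# Stand-in results directory with assumed filenames and fields.
RESULTS_DIR=$(mktemp -d)/llava-onevision-qwen2-7b-ov
mkdir -p "$RESULTS_DIR"
echo '{"task_id": 1, "level": 1, "accuracy": 0.62}' > "$RESULTS_DIR/task1_level1.json"
echo '{"task_id": 1, "level": 2, "accuracy": 0.55}' > "$RESULTS_DIR/task1_level2.json"

# Print one "<file>: <accuracy>" line per run.
for f in "$RESULTS_DIR"/*.json; do
  printf '%s: ' "$(basename "$f")"
  python3 -c 'import json, sys; print(json.load(open(sys.argv[1]))["accuracy"])' "$f"
done
```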
## Citation

```bibtex
@inproceedings{ahn2026whathappenswhen,
  title={What Happens When: Learning Temporal Orders of Events in Videos},
  author={Daechul Ahn and Yura Choi and Hyeonbeom Choi and Seongwon Cho and San Kim and Jonghyun Choi},
  booktitle={WACV},
  year={2026}
}
```

