Official implementation of "What Happens When: Learning Temporal Orders of Events in Videos" (WACV 2026).
Daechul Ahn* | Yura Choi* | Hyeonbeom Choi* | Seongwon Cho | San Kim | Jonghyun Choi†
* Equal contribution † Corresponding author
## Installation

```bash
docker pull hyeonbeomchoi/vector
```

## Dataset

```bash
cd kinetics-dataset
bash k700_2020_downloader.sh
bash k700_2020_extractor.sh
```

Note: We synthesize the VECTOR benchmark from Kinetics-700_2020 validation videos. The task-specific JSONL files are under `kinetics-dataset/kinetics_jsonl/`.
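As a quick sanity check on a task file, you can count samples and validate that every line parses as JSON. This is a sketch only: the field names below (`video_id`, `question`, `answer`) are illustrative assumptions, not the benchmark's documented schema, and a tiny stand-in file is built here so the snippet is self-contained; in practice, point `FILE` at a real file under `kinetics-dataset/kinetics_jsonl/`.

```shell
# Stand-in JSONL file with assumed fields -- replace with a real task file.
FILE=$(mktemp)
printf '%s\n' \
  '{"video_id": "abc123", "question": "Which event happens first?", "answer": "A"}' \
  '{"video_id": "def456", "question": "Order the events.", "answer": "B"}' > "$FILE"

wc -l < "$FILE"   # one JSON object per line -> number of samples
# Validate that every line parses as JSON.
while IFS= read -r line; do
  printf '%s' "$line" | python3 -m json.tool > /dev/null
done < "$FILE" && echo "all lines are valid JSON"
```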
## Checkpoints

```bash
# LLaVA-OneVision (baseline)
huggingface-cli download lmms-lab/llava-onevision-qwen2-7b-ov \
  --local-dir checkpoints/llava-onevision-qwen2-7b-ov

# LLaVA-OneVision finetuned on detailed multi-event descriptions (ours)
huggingface-cli download SNUMPR/llava-onevision-qwen2-7b-ov-multi-event \
  --local-dir checkpoints/llava-onevision-qwen2-7b-ov-multi-event
```

## Inference

A single GPU (>40 GB VRAM) is enough to run inference. Multiple GPUs can be specified (comma-separated) for faster parallel inference.
```bash
bash scripts/run_pipeline.sh --gpus [GPU_IDS] --task_id [1-5] --level [1|2]
# e.g., bash scripts/run_pipeline.sh --gpus 0 --task_id 1 --level 1
```

With MECoT:

```bash
bash scripts/run_pipeline.sh --gpus [GPU_IDS] --task_id [1-5] --level [1|2] --mecot
# e.g., bash scripts/run_pipeline.sh --gpus 0,1,2,3 --task_id 2 --level 2 --mecot
```

To sweep all tasks and levels:

```bash
# LLaVA-OneVision (default)
for TASK in 1 2 3 4 5; do
  for LEVEL in 1 2; do
    bash scripts/run_pipeline.sh --gpus 0,1,2,3 --task_id $TASK --level $LEVEL
  done
done

# MECoT
for TASK in 1 2 3 4 5; do
  for LEVEL in 1 2; do
    bash scripts/run_pipeline.sh --gpus 0,1,2,3 --task_id $TASK --level $LEVEL --mecot
  done
done
```

Results are saved under `results/<checkpoint_name>/`.
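To skim the saved results afterwards, something like the following works. This is a sketch only: the per-run filenames and the `accuracy` field are assumptions, not the repo's documented output schema, and a stand-in directory is created here so the snippet is self-contained; in practice the pipeline populates `results/<checkpoint_name>/`.

```shell
# Stand-in results directory with assumed filenames and fields.
RESULTS_DIR=$(mktemp -d)/llava-onevision-qwen2-7b-ov
mkdir -p "$RESULTS_DIR"
echo '{"task_id": 1, "level": 1, "accuracy": 0.62}' > "$RESULTS_DIR/task1_level1.json"
echo '{"task_id": 1, "level": 2, "accuracy": 0.55}' > "$RESULTS_DIR/task1_level2.json"

# Print one "<file>: <accuracy>" line per run.
for f in "$RESULTS_DIR"/*.json; do
  printf '%s: ' "$(basename "$f")"
  python3 -c 'import json, sys; print(json.load(open(sys.argv[1]))["accuracy"])' "$f"
done
```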
## Citation

```bibtex
@inproceedings{ahn2026whathappenswhen,
  title={What Happens When: Learning Temporal Orders of Events in Videos},
  author={Daechul Ahn and Yura Choi and Hyeonbeom Choi and Seongwon Cho and San Kim and Jonghyun Choi},
  booktitle={WACV},
  year={2026}
}
```

