This is the official implementation for the WACV 2026 paper: Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction.

Environment setup:
conda create --name=videoplan python=3.10
conda activate videoplan
pip install -r requirements.txt
pip install ninja
pip install flash-attn --no-build-isolation
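To verify the environment before moving on, the short check below can help. It is not part of the repo and assumes a CUDA-capable GPU is visible to PyTorch.

# Sanity check for the environment (not repo code).
import torch
import flash_attn  # raises ImportError here if flash-attn failed to build

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)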
Data preparation:
- Download the dataset videos.
- Put annotations under data/coin_annotations/ or data/crosstask_annotations/ (including train/test splits).
- Download Ego4D clips and annotations following the official Ego4D instructions.
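Before extracting features, a minimal layout check can catch misplaced folders (not repo code; it only verifies the directories named above exist):

# Minimal layout check; adjust to whichever dataset(s) you downloaded.
from pathlib import Path

for d in ["data/coin_annotations", "data/crosstask_annotations"]:
    print(d, "->", "found" if Path(d).is_dir() else "missing")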
Extract visual features for each dataset.

COIN / CrossTask:
# dataset_name: coin or crosstask
python data/videoplan/scripts/gen_emb_parallel.py \
--dataset dataset_name \
--video_base_path path_to_videos \
--anno_path path_to_annotations \
--output_base_path data/feats \
--num_chunks 32  # Launch 32 jobs on Slurm. Each job takes 1 GPU.

Ego4D:
python data/videoplan/scripts/gen_emb_parallel_ego4d.py \
--dataset ego4d \
--video_base_path path_to_videos \
--anno_path path_to_annotations \
--output_base_path data/feats \
--num_chunks 32  # Launch 32 jobs on Slurm. Each job takes 1 GPU. See the launcher sketch below.
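One way to fan the 32 chunks out on Slurm is a small launcher like the sketch below. This is not repo code: it assumes sbatch is on PATH and that the extraction script accepts a per-job chunk index, shown here as a hypothetical --chunk_idx flag; check gen_emb_parallel.py for the actual argument and adapt the command.

# Hypothetical Slurm fan-out: one single-GPU job per chunk.
# --chunk_idx is an assumed flag; replace it with whatever gen_emb_parallel.py expects.
import subprocess

NUM_CHUNKS = 32
for idx in range(NUM_CHUNKS):
    cmd = (
        "python data/videoplan/scripts/gen_emb_parallel.py"
        " --dataset coin"
        " --video_base_path path_to_videos"
        " --anno_path path_to_annotations"
        " --output_base_path data/feats"
        f" --num_chunks {NUM_CHUNKS} --chunk_idx {idx}"
    )
    subprocess.run(
        ["sbatch", "--gres=gpu:1", f"--job-name=emb_{idx}", f"--wrap={cmd}"],
        check=True,
    )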
Generate training data:

COIN / CrossTask:
# auxiliary tasks, used in training stage 2
python data/videoplan/scripts/gen_dataset.py \
--dataset dataset_name \
--video_base_path path_to_videos \
--anno_path path_to_annotations \
--feat_base_path data/feats \
--tasks Goal PP PPText VPAImage Goal LTA TextPP TextVPA TextGoal \
--output_path path_to_generated_json
# VPA task, used in training stage 3
python data/videoplan/scripts/gen_dataset.py \
--dataset dataset_name \
--video_base_path path_to_videos \
--anno_path path_to_annotations \
--feat_base_path data/feats \
--tasks VPA \
--output_path path_to_generated_json

Ego4D:
python scripts/gen_dataset_ego4d.py \
--dataset ego4d \
--video_base_path datasets/ego4d/clips \
--feat_base_path data/feats \
--anno_paths ego4d/annotations/fho_lta_train.json \
--output_path path_to_generated_json.json
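After generation, a quick way to confirm the output file parses is the generic check below (not repo code; the path is whatever you passed to --output_path):

# Generic check that the generated annotation file is valid JSON and non-empty.
import json

with open("path_to_generated_json.json") as f:
    data = json.load(f)
print(type(data).__name__, "with", len(data), "top-level entries")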
Download pretrained model weights:
- LLM backbone (e.g. Vicuna v1.5 7B: https://huggingface.co/lmsys/vicuna-7b-v1.5)
- Pre-trained MLP visual-to-LLM projection (“stage1”): https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/files/?p=%2Fcheckpoints%2Fvtimellm-vicuna-v1-5-7b.tar.gz
Place them under:
VideoPlan/
└── checkpoints/
    ├── vicuna-7b-v1.5/
    │   ├── pytorch_model-*.bin
    │   └── ...
    └── vtimellm-vicuna-v1-5-7b-stage1/
        └── mm_projector.bin
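One way to populate this layout is sketched below. It is not repo code: it assumes huggingface_hub is installed and that the stage-1 tarball from the link above has already been downloaded to the working directory; the extracted folder may need renaming to vtimellm-vicuna-v1-5-7b-stage1/ to match the tree.

# Sketch: fetch the Vicuna backbone and unpack the (manually downloaded) stage-1 tarball.
import tarfile
from huggingface_hub import snapshot_download  # assumes huggingface_hub is installed

snapshot_download(repo_id="lmsys/vicuna-7b-v1.5", local_dir="checkpoints/vicuna-7b-v1.5")

# Download vtimellm-vicuna-v1-5-7b.tar.gz from the cloud link above first.
with tarfile.open("vtimellm-vicuna-v1-5-7b.tar.gz") as tar:
    tar.extractall("checkpoints/")  # rename the result if it does not match the tree above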
Training consists of three stages:
- Feature Alignment — train only the visual adapter (visual encoder and LLM frozen).
- Auxiliary Task Pre-training — train on auxiliary tasks (e.g. goal prediction) with the visual encoder and adapter frozen.
- Primary Task Fine-Tuning (VPA / LTA) — fine-tune for the main task with multi-token prediction (MTP) enabled; a generic sketch of the MTP idea follows this list.
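For orientation, the sketch below illustrates the general idea behind multi-token prediction: several lightweight heads share the LLM hidden states, and head k is trained to predict the token k+1 positions ahead. The class and argument names are made up for illustration; this is not the actual VideoPlan module, so see the stage-3 training code for the real implementation.

# Generic multi-token prediction sketch (illustrative only, not the repo's module).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    """K linear heads on top of the LLM hidden states; head k predicts the
    token k+1 positions ahead, and the per-head losses are averaged."""

    def __init__(self, hidden_dim: int, vocab_size: int, num_future: int = 3):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, vocab_size) for _ in range(num_future)
        )

    def forward(self, hidden_states: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, T, D) from the language model trunk
        # token_ids:     (B, T)   ground-truth token ids at each position
        seq_len = hidden_states.size(1)
        loss = hidden_states.new_zeros(())
        for k, head in enumerate(self.heads):
            logits = head(hidden_states[:, : seq_len - k - 1])  # predicts position t + k + 1
            targets = token_ids[:, k + 1 :]
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=-100
            )
        return loss / len(self.heads)

With num_future=1 this reduces to the standard next-token loss; larger values add denser supervision during fine-tuning.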
Scripts provided:
- scripts/stage2_aux.sh
- scripts/stage3_mtp.sh
An example evaluation script: scripts/eval_coin_3.sh.
You can modify the dataset name or number of predicted steps to evaluate different configurations.
If you use VideoPlan in your work, please cite:
@article{zhang2025enhancing,
title={Enhancing visual planning with auxiliary tasks and multi-token prediction},
author={Zhang, Ce and Song, Yale and Desai, Ruta and Iuzzolino, Michael Louis and Tighe, Joseph and Bertasius, Gedas and Kottur, Satwik},
journal={arXiv preprint arXiv:2507.15130},
year={2025}
}