This is the official implementation for the WACV 2026 paper: Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction.

Environment setup:
conda create --name=videoplan python=3.10
conda activate videoplan
pip install -r requirements.txt
pip install ninja
pip install flash-attn --no-build-isolation
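To verify the environment before moving on, the short check below can help. It is not part of the repo and assumes a CUDA-capable GPU is visible to PyTorch.

# Sanity check for the environment (not repo code).
import torch
import flash_attn  # raises ImportError here if flash-attn failed to build

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)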
Data preparation:
- Download the dataset videos.
- Put annotations under data/coin_annotations/ or data/crosstask_annotations/ (including train/test splits).
- Download Ego4D clips and annotations following the official Ego4D instructions.
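Before extracting features, a minimal layout check can catch misplaced folders (not repo code; it only verifies the directories named above exist):

# Minimal layout check; adjust to whichever dataset(s) you downloaded.
from pathlib import Path

for d in ["data/coin_annotations", "data/crosstask_annotations"]:
    print(d, "->", "found" if Path(d).is_dir() else "missing")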
Extract visual features for each dataset.

COIN / CrossTask:
# dataset_name: coin or crosstask
python data/videoplan/scripts/gen_emb_parallel.py \
--dataset dataset_name \
--video_base_path path_to_videos \
--anno_path path_to_annotations \
--output_base_path data/feats \
--num_chunks 32  # Launch 32 jobs on Slurm. Each job takes 1 GPU.

Ego4D:
python data/videoplan/scripts/gen_emb_parallel_ego4d.py \
--dataset ego4d \
--video_base_path path_to_videos \
--anno_path path_to_annotations \
--output_base_path data/feats \
--num_chunks 32  # Launch 32 jobs on Slurm. Each job takes 1 GPU. See the launcher sketch below.
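One way to fan the 32 chunks out on Slurm is a small launcher like the sketch below. This is not repo code: it assumes sbatch is on PATH and that the extraction script accepts a per-job chunk index, shown here as a hypothetical --chunk_idx flag; check gen_emb_parallel.py for the actual argument and adapt the command.

# Hypothetical Slurm fan-out: one single-GPU job per chunk.
# --chunk_idx is an assumed flag; replace it with whatever gen_emb_parallel.py expects.
import subprocess

NUM_CHUNKS = 32
for idx in range(NUM_CHUNKS):
    cmd = (
        "python data/videoplan/scripts/gen_emb_parallel.py"
        " --dataset coin"
        " --video_base_path path_to_videos"
        " --anno_path path_to_annotations"
        " --output_base_path data/feats"
        f" --num_chunks {NUM_CHUNKS} --chunk_idx {idx}"
    )
    subprocess.run(
        ["sbatch", "--gres=gpu:1", f"--job-name=emb_{idx}", f"--wrap={cmd}"],
        check=True,
    )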
Generate training data:

COIN / CrossTask:
# auxiliary tasks, used in training stage 2
python data/videoplan/scripts/gen_dataset.py \
--dataset dataset_name \
--video_base_path path_to_videos \
--anno_path path_to_annotations \
--feat_base_path data/feats \
--tasks Goal PP PPText VPAImage Goal LTA TextPP TextVPA TextGoal \
--output_path path_to_generated_json
# VPA task, used in training stage 3
python data/videoplan/scripts/gen_dataset.py \
--dataset dataset_name \
--video_base_path path_to_videos \
--anno_path path_to_annotations \
--feat_base_path data/feats \
--tasks VPA \
--output_path path_to_generated_json

Ego4D:
python scripts/gen_dataset_ego4d.py \
--dataset ego4d \
--video_base_path datasets/ego4d/clips \
--feat_base_path data/feats \
--anno_paths ego4d/annotations/fho_lta_train.json \
--output_path path_to_generated_json.json
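After generation, a quick way to confirm the output file parses is the generic check below (not repo code; the path is whatever you passed to --output_path):

# Generic check that the generated annotation file is valid JSON and non-empty.
import json

with open("path_to_generated_json.json") as f:
    data = json.load(f)
print(type(data).__name__, "with", len(data), "top-level entries")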
Download pretrained model weights:
- LLM backbone (e.g. Vicuna v1.5 7B: https://huggingface.co/lmsys/vicuna-7b-v1.5)
- Pre-trained MLP visual-to-LLM projection (“stage1”): https://cloud.tsinghua.edu.cn/d/6db5d02883124826aa6f/files/?p=%2Fcheckpoints%2Fvtimellm-vicuna-v1-5-7b.tar.gz
Place them under:
VideoPlan/
└── checkpoints/
    ├── vicuna-7b-v1.5/
    │   ├── pytorch_model-*.bin
    │   └── ...
    └── vtimellm-vicuna-v1-5-7b-stage1/
        └── mm_projector.bin
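One way to populate this layout is sketched below. It is not repo code: it assumes huggingface_hub is installed and that the stage-1 tarball from the link above has already been downloaded to the working directory; the extracted folder may need renaming to vtimellm-vicuna-v1-5-7b-stage1/ to match the tree.

# Sketch: fetch the Vicuna backbone and unpack the (manually downloaded) stage-1 tarball.
import tarfile
from huggingface_hub import snapshot_download  # assumes huggingface_hub is installed

snapshot_download(repo_id="lmsys/vicuna-7b-v1.5", local_dir="checkpoints/vicuna-7b-v1.5")

# Download vtimellm-vicuna-v1-5-7b.tar.gz from the cloud link above first.
with tarfile.open("vtimellm-vicuna-v1-5-7b.tar.gz") as tar:
    tar.extractall("checkpoints/")  # rename the result if it does not match the tree above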
Training consists of three stages:
- Feature Alignment — train only the visual adapter (visual encoder and LLM frozen).
- Auxiliary Task Pre-training — train on auxiliary tasks (e.g. goal prediction) with the visual encoder and adapter frozen.
- Primary Task Fine-Tuning (VPA / LTA) — fine-tune for the main task with multi-token prediction (MTP) enabled; a generic sketch of the MTP idea follows this list.
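For orientation, the sketch below illustrates the general idea behind multi-token prediction: several lightweight heads share the LLM hidden states, and head k is trained to predict the token k+1 positions ahead. The class and argument names are made up for illustration; this is not the actual VideoPlan module, so see the stage-3 training code for the real implementation.

# Generic multi-token prediction sketch (illustrative only, not the repo's module).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    """K linear heads on top of the LLM hidden states; head k predicts the
    token k+1 positions ahead, and the per-head losses are averaged."""

    def __init__(self, hidden_dim: int, vocab_size: int, num_future: int = 3):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, vocab_size) for _ in range(num_future)
        )

    def forward(self, hidden_states: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, T, D) from the language model trunk
        # token_ids:     (B, T)   ground-truth token ids at each position
        seq_len = hidden_states.size(1)
        loss = hidden_states.new_zeros(())
        for k, head in enumerate(self.heads):
            logits = head(hidden_states[:, : seq_len - k - 1])  # predicts position t + k + 1
            targets = token_ids[:, k + 1 :]
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=-100
            )
        return loss / len(self.heads)

With num_future=1 this reduces to the standard next-token loss; larger values add denser supervision during fine-tuning.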
Scripts provided:
- scripts/stage2_aux.sh
- scripts/stage3_mtp.sh
An example evaluation script: scripts/eval_coin_3.sh.
You can modify the dataset name or number of predicted steps to evaluate different configurations.
If you use VideoPlan in your work, please cite:
@article{zhang2025enhancing,
title={Enhancing visual planning with auxiliary tasks and multi-token prediction},
author={Zhang, Ce and Song, Yale and Desai, Ruta and Iuzzolino, Michael Louis and Tighe, Joseph and Bertasius, Gedas and Kottur, Satwik},
journal={arXiv preprint arXiv:2507.15130},
year={2025}
}