VideoPlan

This is the official implementation of the WACV 2026 paper "Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction".

🚀 Installation & Setup

conda create --name=videoplan python=3.10
conda activate videoplan

pip install -r requirements.txt
pip install ninja
pip install flash-attn --no-build-isolation

📁 Dataset & Feature Preparation

COIN / CrossTask

  • Download the dataset videos.
  • Put annotations under data/coin_annotations/ or data/crosstask_annotations/ (including train/test splits).

Ego4D

  • Download Ego4D clips and annotations according to Ego4D instructions.

Generate CLIP Visual Features

COIN / CrossTask:

python data/videoplan/scripts/gen_emb_parallel.py \
  --dataset dataset_name \         # coin or crosstask
  --video_base_path path_to_videos \
  --anno_path path_to_annotations  \
  --output_base_path data/feats \
  --num_chunks 32  # Launch 32 jobs on Slurm. Each job takes 1 GPU.
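For intuition, here is a minimal sketch of how `--num_chunks` could partition the video list across the 32 Slurm jobs. This is an assumption for illustration only: the function name `select_chunk` and the strided-split strategy are hypothetical, not taken from `gen_emb_parallel.py`.

```python
# Hypothetical sketch: one way --num_chunks might split work across jobs.
# select_chunk and its strided split are illustrative assumptions, not the
# script's actual implementation.
def select_chunk(videos, chunk_idx, num_chunks):
    """Return the subset of videos that job `chunk_idx` (0-based) processes."""
    return videos[chunk_idx::num_chunks]

videos = [f"video_{i:04d}.mp4" for i in range(100)]

# The union of all 32 chunks covers every video exactly once.
assigned = sorted(v for k in range(32) for v in select_chunk(videos, k, 32))
assert assigned == sorted(videos)
```

Each Slurm job would then call the script with its own chunk index, so the 32 jobs together cover the full dataset without overlap.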

Ego4D:

python data/videoplan/scripts/gen_emb_parallel_ego4d.py \
  --dataset ego4d \
  --video_base_path path_to_videos \
  --anno_path path_to_annotations \
  --output_base_path data/feats \
  --num_chunks 32  # Launch 32 jobs on Slurm. Each job takes 1 GPU.

🛠 Instruction-Tuning Dataset Generation

COIN / CrossTask:

# auxiliary tasks, used in training stage 2
python data/videoplan/scripts/gen_dataset.py \
  --dataset dataset_name \
  --video_base_path path_to_videos \
  --anno_path path_to_annotations \
  --feat_base_path data/feats \
  --tasks Goal PP PPText VPAImage LTA TextPP TextVPA TextGoal \
  --output_path path_to_generated_json

# VPA task, used in training stage 3
python data/videoplan/scripts/gen_dataset.py \
  --dataset dataset_name \
  --video_base_path path_to_videos \
  --anno_path path_to_annotations \
  --feat_base_path data/feats \
  --tasks VPA \
  --output_path path_to_generated_json

Ego4D:

python scripts/gen_dataset_ego4d.py \
  --dataset ego4d \
  --video_base_path datasets/ego4d/clips \
  --feat_base_path data/feats \
  --anno_paths ego4d/annotations/fho_lta_train.json \
  --output_path path_to_generated_json.json

📂 Checkpoint Setup

Download the pretrained model weights and place them under:

VideoPlan/
└── checkpoints/
    ├── vicuna-7b-v1.5/
    │     ├── pytorch_model-*.bin
    │     └── ...
    └── vtimellm-vicuna-v1-5-7b-stage1/
          └── mm_projector.bin

🧠 Training Pipeline

Three stages:

  1. Feature Alignment: train only the visual adapter, with the encoder and LLM frozen.
  2. Auxiliary Task Pre-training: train on the auxiliary tasks (e.g., goal prediction), with the encoder and adapter frozen.
  3. Primary Task Fine-Tuning (VPA / LTA): fine-tune on the main task with multi-token prediction (MTP) enabled.
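To give a feel for what multi-token prediction means in stage 3, here is a hedged pure-Python sketch of how MTP training targets can be formed: at each position the model is supervised on the next k steps rather than only the next one. The function `mtp_targets` is illustrative; the paper's actual prediction-head architecture and loss are not reproduced here.

```python
# Illustrative sketch of multi-token prediction (MTP) targets.
# mtp_targets is a hypothetical helper, not code from this repo.
def mtp_targets(tokens, k):
    """For each position t, collect the k future tokens tokens[t+1 : t+1+k]."""
    return [tokens[t + 1 : t + 1 + k] for t in range(len(tokens) - k)]

steps = ["pour", "stir", "bake", "cool", "serve"]
print(mtp_targets(steps, 2))
# → [['stir', 'bake'], ['bake', 'cool'], ['cool', 'serve']]
```

Supervising several future steps at once gives the planner a denser training signal than standard next-token prediction, which is the motivation behind MTP in stage 3.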

Scripts provided:

  • scripts/stage2_aux.sh
  • scripts/stage3_mtp.sh

✅ Evaluation

An example evaluation script: scripts/eval_coin_3.sh.

You can modify the dataset name or number of predicted steps to evaluate different configurations.
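For reference, two metrics commonly reported for visual procedure planning are exact-match success rate and mean per-step accuracy. The sketch below is an assumption for illustration: `success_rate` and `mean_accuracy` are hypothetical helpers and are not necessarily the exact metrics computed by `scripts/eval_coin_3.sh`.

```python
# Hedged sketch of two common visual-planning metrics (illustrative only;
# not necessarily what scripts/eval_coin_3.sh computes).
def success_rate(preds, golds):
    """Fraction of plans whose predicted step sequence matches exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def mean_accuracy(preds, golds):
    """Average fraction of individually correct steps per plan."""
    per_plan = [
        sum(a == b for a, b in zip(p, g)) / len(g) for p, g in zip(preds, golds)
    ]
    return sum(per_plan) / len(per_plan)

preds = [["pour", "stir"], ["cut", "fry"]]
golds = [["pour", "stir"], ["cut", "bake"]]
assert success_rate(preds, golds) == 0.5   # one of two plans fully correct
assert mean_accuracy(preds, golds) == 0.75  # (1.0 + 0.5) / 2
```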

📄 Citation

If you use VideoPlan in your work, please cite:

@article{zhang2025enhancing,
  title={Enhancing visual planning with auxiliary tasks and multi-token prediction},
  author={Zhang, Ce and Song, Yale and Desai, Ruta and Iuzzolino, Michael Louis and Tighe, Joseph and Bertasius, Gedas and Kottur, Satwik},
  journal={arXiv preprint arXiv:2507.15130},
  year={2025}
}
