Matroid-X-VLM-extended

Introduction

This is a modified implementation of "Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts" (X-VLM), adapted for text-image search, in particular on the text-based person re-identification dataset RSTPReid. The model is evaluated on three datasets: MSCOCO, Flickr30k, and RSTPReid. This documentation covers how to evaluate the model on all three datasets and how to train it on RSTPReid.

Environment

Set up the conda environment

conda env create -f environment.yml
conda activate xvlm
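
Optionally, check that the environment sees PyTorch and a GPU (assuming PyTorch is installed by the environment file; the one-liner below is just a convenience check):

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"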

Evaluation scripts

  • Flickr-30k

    • Zero-shot

      python3 run.py --task "itr_flickr" --dist "1" --evaluate --load_pretrained --output_dir "output/itr_eval_flickr" --checkpoint "checkpoints/16m_base_model_state_step_199999.th" --config_override 'configs/itr_flickr/config_zeroshot.yaml' > output/itr_eval_flickr/output.txt
    • Fine-tuned

      python3 run.py --task "itr_flickr" --dist "1" --evaluate --output_dir "output/itr_eval_flickr_finetuned" --checkpoint "checkpoints/itr_flickr/checkpoint_best.pth"
  • MSCOCO-5k

    • Zero-shot

      python3 run.py --task "itr_coco" --dist "1" --evaluate --output_dir "output/itr_eval_mscoco_zeroshot" --checkpoint "checkpoints/16m_base_model_state_step_199999.th" --config_override checkpoints/itr_coco/config_zeroshot.yaml' --load_pretrained > output/itr_eval_mscoco_zeroshot/output.txt
    • Fine-tuned

      python3 run.py --task "itr_coco" --dist "1" --evaluate --output_dir "output/itr_eval_mscoco_finetuned" --checkpoint "checkpoints/itr_coco/checkpoint_9.pth" --config_override "checkpoints/itr_coco/config.yaml"
  • RSTPReid

    • Fully zero-shot

      mkdir output/itr_rstpreid_zeroshot
      
      python3 run.py --task "itr_rstpreid" --dist "1" --evaluate --load_pretrained --output_dir "output/itr_rstpreid_zeroshot" --checkpoint "checkpoints/16m_base_model_state_step_199999.th" --config_override 'configs/itr_rstpreid/config_zeroshot.yaml' > output/itr_rstpreid_zeroshot/output.txt
    • Fine-tuned on Flickr30k (zero-shot on RSTPReid)

      mkdir output/itr_rstpreid_zeroshot_flickr30k
      
      python3 run.py --task "itr_rstpreid" --dist "1" --evaluate --output_dir "output/itr_rstpreid_zeroshot_flickr30k" --checkpoint "checkpoints/itr_flickr/checkpoint_best.pth" --config_override 'configs/itr_rstpreid/config.yaml' > output/itr_rstpreid_zeroshot_flickr30k/output.txt
    • Fine-tuned on MSCOCO (zero-shot on RSTPReid)

      mkdir output/itr_rstpreid_zeroshot_mscoco
      
      python3 run.py --task "itr_rstpreid" --dist "1" --evaluate --output_dir "output/itr_rstpreid_zeroshot_mscoco" --checkpoint "checkpoints/itr_coco/checkpoint_9.pth" --config_override 'configs/itr_rstpreid/config.yaml' > output/itr_rstpreid_zeroshot_mscoco/output.txt

Finetuning on RSTPReid

  • Fine-tuned on RSTPReid (from the pre-trained checkpoint)

    mkdir output/itr_rstpreid_finetune_from_zeroshot
    
    python3 run.py --task "itr_rstpreid" --dist "f4" --output_dir "output/itr_rstpreid_finetune_from_zeroshot" --load_pretrained --checkpoint "checkpoints/16m_base_model_state_step_199999.th" --config_override 'configs/itr_rstpreid/config_zeroshot.yaml' --master_port_override 12347 > output/itr_rstpreid_finetune_from_zeroshot/output.txt
  • Fine-tuned on RSTPReid (from the COCO checkpoint)

    mkdir output/itr_rstpreid_finetune_from_coco
    
    python3 run.py --task "itr_rstpreid" --dist "f4" --output_dir "output/itr_rstpreid_finetune_from_coco" --checkpoint "checkpoints/itr_coco/checkpoint_9.pth" --config_override 'configs/itr_rstpreid/config.yaml' --master_port_override 12346 > output/itr_rstpreid_finetune_from_coco/output3.txt

X-VLM: learning multi-grained vision language alignments

Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts. Yan Zeng, Xinsong Zhang, Hang Li. arXiv 2021.

  • May 2022: The paper has been accepted by ICML 2022
  • Jan 2022: Release official PyTorch implementation and X-VLM checkpoints
  • Nov 2021: Release preprint in arXiv

X-VLM (216M parameters: swin-base + 6L text + 6L cross)

Hiring

We are looking for interns / FTEs at ByteDance AI-LAB (in Beijing / Shanghai)! If you are interested in working with us on vision language models, please send your resume to zhangxinsong.0320@bytedance.com.

Features

  • Support several backbones
    • vision encoder: deit / clip-vit / swin-transformer
    • text encoder: bert / roberta
  • Support apex O1 / O2 (mixed precision) for pre-training; see the sketch below
  • Read from and write to HDFS
  • Distributed training across nodes for both pre-training and fine-tuning

Please read the code for more details.
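
For reference, the apex O1 / O2 support listed above follows the standard apex.amp pattern. The sketch below is only a generic illustration with a toy model (it assumes NVIDIA apex is installed); it is not the repo's actual training code.

# Minimal apex.amp sketch (illustrative only; X-VLM wires this up internally)
import torch
from apex import amp  # assumes NVIDIA apex is installed

model = torch.nn.Linear(16, 2).cuda()    # toy model standing in for X-VLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# opt_level "O1" inserts casts automatically; "O2" keeps the model in "almost FP16"
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(8, 16).cuda()
loss = model(x).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:  # scale the loss for FP16-safe backward
    scaled_loss.backward()
optimizer.step()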

Requirements

  • Install the python3 environment
    pip3 install -r requirements.txt
  • Download raw images from the corresponding websites
  • Download the json files we provided, which contain image read paths, captions, and/or bbox annotations
  • If running pre-training scripts, also prepare the items marked with % below
  • Organize these files like this (% marks items needed for pre-training only; a quick sanity-check sketch follows the tree):
X-VLM/
    data/
        finetune/
            refcoco+/*.json
            *.json
        
        %pretrain_4m/*.json
        %swin_base_patch4_window7_224_22k.pth
        %bert-base-uncased/
            config.json
            pytorch_model.bin
            tokenizer_config.json
            tokenizer.json
            vocab.txt

    images/
        coco/
            train2014/*.jpg
            val2014/*.jpg
            test2015/*.jpg
        
        visualgenome/
            image/*.jpg
        
        nlvr2/
            images/
                train/0-99/*.png
            dev/*.png
            test1/*.png
        
        %sbu/*.jpg
        %cc-3m/*.jpg
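
A quick way to sanity-check this layout (a hypothetical helper, not part of the repo; the paths mirror the tree above and assume you run it from the X-VLM/ root):

# check_layout.py -- hypothetical helper: verify the expected data/image directories exist
import os

REQUIRED = [
    "data/finetune",
    "images/coco/train2014",
    "images/coco/val2014",
    "images/coco/test2015",
    "images/visualgenome/image",
    "images/nlvr2/images/train",
]
PRETRAIN_ONLY = [  # the % entries above; only needed when pre-training
    "data/pretrain_4m",
    "data/swin_base_patch4_window7_224_22k.pth",
    "data/bert-base-uncased",
    "images/sbu",
    "images/cc-3m",
]

for path in REQUIRED + PRETRAIN_ONLY:
    print(("ok      " if os.path.exists(path) else "MISSING ") + path)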

Pretrain

python3 run.py --task "pretrain_4m_base" --dist "1" --output_dir "output/pretrain_4m_base"

For distributed training across nodes, see run.py for more details. To allow a fair comparison with some recent works, we pre-trained X-VLM (4M/16M) for 200K steps.

Data


🌟 UPDATE: our multi-lingual multi-modal project Cross-View Language Modeling has released the text of COCO+VG+SBU+CC3M and the object and region annotations in six languages. You can use the English text for X-VLM pre-training.


All datasets we utilized are publicly available, but we cannot re-distribute the data, so please prepare the pre-training data yourself. Here, we provide some data examples; read ImageTextJsonDataset & RegionTextJsonDataset in dataset/pretrain_dataset.py for details. A small sketch for producing such records follows the examples.

# image-caption pairs, providing 'binary' or 'image_rpath'
{'caption': 'dog on bike in harajuku', 
 'binary': binary_encoding_of_the_image, 
 'image_rpath': local_rpath_of_the_image
}


# object/region annotations, providing 'binary' or 'image_rpath' 
{'elems': [{'caption': 'lady sitting at table that has pizza on it',  # str or list of str  
            'bb': [155, 0, 205, 131]   # (x, y, w, h)
            }, 
           {'caption': 'window',  
            'attributes': 'closed',  # str or list of str 
            'bb': [20, 130, 335, 185]
            },
          ],
 'caption': if_exist,  # str or list of str 
 'binary': binary_encoding_of_the_image, 
 'image_rpath': local_rpath_of_the_image
}
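
As an illustration only, here is a minimal sketch of producing records in this shape (the script name, the example image path, and the base64 choice for 'binary' are assumptions; check dataset/pretrain_dataset.py for the exact encoding the loaders expect):

# make_pretrain_examples.py -- hypothetical sketch; not part of the repo
import base64
import json

def image_caption_entry(image_rpath, caption, inline_binary=False):
    """One image-caption record, providing either 'binary' or 'image_rpath'."""
    entry = {"caption": caption}
    if inline_binary:
        with open(image_rpath, "rb") as f:
            # assumption: base64 text; verify against the dataset code
            entry["binary"] = base64.b64encode(f.read()).decode("utf-8")
    else:
        entry["image_rpath"] = image_rpath
    return entry

def region_entry(image_rpath, elems, caption=None):
    """One object/region record; each elem has a 'caption' and a 'bb' = (x, y, w, h)."""
    entry = {"elems": elems, "image_rpath": image_rpath}
    if caption is not None:
        entry["caption"] = caption
    return entry

with open("pretrain_examples.json", "w") as f:
    record = image_caption_entry("images/coco/train2014/example.jpg",
                                 "dog on bike in harajuku")
    f.write(json.dumps(record) + "\n")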

Checkpoints

X-VLM (4M, 200K steps)
X-VLM (16M, 200K steps)

Finetune

Datasets for finetuning and checkpoints of X-VLM (4M/16M) can be downloaded from the following links.

Data

download json files

Checkpoints and Logs (16M)

retrieval-mscoco
retrieval-flickr
vqa
nlvr2
refcoco
refcoco-weak
captioning-coco

Checkpoints and Logs (4M)

4m-all-ft-ckpts.tar

Examples

# train
python3 run.py --task "vqa" --dist "1" --output_dir "output/vqa" --checkpoint "4m_base_model_state_step_199999.th"

# train: if using >2 nodes for fine-tuning, specify --output_hdfs to save some tmp results; it is only required by vqa & refcoco 
python3 run.py --task "vqa" --dist "all" --output_dir "output/vqa" --output_hdfs "hdfs://xxx/vqa_tmp" --checkpoint "4m_base_model_state_step_199999.th"  

# evaluate
python3 run.py --task "vqa" --dist "1" --evaluate --output_dir "output/vqa_eval" --checkpoint "4m_base_finetune/vqa/model_state_epoch_9.th"

Specify "--task" to finetune on image-text retrieval, nlvr2, visual grounding, or image captioning. See run.py for details.

More Examples of Captioning:

# adapt cross-modal encoder + MLM head -> lm decoder; subsequent fine-tuning is included   
python3 run.py --task "coco_capt_domain" --dist "1" --output_dir "output/coco_capt_domain" --checkpoint "4m_base_model_state_step_199999.th"

# fine-tune only; evaluate is included 
python3 run.py --task "coco_captioning" --dist "1" --output_dir "output/coco_captioning" --checkpoint "4m_base_finetune/coco_caption/lm_domain_pretrain.th"
# evaluate only
python3 run.py --task "coco_captioning" --dist "1" --output_dir "output/coco_captioning" --evaluate --checkpoint "4m_base_finetune/coco_caption/coco_capt_ft_epoch_4.th"

# further CIDEr optimization; evaluate is included 
python3 run.py --task "coco_captioning_scst" --dist "1" --output_dir "output/coco_captioning_scst" --checkpoint "4m_base_finetune/coco_caption/coco_capt_ft_epoch_4.th"
# evaluate only
python3 run.py --task "coco_captioning" --dist "1" --output_dir "output/coco_captioning_scst" --evaluate --checkpoint "4m_base_finetune/coco_caption/coco_capt_cider_step_41000.th"

To make a fair comparison, we follow previous works for fine-tuning, so some scripts are based on ALBEF, OSCAR, and BLIP. We thank the authors for open-sourcing their code.

Evaluation on VLUE

VLUE is a new out-of-distribution (OOD) benchmark for evaluating vision-language models; it has been accepted by ICML 2022.

python3 run.py --task "eval_vlue_itr" --dist "1" --evaluate  --output_dir "output/" --checkpoint "itr_coco/checkpoint_9.pth"

python3 run.py --task "eval_vlue_vqa" --dist "1" --evaluate  --output_dir "output/" --checkpoint "vqa/model_state_epoch_9.th"

python3 run.py --task "eval_vlue_nlvr" --dist "1" --evaluate  --output_dir "output/" --checkpoint "nlvr/nlvr_ft/checkpoint_best.pth"

python3 run.py --task "eval_vlue_refcoco" --dist "1" --evaluate  --output_dir "output/" --checkpoint "refcoco_bbox/checkpoint_best.pth"

python3 run.py --task "eval_vlue_refcoco_weakly" --dist "1" --evaluate  --output_dir "output/" --checkpoint "refcoco/checkpoint_best.pth"

Citation

If you find this repository useful, please consider giving it a ⭐ or citing:

@article{xvlm,
  title={Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts},
  author={Zeng, Yan and Zhang, Xinsong and Li, Hang},
  journal={arXiv preprint arXiv:2111.08276},
  year={2021}
}

Contact

For issues using this code, please submit a GitHub issue.
