This is a modified implementation of "Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts" (X-VLM), adapted for text-image retrieval, in particular on the text-based person re-identification dataset RSTPReid. The model is evaluated on three datasets: MSCOCO, Flickr30k, and RSTPReid. This documentation covers how to evaluate the model on all three datasets and how to fine-tune it on RSTPReid.
Set up the conda environment
conda env create -f environment.yml
conda activate xvlm-
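To quickly confirm the environment is usable, a minimal check (assuming only that PyTorch is installed by environment.yml, which run.py requires):

# Minimal environment check: prints the PyTorch version and whether a GPU is visible.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))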
Flickr-30k

Zero-shot
mkdir -p output/itr_eval_flickr
python3 run.py --task "itr_flickr" --dist "1" --evaluate --load_pretrained --output_dir "output/itr_eval_flickr" --checkpoint "checkpoints/16m_base_model_state_step_199999.th" --config_override 'configs/itr_flickr/config_zeroshot.yaml' > output/itr_eval_flickr/output.txt

Fine-tuned
python3 run.py --task "itr_flickr" --dist "1" --evaluate --output_dir "output/itr_eval_flickr_finetuned" --checkpoint "checkpoints/itr_flickr/checkpoint_best.pth"

MSCOCO-5k

Zero-shot
mkdir -p output/itr_eval_mscoco_zeroshot
python3 run.py --task "itr_coco" --dist "1" --evaluate --output_dir "output/itr_eval_mscoco_zeroshot" --checkpoint "checkpoints/16m_base_model_state_step_199999.th" --config_override 'checkpoints/itr_coco/config_zeroshot.yaml' --load_pretrained > output/itr_eval_mscoco_zeroshot/output.txt

Fine-tuned
python3 run.py --task "itr_coco" --dist "1" --evaluate --output_dir "output/itr_eval_mscoco_finetuned" --checkpoint "checkpoints/itr_coco/checkpoint_9.pth" --config_override "checkpoints/itr_coco/config.yaml"

RSTPReid

Fully zero-shot
mkdir -p output/itr_rstpreid_zeroshot
python3 run.py --task "itr_rstpreid" --dist "1" --evaluate --load_pretrained --output_dir "output/itr_rstpreid_zeroshot" --checkpoint "checkpoints/16m_base_model_state_step_199999.th" --config_override 'configs/itr_rstpreid/config_zeroshot.yaml' > output/itr_rstpreid_zeroshot/output.txt

Fine-tuned on Flickr30k
mkdir -p output/itr_rstpreid_zeroshot_flickr30k
python3 run.py --task "itr_rstpreid" --dist "1" --evaluate --output_dir "output/itr_rstpreid_zeroshot_flickr30k" --checkpoint "checkpoints/itr_flickr/checkpoint_best.pth" --config_override 'configs/itr_rstpreid/config.yaml' > output/itr_rstpreid_zeroshot_flickr30k/output.txt

Fine-tuned on MSCOCO
mkdir -p output/itr_rstpreid_zeroshot_mscoco
python3 run.py --task "itr_rstpreid" --dist "1" --evaluate --output_dir "output/itr_rstpreid_zeroshot_mscoco" --checkpoint "checkpoints/itr_coco/checkpoint_9.pth" --config_override 'configs/itr_rstpreid/config.yaml' > output/itr_rstpreid_zeroshot_mscoco/output.txt

Fine-tuned on RSTPReid (from zero-shot)
mkdir -p output/itr_rstpreid_finetune_from_zeroshot
python3 run.py --task "itr_rstpreid" --dist "f4" --output_dir "output/itr_rstpreid_finetune_from_zeroshot" --load_pretrained --checkpoint "checkpoints/16m_base_model_state_step_199999.th" --config_override 'configs/itr_rstpreid/config_zeroshot.yaml' --master_port_override 12347 > output/itr_rstpreid_finetune_from_zeroshot/output.txt

Fine-tuned on RSTPReid (from COCO)
mkdir -p output/itr_rstpreid_finetune_from_coco
python3 run.py --task "itr_rstpreid" --dist "f4" --output_dir "output/itr_rstpreid_finetune_from_coco" --checkpoint "checkpoints/itr_coco/checkpoint_9.pth" --config_override 'configs/itr_rstpreid/config.yaml' --master_port_override 12346 > output/itr_rstpreid_finetune_from_coco/output3.txt
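The itr_rstpreid configs above expect annotation json files analogous to the Flickr30k/COCO retrieval jsons. If you need to generate them yourself, the sketch below shows one possible conversion. It assumes RSTPReid's data_captions.json with fields 'img_path', 'captions' and 'split', plus hypothetical output paths; check dataset/ and configs/itr_rstpreid/ for the exact keys your setup reads.

# Sketch (assumption-heavy): convert RSTPReid annotations into ALBEF/X-VLM style
# retrieval jsons. Input/output paths and field names are illustrative only.
import json

with open("data/rstpreid/data_captions.json") as f:  # hypothetical input path
    records = json.load(f)

train, test = [], []
for r in records:
    image = r["img_path"]
    if r["split"] == "train":
        # training entries: one (image, caption) pair per caption
        for cap in r["captions"]:
            train.append({"image": image, "caption": cap})
    elif r["split"] == "test":
        # test entries: one record per image with the list of its captions
        test.append({"image": image, "caption": r["captions"]})

with open("data/finetune/rstpreid_train.json", "w") as f:
    json.dump(train, f)
with open("data/finetune/rstpreid_test.json", "w") as f:
    json.dump(test, f)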
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts. Yan Zeng, Xinsong Zhang, Hang Li. arXiv 2021.
- May 2022: The paper has been accepted by ICML 2022
- Jan 2022: Release official PyTorch implementation and X-VLM checkpoints
- Nov 2021: Release preprint in arXiv
X-VLM (216M parameters): swin-base vision encoder + 6-layer text encoder + 6-layer cross-modal encoder.
We are looking for interns / FTEs at ByteDance AI-LAB (in Beijing / Shanghai)! If you are interested in working with us on vision language models, please send your resume to zhangxinsong.0320@bytedance.com.
- Support several backbones
- vision encoder: deit / clip-vit / swin-transformer
- text encoder: bert / roberta
- Support apex O1 / O2 for pre-training
- Read from and write to HDFS
- Distributed training across nodes for both pre-training and fine-tuning
Please read the code for more details.
- Install python3 environment
pip3 install -r requirements.txt
- Download raw images from corresponding websites
- Download the json files we provided, which contain image read paths and captions and/or bbox annotations
- If running pre-training scripts:
- install Apex
- download pre-trained models for parameter initialization
- image encoder: clip-vit-base / swin-transformer-base
- text encoder: bert-base
- Organize these files like this (% is for pre-training only):
X-VLM/
    data/
        finetune/
            refcoco+/*.json
            *.json
        %pretrain_4m/*.json
        %swin_base_patch4_window7_224_22k.pth
        %bert-base-uncased/
            config.json
            pytorch_model.bin
            tokenizer_config.json
            tokenizer.json
            vocab.txt
    images/
        coco/
            train2014/*.jpg
            val2014/*.jpg
            test2015/*.jpg
        visualgenome/
            image/*.jpg
        nlvr2/
            images/
                train/0-99/*.png
                dev/*.png
                test1/*.png
        %sbu/*.jpg
        %cc-3m/*.jpg
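Before launching pre-training, a quick sanity check of the layout can save time (a sketch; it only probes a few of the paths shown above):

# Sketch: check that a few of the expected files/directories from the layout above exist.
# %-prefixed entries in the tree are needed for pre-training only.
from pathlib import Path

root = Path("X-VLM")
expected = [
    "data/finetune",
    "data/bert-base-uncased/vocab.txt",
    "data/swin_base_patch4_window7_224_22k.pth",
    "images/coco/train2014",
    "images/visualgenome/image",
]
for rel in expected:
    p = root / rel
    print(("OK      " if p.exists() else "MISSING ") + str(p))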
python3 run.py --task "pretrain_4m_base" --dist "1" --output_dir "output/pretrain_4m_base"
For distributed training across nodes, see run.py for more details. To make a fair comparison with some recent works, we pre-trained X-VLM (4M/16M) for 200K steps.
🌟UPDATE: our multi-lingual multi-modal project Cross-View Language Modeling has released the text of COCO+VG+SBU+CC3M and the object and region annotations in six languages. You can use the English text for X-VLM pre-training.
All datasets we utilized are publicly available, but we cannot re-distribute the data, so please prepare the pre-training data yourself. Here, we provide some data examples; see ImageTextJsonDataset and RegionTextJsonDataset in dataset/pretrain_dataset.py for details.
# image-caption pairs; provide either 'binary' or 'image_rpath'
{'caption': 'dog on bike in harajuku',
 'binary': binary_encoding_of_the_image,
 'image_rpath': local_rpath_of_the_image
}
# object/region annotations; provide either 'binary' or 'image_rpath'
{'elems': [{'caption': 'lady sitting at table that has pizza on it',  # str or list of str
            'bb': [155, 0, 205, 131]  # (x, y, w, h)
            },
           {'caption': 'window',
            'attributes': 'closed',  # str or list of str
            'bb': [20, 130, 335, 185]
            },
           ],
 'caption': if_exist,  # str or list of str
 'binary': binary_encoding_of_the_image,
 'image_rpath': local_rpath_of_the_image
}
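For reference, a minimal sketch of writing image-caption records in this format (an assumption-heavy example: it uses 'image_rpath' rather than an inline 'binary' and writes one json object per line; check dataset/pretrain_dataset.py for what your configuration actually reads):

# Sketch: write image-caption pre-training records, one json object per line.
# The example pair and the output path are illustrative only.
import json

pairs = [
    ("images/coco/train2014/COCO_train2014_000000000009.jpg",
     "dog on bike in harajuku"),
]

with open("data/pretrain_4m/coco_example.json", "w") as f:
    for rpath, caption in pairs:
        f.write(json.dumps({"caption": caption, "image_rpath": rpath}) + "\n")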
Pre-trained checkpoints:
- X-VLM (4M, 200K steps)
- X-VLM (16M, 200K steps)
Datasets for fine-tuning and checkpoints of X-VLM (4M/16M) can be downloaded at the following links:
- retrieval-mscoco
- retrieval-flickr
- vqa
- nlvr2
- refcoco
- refcoco-weak
- captioning-coco
# train
python3 run.py --task "vqa" --dist "1" --output_dir "output/vqa" --checkpoint "4m_base_model_state_step_199999.th"
# train: if using >2 nodes for fine-tuning, specify --output_hdfs to save some tmp results; it is only required by vqa & refcoco
python3 run.py --task "vqa" --dist "all" --output_dir "output/vqa" --output_hdfs "hdfs://xxx/vqa_tmp" --checkpoint "4m_base_model_state_step_199999.th"
# evaluate
python3 run.py --task "vqa" --dist "1" --evaluate --output_dir "output/vqa_eval" --checkpoint "4m_base_finetune/vqa/model_state_epoch_9.th"
Specify "--task" to finetune on image-text retrieval, nlvr2, visual grounding, or image captioning. See run.py for details.
# adapt cross-modal encoder + MLM head -> lm decoder; subsequent fine-tuning is included
python3 run.py --task "coco_capt_domain" --dist "1" --output_dir "output/coco_capt_domain" --checkpoint "4m_base_model_state_step_199999.th"
# fine-tune only; evaluate is included
python3 run.py --task "coco_captioning" --dist "1" --output_dir "output/coco_captioning" --checkpoint "4m_base_finetune/coco_caption/lm_domain_pretrain.th"
# evaluate only
python3 run.py --task "coco_captioning" --dist "1" --output_dir "output/coco_captioning" --evaluate --checkpoint "4m_base_finetune/coco_caption/coco_capt_ft_epoch_4.th"
# further CIDEr optimization; evaluate is included
python3 run.py --task "coco_captioning_scst" --dist "1" --output_dir "output/coco_captioning_scst" --checkpoint "4m_base_finetune/coco_caption/coco_capt_ft_epoch_4.th"
# evaluate only
python3 run.py --task "coco_captioning" --dist "1" --output_dir "output/coco_captioning_scst" --evaluate --checkpoint "4m_base_finetune/coco_caption/coco_capt_cider_step_41000.th"
To make a fair comparison, we follow previous works for fine-tuning, so some scripts are based on ALBEF, OSCAR, and BLIP. We thank the authors for open-sourcing their code.
VLUE is a new out-of-distribution (OOD) benchmark for evaluating vision-language models; it has been accepted by ICML 2022.
python3 run.py --task "eval_vlue_itr" --dist "1" --evaluate --output_dir "output/" --checkpoint "itr_coco/checkpoint_9.pth"
python3 run.py --task "eval_vlue_vqa" --dist "1" --evaluate --output_dir "output/" --checkpoint "vqa/model_state_epoch_9.th"
python3 run.py --task "eval_vlue_nlvr" --dist "1" --evaluate --output_dir "output/" --checkpoint "nlvr/nlvr_ft/checkpoint_best.pth"
python3 run.py --task "eval_vlue_refcoco" --dist "1" --evaluate --output_dir "output/" --checkpoint "refcoco_bbox/checkpoint_best.pth"
python3 run.py --task "eval_vlue_refcoco_weakly" --dist "1" --evaluate --output_dir "output/" --checkpoint "refcoco/checkpoint_best.pth"
If you find this repository useful, please consider giving a ⭐ or citing:
@article{xvlm,
title={Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts},
author={Zeng, Yan and Zhang, Xinsong and Li, Hang},
journal={arXiv preprint arXiv:2111.08276},
year={2021}
}
For issues using this code, please submit a GitHub issue.


