This is the implementation of the framework described in the paper:
Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki and Desmond Elliott. Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs. arXiv preprint arXiv:2011.15124, November 2020.
We provide the code for reproducing our results, as well as preprocessed data and pretrained models.
You can clone this repository with its submodules included by issuing:
git clone --recurse-submodules git@github.com:e-bug/volta
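If you have already cloned the repository without submodules, you can fetch them afterwards with:
git submodule update --init --recursive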
1. Create a fresh conda environment, and install all dependencies.
conda create -n volta python=3.6
conda activate volta
pip install -r requirements.txt
2. Install PyTorch
conda install pytorch=1.4.0 torchvision=0.5 cudatoolkit=10.1 -c pytorch
3. Install apex. If you use a cluster, you may want to first run commands like the following:
module load cuda/10.1.105
module load gcc/8.3.0-cuda
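apex is installed from source; a minimal sketch following NVIDIA's instructions (the exact flags may differ across apex versions and systems):
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./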
4. Set up the refer submodule for Referring Expression Comprehension:
cd tools/refer; make
5. Install this codebase as a package in this environment.
python setup.py develop
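As a quick sanity check (assuming the codebase is installed under the package name volta), the following should run without errors in the activated environment:
python -c "import volta"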
Check out data/README.md for links to preprocessed data and data preparation steps.
Check out MODELS.md for links to pretrained models and how to define new ones in VOLTA.
Model configuration files are stored in config/.
We provide sample scripts to train (i.e. pretrain or fine-tune) and evaluate models in examples/. These include ViLBERT, LXMERT and VL-BERT as detailed in the original papers, as well as ViLBERT, LXMERT, VL-BERT, VisualBERT and UNITER as specified in our controlled study.
Task configuration files are stored in config_tasks/.
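As an illustration only, a fine-tuning run combines a model configuration, a task configuration and a pretrained checkpoint. The script name, flags and placeholder paths below are assumptions; check the scripts in examples/ for the exact interface:
python train_task.py \
  --config_file config/<model_config>.json \
  --from_pretrained <pretrained_checkpoint> \
  --tasks_config_file config_tasks/<tasks_config>.yml \
  --task <task_id>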
This work is licensed under the MIT license. See LICENSE for details.
Third-party software and data sets are subject to their respective licenses.
If you find our code/data/models or ideas useful in your research, please consider citing the paper:
@article{bugliarello-etal-2020-multimodal,
title = "Multimodal Pretraining Unmasked: {U}nifying the Vision and Language {BERT}s",
author = "Bugliarello, Emanuele and
Cotterell, Ryan and
Okazaki, Naoaki and
Elliott, Desmond",
journal = "arXiv preprint arXiv:2011.15124"
year = "2020",
url = "https://arxiv.org/abs/2011.15124",
}
Our codebase heavily relies on these excellent repositories: