# Bag of Concepts (BoC)

Bag of Concepts (BoC) is a novel approach to multimodal generation that learns a shared discrete concept space between images and text through vector quantization, enabling bidirectional cross-modal generation.
We present Bag of Concepts (BoC), a multimodal architecture that learns a discrete codebook of visual and textual concepts for cross-modal generation. Unlike continuous latent space methods, BoC uses vector quantization (VQ) to create an interpretable, discrete representation that bridges vision and language. Our approach consists of:
- Image Encoding: Vision Transformer (ViT) → VQ Codebook
- Text Encoding: Transformer → VQ Codebook (shared)
- Cross-Modal Alignment: InfoNCE contrastive loss
- Bidirectional Generation: VQ → VAE Decoder (images) or Transformer Decoder (text)
Key Features:
- **Discrete Concept Space**: interpretable codebook of 512-1024 concepts
- **Bidirectional Generation**: Text→Image and Image→Text
- **Codebook Collapse Mitigation**: four complementary strategies (EMA updates, entropy loss, dead-code reset, commitment loss)
- **Multi-Phase Training**: curriculum learning for stable concept formation
## Table of Contents

- Installation
- Quick Start
- Dataset Preparation
- Training
- Evaluation
- Inference
- Architecture
- Experimental Results
- Citation
## Installation

### Option 1: Automated setup

```bash
# Clone repository
git clone https://github.com/Vixel2006/BoC.git
cd BoC

# Run automated setup
chmod +x setup.sh
./setup.sh

# Download datasets (choose one or both)
python scripts/download_datasets.py --dataset coco --output-dir ./data
python scripts/download_datasets.py --dataset flickr30k --output-dir ./data
```

### Option 2: Manual setup

```bash
# Clone repository
git clone https://github.com/Vixel2006/BoC.git
cd BoC

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download datasets
python scripts/download_datasets.py --dataset coco --output-dir ./data
```

Requirements:
- Python 3.8+
- JAX with CUDA support (for GPU training)
- 16GB+ GPU RAM (for base model)
- ~200GB storage (for COCO dataset)
## Quick Start

### Training

```bash
# Train on Flickr30k with base configuration
python main.py train \
    --dataset flickr30k \
    --data-root ./data/flickr30k \
    --config base \
    --output-dir ./experiments

# Train on MS COCO with large configuration
python main.py train \
    --dataset coco \
    --data-root ./data/coco \
    --config large \
    --batch-size 64
```

### Generation

```bash
# Generate image from text
python main.py generate-image \
    --checkpoint ./experiments/best/checkpoints/phase_2 \
    --text "A dog playing in a sunny park" \
    --output generated.png

# Generate caption from image
python main.py generate-text \
    --checkpoint ./experiments/best/checkpoints/phase_2 \
    --image ./test_image.jpg \
    --temperature 0.9
```

## Dataset Preparation

```bash
# Download MS COCO (automatic)
python scripts/download_datasets.py --dataset coco --output-dir ./data
# Setup Flickr30k (semi-automatic - requires Kaggle account)
python scripts/download_datasets.py --dataset flickr30k --output-dir ./data
# Download both
python scripts/download_datasets.py --all --output-dir ./data
# Verify datasets
python scripts/download_datasets.py --verify-only --dataset all --output-dir ./data
```

The script will:
- **MS COCO**: automatically downloads images and annotations (~25GB)
- **Flickr30k**: guides you through the download (requires a Kaggle account or manual download)
### Flickr30k

To set up Flickr30k manually:

- Download the Flickr30k images and captions
- Organize them as:
```text
data/flickr30k/
├── flickr30k_images/
│   ├── 1000092795.jpg
│   └── ...
└── flickr30k_annotations/
    ├── train.json
    ├── val.json
    └── test.json
```
Annotation format:

```json
[
  {
    "image_id": "1000092795.jpg",
    "captions": [
      "Two young guys with shaggy hair look at their hands...",
      "Two young, White males are outside near many bushes.",
      ...
    ]
  }
]
```
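For reference, here is a minimal sketch of reading this annotation format in Python (the path follows the layout above):

```python
import json

# Load a Flickr30k-style annotation file (path assumed from the layout above)
with open("data/flickr30k/flickr30k_annotations/train.json") as f:
    annotations = json.load(f)

# Each entry pairs one image file with its list of reference captions
for entry in annotations[:3]:
    print(entry["image_id"], "->", len(entry["captions"]), "captions")
```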
### MS COCO

To set up MS COCO manually (or use the automatic download above):

- Download the COCO 2017 dataset
- Organize it as:
```text
data/coco/
├── train2017/
├── val2017/
└── annotations/
    ├── captions_train2017.json
    └── captions_val2017.json
```
Uses official COCO format (no conversion needed).
## Training

BoC uses a progressive training strategy:

### Phase 1

Trains the ViT → VQ → VAE pipeline to:
- Establish stable codebook
- Learn image reconstruction
- Prevent codebook collapse
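As a rough illustration of the quantization step at the core of this phase, here is a minimal straight-through VQ sketch in JAX-flavored Python; the function name, shapes, and return values are assumptions for illustration, not the repository's API:

```python
import jax.numpy as jnp
from jax import lax

def quantize(z_e, codebook):
    """Map encoder outputs z_e [N, D] to their nearest codebook entries [K, D]."""
    # Squared Euclidean distance between every encoding and every code
    d = jnp.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    indices = jnp.argmin(d, axis=-1)
    z_q = codebook[indices]
    # Straight-through estimator: forward pass uses z_q, gradients flow to z_e
    z_q_st = z_e + lax.stop_gradient(z_q - z_e)
    # Commitment loss pulls encoder outputs toward their assigned codes
    commitment = jnp.mean((z_e - lax.stop_gradient(z_q)) ** 2)
    return z_q_st, indices, commitment
```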
```bash
python main.py train --dataset flickr30k --data-root ./data/flickr30k --phase 1
```

### Phase 2

Trains the text encoder and decoder with:
- Text autoencoder (Text → VQ → Text)
- InfoNCE alignment loss
- Shared concept space learning
```bash
python main.py train --dataset flickr30k --data-root ./data/flickr30k --phase 2 \
    --resume-checkpoint ./experiments/.../checkpoints/phase_1/ckpt_50000
```

### Phase 3

End-to-end optimization of all components.
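As a sketch, an end-to-end objective for this phase could combine the per-component losses as a weighted sum; the weights below are illustrative placeholders (only the commitment and entropy weights echo the defaults listed under Architecture), not the repository's actual configuration:

```python
def phase3_loss(losses, w_img=1.0, w_txt=1.0, w_vq=0.25, w_align=0.5, w_ent=0.1):
    """Hypothetical weighted sum of the per-component losses."""
    return (w_img * losses["image_recon"]    # VAE reconstruction (e.g. MSE)
            + w_txt * losses["text_recon"]   # autoregressive cross-entropy
            + w_vq * losses["commitment"]    # VQ commitment term
            + w_align * losses["infonce"]    # cross-modal alignment
            + w_ent * losses["entropy"])     # codebook entropy regularizer
```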
```bash
python main.py train --dataset flickr30k --data-root ./data/flickr30k --phase 3
```

### Model Configurations

| Config | Embed Dim | Codebook Size | Layers | Params | GPU Memory |
|---|---|---|---|---|---|
| Small | 256 | 256 | 4 | ~20M | 8GB |
| Base | 384 | 512 | 6 | ~50M | 16GB |
| Large | 768 | 1024 | 12 | ~200M | 32GB |
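The table maps naturally onto a small config object; a hypothetical sketch (the field names are illustrative, not the repository's config schema):

```python
from dataclasses import dataclass

@dataclass
class BoCConfig:
    embed_dim: int
    codebook_size: int
    num_layers: int

# Values taken from the table above
CONFIGS = {
    "small": BoCConfig(embed_dim=256, codebook_size=256, num_layers=4),
    "base":  BoCConfig(embed_dim=384, codebook_size=512, num_layers=6),
    "large": BoCConfig(embed_dim=768, codebook_size=1024, num_layers=12),
}
```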
## Evaluation

```bash
python main.py eval \
    --dataset coco \
    --data-root ./data/coco \
    --checkpoint ./experiments/best/checkpoints/phase_2 \
    --split test \
    --output-file results.json
```

Metrics:
- Image Reconstruction: PSNR, SSIM
- Text Reconstruction: Perplexity, BLEU
- Cross-Modal Retrieval: Recall@1, Recall@5, Recall@10
- Image Quality: FID, IS
- Codebook Usage: Perplexity, Active Codes Ratio
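For instance, Recall@K for cross-modal retrieval can be computed from a query-candidate similarity matrix; a minimal sketch, assuming query i's true match is candidate i:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """sim[i, j] = similarity of query i to candidate j; ground truth is j == i."""
    # Rank candidates for each query, highest similarity first
    ranks = np.argsort(-sim, axis=1)
    # A hit means the true match appears among the top-k candidates
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return float(hits.mean())
```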
## Inference

### Text → Image

```python
from src.models import BoCModel
from src.data import SimpleTokenizer

# Load model and tokenizer (constructor arguments elided)
model = BoCModel(...)
tokenizer = SimpleTokenizer.load("tokenizer.pkl")

# Generate an image from a caption
text = "A beautiful sunset over mountains"
tokens = tokenizer.encode(text)
image = model.text_to_image(tokens)
```

### Image → Text

```python
from src.utils import load_image

# Load image
image = load_image("photo.jpg")

# Generate a caption
caption = model.image_to_text(image, max_length=128)
print(tokenizer.decode(caption))
```

## Architecture

```mermaid
graph TB
A[Images] -->|ViT Encoder| B[VQ Codebook<br/>512 Concepts]
C[Text] -->|Transformer Encoder| B
B -->|VAE Decoder| D[Generated Images]
B -->|Transformer Decoder| E[Generated Text]
A -.->|InfoNCE Loss| C
style B fill:#ff6b6b
style A fill:#4ecdc4
style C fill:#4ecdc4
style D fill:#95e1d3
style E fill:#95e1d3
```
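The InfoNCE edge in the diagram corresponds to a symmetric contrastive loss over matched image/text embeddings. A minimal sketch, assuming L2-normalized embeddings and the optax library (an assumption about the stack, not a confirmed dependency):

```python
import jax.numpy as jnp
import optax

def infonce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (image, text) pairs."""
    logits = img_emb @ txt_emb.T / temperature  # [B, B] similarity matrix
    labels = jnp.arange(logits.shape[0])        # pair i matches pair i
    loss_i2t = optax.softmax_cross_entropy_with_integer_labels(logits, labels)
    loss_t2i = optax.softmax_cross_entropy_with_integer_labels(logits.T, labels)
    return (loss_i2t.mean() + loss_t2i.mean()) / 2
```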
Vector Quantization Layer:

- Codebook size: 512 (base), 1024 (large)
- EMA-based updates (decay = 0.99)
- Codebook collapse mitigation (sketched below):
  - Dead-code reset (threshold = 0.01)
  - Entropy regularization (weight = 0.1)
  - Commitment loss (weight = 0.25)
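A minimal sketch of the EMA update and dead-code reset (the hyperparameters mirror the values above; the function names and state layout are illustrative):

```python
import jax
import jax.numpy as jnp

def ema_update(cluster_size, code_sum, z_e, one_hot, num_codes, decay=0.99, eps=1e-5):
    """VQ-VAE-style EMA codebook update; one_hot[i, k] = 1 if z_e[i] was assigned code k."""
    counts = one_hot.sum(axis=0)                              # per-code usage this batch
    cluster_size = decay * cluster_size + (1 - decay) * counts
    code_sum = decay * code_sum + (1 - decay) * (one_hot.T @ z_e)
    # Laplace smoothing keeps rarely used codes numerically stable
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + num_codes * eps) * n
    codebook = code_sum / smoothed[:, None]
    return codebook, cluster_size, code_sum

def reset_dead_codes(codebook, cluster_size, z_e, key, threshold=0.01):
    """Reassign codes whose EMA usage fell below the threshold to random encoder outputs."""
    dead = cluster_size < threshold
    idx = jax.random.randint(key, (codebook.shape[0],), 0, z_e.shape[0])
    return jnp.where(dead[:, None], z_e[idx], codebook)
```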
Vision Encoder:
- ViT with 16×16 patches
- 6-12 transformer layers
- 384-768 dimensional embeddings
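Patch extraction for the ViT front end amounts to a reshape; a minimal sketch for 16×16 patches (shapes assumed NHWC):

```python
import jax.numpy as jnp

def patchify(images, patch=16):
    """[B, H, W, C] -> [B, (H/P)*(W/P), P*P*C] sequence of flattened patches."""
    b, h, w, c = images.shape
    x = images.reshape(b, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # group each patch's rows and columns together
    return x.reshape(b, (h // patch) * (w // patch), patch * patch * c)
```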
Text Encoder:
- Standard transformer encoder
- Masked attention for padding
- Learned positional embeddings
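The padding mask can be built directly from the token IDs; a sketch, assuming pad ID 0:

```python
import jax.numpy as jnp

def padding_mask(token_ids, pad_id=0):
    """[B, T] token IDs -> [B, 1, 1, T] attention mask; 1 = attend, 0 = ignore."""
    mask = (token_ids != pad_id).astype(jnp.float32)
    return mask[:, None, None, :]  # broadcast over heads and query positions
```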
Decoders:
- VAE: transposed convolutions with residual blocks
- Text: Autoregressive transformer decoder
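A hypothetical Flax-style sketch of one such decoder stage, pairing a 2x transpose-conv upsample with a residual refinement block (the repository's actual decoder layout may differ):

```python
import flax.linen as nn

class UpBlock(nn.Module):
    """One decoder stage: 2x transpose-conv upsampling plus a residual refinement."""
    features: int

    @nn.compact
    def __call__(self, x):
        # Double the spatial resolution
        x = nn.ConvTranspose(self.features, kernel_size=(4, 4), strides=(2, 2))(x)
        # Residual block at the new resolution
        h = nn.relu(nn.Conv(self.features, kernel_size=(3, 3))(x))
        h = nn.Conv(self.features, kernel_size=(3, 3))(h)
        return nn.relu(x + h)
```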
Project structure:

```text
BoC/
├── src/
│   ├── models/           # Neural network architectures
│   ├── training/         # Training loops and losses
│   ├── data/             # Dataset loaders
│   └── utils/            # Utilities
├── main.py               # CLI interface
├── test_model.py         # Component tests
├── requirements.txt
├── README.md
└── STRUCTURE.md          # Detailed code structure
```