Bag of Concepts (BoC): Learning Discrete Concepts for Multimodal Generation


Bag of Concepts (BoC) learns a shared discrete concept space for images and text through vector quantization, enabling bidirectional cross-modal generation.

🔬 Abstract

We present Bag of Concepts (BoC), a multimodal architecture that learns a discrete codebook of visual and textual concepts for cross-modal generation. Unlike continuous latent space methods, BoC uses vector quantization (VQ) to create an interpretable, discrete representation that bridges vision and language. Our approach consists of:

  1. Image Encoding: Vision Transformer (ViT) → VQ Codebook
  2. Text Encoding: Transformer → VQ Codebook (shared)
  3. Cross-Modal Alignment: InfoNCE contrastive loss
  4. Bidirectional Generation: VQ → VAE Decoder (images) or Transformer Decoder (text)
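
To make the quantization in steps 1-2 concrete, here is a minimal nearest-neighbour VQ sketch with a straight-through estimator, in JAX; the names (quantize, codebook) are illustrative, not BoC's actual API.

import jax
import jax.numpy as jnp

def quantize(z, codebook):
    """Snap encoder outputs to their nearest codebook entries.

    z:        (num_tokens, dim) continuous encoder outputs
    codebook: (num_codes, dim) learned concept embeddings
    """
    # Squared L2 distance from every token to every code.
    dists = jnp.sum((z[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    codes = jnp.argmin(dists, axis=-1)              # discrete concept ids
    z_q = codebook[codes]                           # quantized vectors
    # Straight-through estimator: forward pass uses z_q, gradients flow to z.
    return z + jax.lax.stop_gradient(z_q - z), codes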

Key Features:

  • 🎯 Discrete Concept Space: Interpretable codebook of 512-1024 concepts
  • 🔄 Bidirectional Generation: Text→Image and Image→Text
  • 🛡️ Codebook Collapse Mitigation: 4 complementary strategies (EMA, entropy loss, code reset, commitment loss)
  • 📊 Multi-Phase Training: Curriculum learning for stable concept formation

📋 Table of Contents

  • Installation
  • Quick Start
  • Dataset Preparation
  • Training
  • Evaluation
  • Inference
  • Architecture
  • Project Structure

🚀 Installation

Quick Setup (Recommended)

# Clone repository
git clone https://github.com/Vixel2006/BoC.git
cd BoC

# Run automated setup
chmod +x setup.sh
./setup.sh

# Download datasets (choose one or both)
python scripts/download_datasets.py --dataset coco --output-dir ./data
python scripts/download_datasets.py --dataset flickr30k --output-dir ./data

Manual Setup

# Clone repository
git clone https://github.com/Vixel2006/BoC.git
cd BoC

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download datasets
python scripts/download_datasets.py --dataset coco --output-dir ./data

Requirements:

  • Python 3.8+
  • JAX with CUDA support (for GPU training; a quick check follows this list)
  • 16GB+ GPU RAM (for base model)
  • ~200GB storage (for COCO dataset)
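
A quick way to confirm JAX actually sees the GPU before launching a long run:

# Sanity check: should list an accelerator, not just a CPU device.
import jax
print(jax.devices())  # e.g. [CudaDevice(id=0)] with a working CUDA install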

⚡ Quick Start

Training

# Train on Flickr30k with base configuration
python main.py train \
    --dataset flickr30k \
    --data-root ./data/flickr30k \
    --config base \
    --output-dir ./experiments

# Train on MS COCO with large configuration
python main.py train \
    --dataset coco \
    --data-root ./data/coco \
    --config large \
    --batch-size 64

Inference

# Generate image from text
python main.py generate-image \
    --checkpoint ./experiments/best/checkpoints/phase_2 \
    --text "A dog playing in a sunny park" \
    --output generated.png

# Generate caption from image
python main.py generate-text \
    --checkpoint ./experiments/best/checkpoints/phase_2 \
    --image ./test_image.jpg \
    --temperature 0.9

📦 Dataset Preparation

Automated Download (Recommended)

# Download MS COCO (automatic)
python scripts/download_datasets.py --dataset coco --output-dir ./data

# Setup Flickr30k (semi-automatic - requires Kaggle account)
python scripts/download_datasets.py --dataset flickr30k --output-dir ./data

# Download both
python scripts/download_datasets.py --all --output-dir ./data

# Verify datasets
python scripts/download_datasets.py --verify-only --dataset all --output-dir ./data

The script will:

  • ✅ MS COCO: automatically download images and annotations (~25GB)
  • ⚠️ Flickr30k: guide you through the download (requires a Kaggle account or manual download)

Manual Setup

Flickr30k
  1. Download Flickr30k images and captions
  2. Organize as:
data/flickr30k/
├── flickr30k_images/
│   ├── 1000092795.jpg
│   └── ...
└── flickr30k_annotations/
    ├── train.json
    ├── val.json
    └── test.json

Annotation format:

[
  {
    "image_id": "1000092795.jpg",
    "captions": [
      "Two young guys with shaggy hair look at their hands...",
      "Two young, White males are outside near many bushes.",
      ...
    ]
  }
]
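
For illustration, a small loader for this layout; paths and field names follow the structure above, but the project's own dataset class may differ.

import json
from pathlib import Path

def load_flickr30k_pairs(root, split="train"):
    """Yield (image_path, caption) pairs from the annotation files above."""
    root = Path(root)
    records = json.loads((root / "flickr30k_annotations" / f"{split}.json").read_text())
    for rec in records:
        image_path = root / "flickr30k_images" / rec["image_id"]
        for caption in rec["captions"]:
            yield image_path, caption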

MS COCO

  1. Download COCO 2017 dataset
  2. Organize as:
data/coco/
├── train2017/
├── val2017/
└── annotations/
    ├── captions_train2017.json
    └── captions_val2017.json

Uses official COCO format (no conversion needed).
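
Because the files are in the official format, the standard pycocotools API can read them directly; this snippet is illustrative and assumes pycocotools is installed.

from pycocotools.coco import COCO

coco = COCO("data/coco/annotations/captions_train2017.json")
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
print([a["caption"] for a in anns])  # all reference captions for one image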


🎓 Training

Three-Phase Training Curriculum

BoC uses a progressive training strategy:

Phase 1: Image Autoencoder (50K steps)

Trains the ViT → VQ → VAE pipeline to:

  • Establish stable codebook
  • Learn image reconstruction
  • Prevent codebook collapse
python main.py train --dataset flickr30k --data-root ./data/flickr30k --phase 1

Phase 2: Text Alignment (50K steps)

Trains text encoder and decoder with:

  • Text autoencoder (Text → VQ → Text)
  • InfoNCE alignment loss (a sketch follows the command below)
  • Shared concept space learning
python main.py train --dataset flickr30k --data-root ./data/flickr30k --phase 2 \
    --resume-checkpoint ./experiments/.../checkpoints/phase_1/ckpt_50000
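
For reference, a minimal symmetric InfoNCE loss over pooled image/text embeddings in JAX; this is a sketch of the standard formulation using optax, not the project's exact loss code.

import jax.numpy as jnp
import optax

def info_nce(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, dim); row i of each is a positive pair."""
    img = img_emb / jnp.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / jnp.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) cosine similarities
    labels = jnp.arange(logits.shape[0])      # matching pairs on the diagonal
    loss_i2t = optax.softmax_cross_entropy_with_integer_labels(logits, labels)
    loss_t2i = optax.softmax_cross_entropy_with_integer_labels(logits.T, labels)
    return (loss_i2t.mean() + loss_t2i.mean()) / 2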

Phase 3: Joint Fine-tuning (20K steps, optional)

End-to-end optimization of all components.

python main.py train --dataset flickr30k --data-root ./data/flickr30k --phase 3

Model Configurations

Config   Embed Dim   Codebook   Layers   Params   GPU Memory
Small    256         256        4        ~20M     8GB
Base     384         512        6        ~50M     16GB
Large    768         1024       12       ~200M    32GB
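
Roughly, the presets encode something like the following; the field names here are illustrative, and the actual configuration lives in the repo.

from dataclasses import dataclass

@dataclass
class BoCConfig:
    embed_dim: int      # transformer embedding width
    codebook_size: int  # number of discrete concepts
    num_layers: int     # encoder depth

CONFIGS = {
    "small": BoCConfig(embed_dim=256, codebook_size=256, num_layers=4),
    "base":  BoCConfig(embed_dim=384, codebook_size=512, num_layers=6),
    "large": BoCConfig(embed_dim=768, codebook_size=1024, num_layers=12),
}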

📊 Evaluation

python main.py eval \
    --dataset coco \
    --data-root ./data/coco \
    --checkpoint ./experiments/best/checkpoints/phase_2 \
    --split test \
    --output-file results.json

Metrics:

  • Image Reconstruction: PSNR, SSIM
  • Text Reconstruction: Perplexity, BLEU
  • Cross-Modal Retrieval: Recall@1, Recall@5, Recall@10
  • Image Quality: FID, IS
  • Codebook Usage: Perplexity, Active Codes Ratio (see the sketch below)
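
Codebook perplexity is the exponentiated entropy of the code-usage distribution: it reaches the codebook size under perfectly uniform usage and drops toward 1 as the codebook collapses. A sketch:

import jax.numpy as jnp

def codebook_usage(codes, num_codes):
    """codes: (n,) int array of code ids assigned by the quantizer."""
    counts = jnp.bincount(codes, length=num_codes)
    probs = counts / counts.sum()
    entropy = -jnp.sum(jnp.where(probs > 0, probs * jnp.log(probs), 0.0))
    return jnp.exp(entropy), jnp.mean(counts > 0)  # perplexity, active-code ratio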

🎨 Inference

Text-to-Image Generation

from src.models import BoCModel
from src.data import SimpleTokenizer

# Load model and tokenizer
model = BoCModel(...)
tokenizer = SimpleTokenizer.load("tokenizer.pkl")

# Generate
text = "A beautiful sunset over mountains"
tokens = tokenizer.encode(text)
image = model.text_to_image(tokens)

Image-to-Text Generation

from src.utils import load_image

# Load image
image = load_image("photo.jpg")

# Generate caption
caption = model.image_to_text(image, max_length=128)
print(tokenizer.decode(caption))
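
The --temperature flag seen earlier scales the decoder logits before sampling; lower values are greedier, higher values more diverse. A generic sketch, not the repo's decoding loop:

import jax

def sample_next_token(key, logits, temperature=0.9):
    # Divide logits by the temperature, then sample from the softmax.
    return jax.random.categorical(key, logits / temperature)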

πŸ—οΈ Architecture

graph TB
    A[Images] -->|ViT Encoder| B[VQ Codebook<br/>512 Concepts]
    C[Text] -->|Transformer Encoder| B
    B -->|VAE Decoder| D[Generated Images]
    B -->|Transformer Decoder| E[Generated Text]
    
    A -.->|InfoNCE Loss| C
    
    style B fill:#ff6b6b
    style A fill:#4ecdc4
    style C fill:#4ecdc4
    style D fill:#95e1d3
    style E fill:#95e1d3

Component Details

Vector Quantization Layer:

  • Codebook size: 512 (base), 1024 (large)
  • EMA-based updates (decay=0.99)
  • Codebook collapse mitigation (sketched after this list):
    • Dead code reset (threshold=0.01)
    • Entropy regularization (weight=0.1)
    • Commitment loss (weight=0.25)
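
A condensed sketch of how the EMA update and dead-code reset typically interact, following the standard VQ-VAE recipe; this is illustrative, not the project's exact implementation.

import jax
import jax.numpy as jnp

def ema_step(codebook, ema_counts, ema_sums, z, codes, key,
             decay=0.99, reset_threshold=0.01):
    """z: (n, dim) encoder outputs; codes: (n,) assigned code ids."""
    num_codes = codebook.shape[0]
    one_hot = jax.nn.one_hot(codes, num_codes)                  # (n, num_codes)
    # Exponential moving averages of per-code usage and summed vectors.
    ema_counts = decay * ema_counts + (1 - decay) * one_hot.sum(axis=0)
    ema_sums = decay * ema_sums + (1 - decay) * (one_hot.T @ z)
    codebook = ema_sums / (ema_counts[:, None] + 1e-5)
    # Dead-code reset: re-seed rarely used codes from random encoder outputs.
    dead = ema_counts < reset_threshold
    rand_rows = z[jax.random.randint(key, (num_codes,), 0, z.shape[0])]
    codebook = jnp.where(dead[:, None], rand_rows, codebook)
    return codebook, ema_counts, ema_sums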

Vision Encoder:

  • ViT with 16×16 patches
  • 6-12 transformer layers
  • 384-768 dimensional embeddings

Text Encoder:

  • Standard transformer encoder
  • Masked attention for padding
  • Learned positional embeddings

Decoders:

  • VAE: Transposed convolutions with residual blocks
  • Text: Autoregressive transformer decoder

📂 Project Structure

BoC/
├── src/
│   ├── models/          # Neural network architectures
│   ├── training/        # Training loops and losses
│   ├── data/            # Dataset loaders
│   └── utils/           # Utilities
├── main.py              # CLI interface
├── test_model.py        # Component tests
├── requirements.txt
├── README.md
└── STRUCTURE.md         # Detailed code structure
