
Image Captioning with PyTorch & VGG19

This repository contains a full end-to-end image captioning system implemented in PyTorch. It uses VGG19 to extract image features and an encoder-decoder architecture with an LSTM decoder to generate captions. By default, it uses the Flickr8k dataset.

Architecture

The model uses the classic "Show and Tell" architecture:

  • Encoder: Converts 4096-dimensional VGG19 image features into a desired embedding space.
  • Decoder: An LSTM that takes the projected image features as the initial input, followed by word embeddings of the true captions (teacher forcing) during training.
  • Inference: Uses greedy search (picking the highest probability word) to construct the generated caption token-by-token.
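The architecture above can be sketched in a few dozen lines of PyTorch. This is a minimal illustration, not the repository's actual code (which lives in src/model.py and src/inference.py); class names, argument names, and default sizes are assumptions based on the stats reported later in this README.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Projects 4096-d VGG19 features into the word-embedding space."""

    def __init__(self, feature_size=4096, embed_size=512):
        super().__init__()
        self.fc = nn.Linear(feature_size, embed_size)
        self.relu = nn.ReLU()

    def forward(self, features):
        return self.relu(self.fc(features))


class Decoder(nn.Module):
    """LSTM that consumes the image embedding first, then the caption tokens
    (teacher forcing during training)."""

    def __init__(self, vocab_size, embed_size=512, hidden_size=512, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_embedding, captions):
        # Prepend the projected image features as the first time step.
        word_embeds = self.embed(captions)                              # (B, T, E)
        inputs = torch.cat([img_embedding.unsqueeze(1), word_embeds], dim=1)
        hiddens, _ = self.lstm(inputs)                                  # (B, T+1, H)
        return self.fc(hiddens)                                         # (B, T+1, V)


def greedy_decode(encoder, decoder, features, end_id, max_len=20):
    """Generate a caption token-by-token, always taking the argmax word."""
    inputs = encoder(features).unsqueeze(1)  # (1, 1, E)
    states = None
    tokens = []
    for _ in range(max_len):
        hiddens, states = decoder.lstm(inputs, states)
        logits = decoder.fc(hiddens.squeeze(1))
        word = logits.argmax(dim=1)
        if word.item() == end_id:
            break
        tokens.append(word.item())
        inputs = decoder.embed(word).unsqueeze(1)
    return tokens
```

During training, the decoder sees the ground-truth caption shifted by one step; at inference time, `greedy_decode` feeds each predicted word back in as the next input.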

Project Structure

Image-Captioning/
├── pyproject.toml               # Project metadata and dependencies (managed by uv)
├── src/                         # Reusable PyTorch source code
│   ├── dataset.py               # Vocabulary builder and PyTorch Dataset class
│   ├── model.py                 # PyTorch Encoder and Decoder models
│   ├── inference.py             # Greedy search and inference logic
│   └── preprocessing.py         # Image transformations and utilities
├── notebooks/                   # Jupyter Notebooks
│   └── image_captioning.ipynb   # Main end-to-end training and inference notebook
├── dataset/                     # Contains the captions and pre-computed features

Setup & Installation

This project is managed using uv, an extremely fast Python package and project manager.

# 1. Install uv (if you haven't already and are on macOS/Linux)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies and create the environment
uv sync

# 3. Launch Jupyter Notebook
uv run jupyter notebook

Running the Model

Open the notebook at notebooks/image_captioning.ipynb after completing the setup steps above. The notebook walks through:

  1. Loading the precomputed Parquet features.
  2. Building a vocabulary from the training text.
  3. Training the LSTM Decoder using PyTorch.
  4. Saving weights and validating against validation/test sets with sample image outputs.
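The vocabulary-building step (2) can be sketched as below. This is an illustrative version under assumed conventions (special-token names, a `min_freq` cutoff); the repository's actual builder lives in src/dataset.py.

```python
from collections import Counter


def build_vocab(captions, min_freq=1):
    """Map words to integer ids, reserving ids for special tokens."""
    counter = Counter(word for cap in captions for word in cap.lower().split())
    itos = ["<pad>", "<start>", "<end>", "<unk>"]
    itos += [w for w, c in counter.items() if c >= min_freq]
    stoi = {w: i for i, w in enumerate(itos)}
    return stoi, itos


def numericalize(caption, stoi):
    """Convert one caption string into a list of token ids."""
    tokens = ["<start>"] + caption.lower().split() + ["<end>"]
    return [stoi.get(t, stoi["<unk>"]) for t in tokens]
```

Words below the frequency cutoff (or unseen at training time) fall back to `<unk>`, which keeps the vocabulary, and therefore the decoder's output layer, compact.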

Monitoring with TensorBoard

You can monitor the training progress, loss, and other metrics using TensorBoard:

uv run tensorboard --logdir runs --port 6006

Then, open http://localhost:6006/ in your browser.
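For the `--logdir runs` flag above to find anything, training has to write event files under `runs/`. A minimal logging sketch with PyTorch's `SummaryWriter` (the run-directory name and metric tags here are assumptions, not the notebook's actual ones):

```python
from torch.utils.tensorboard import SummaryWriter

# The log_dir must sit under the directory passed to --logdir.
writer = SummaryWriter(log_dir="runs/flickr8k")
for step, loss in enumerate([3.378, 2.9, 2.555]):
    writer.add_scalar("val/loss", loss, step)
writer.close()
```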

References

  1. Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan: Show and Tell: A Neural Image Caption Generator
  2. Andrej Karpathy: CS231n Winter 2016 Lesson 10 Recurrent Neural Networks, Image Captioning and LSTM

Best Model Execution Statistics

The best model was trained with the following hyperparameters and environment on the Flickr8k dataset:

  • 💻 Hardware Used: Apple Silicon (MPS backend), selected via torch.device("mps")
  • ⚙️ Hyperparameters:
    • Embed Size: 512, Hidden Size: 512, Layers: 2
    • Optimizer: AdamW (1e-3 LR, 1e-4 weight decay)
    • Vocab Size: 4956
  • 📈 Convergence: Trained for a 15-epoch cycle; validation loss converged from 3.378 down to 2.555, at which point best_model.pth was saved.
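The device and optimizer configuration above can be reproduced in a few lines. The CPU fallback and the stand-in module are illustrative additions; the real decoder lives in src/model.py.

```python
import torch
import torch.nn as nn

# Device selection as reported: MPS on Apple Silicon, with a CPU fallback.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Stand-in module matching the reported sizes (embed 512, hidden 512, 2 layers).
model = nn.LSTM(input_size=512, hidden_size=512, num_layers=2).to(device)

# Optimizer configured with the reported hyperparameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
```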

Sample Inference:

  • Ground-truth caption: A large brown dog is jumping into the ocean .
  • Generated caption: A dog is running through the water .
