This repository contains a full end-to-end Image Captioning system implemented in PyTorch. It uses VGG19 to extract image features and an Encoder-Decoder architecture with an LSTM decoder to generate captions. By default, it trains on the Flickr8k dataset.
The model uses the classic "Show and Tell" architecture:
- Encoder: Projects the 4096-dimensional VGG19 image features down to the decoder's embedding size.
- Decoder: An LSTM that takes the projected image features as the initial input, followed by word embeddings of the true captions (teacher forcing) during training.
- Inference: Uses greedy search (picking the highest-probability word at each step) to construct the generated caption token-by-token.
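A minimal sketch of that greedy loop, assuming an LSTM decoder whose first input is the projected image feature (function and argument names here are illustrative; the real logic lives in src/inference.py):

```python
import torch

@torch.no_grad()
def greedy_decode(lstm, embed, fc, image_feat, end_id, max_len=20):
    # Feed the projected image feature as the first LSTM input,
    # then repeatedly feed back the embedding of the top-scoring word.
    states = None
    inputs = image_feat.unsqueeze(1)            # (1, 1, embed_size)
    caption = []
    for _ in range(max_len):
        hiddens, states = lstm(inputs, states)  # one LSTM step
        logits = fc(hiddens.squeeze(1))         # (1, vocab_size)
        token = logits.argmax(dim=1)            # greedy: highest-probability word
        if token.item() == end_id:
            break                               # stop at the end-of-sentence token
        caption.append(token.item())
        inputs = embed(token).unsqueeze(1)      # predicted word becomes next input
    return caption
```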
Image-Captioning/
├── pyproject.toml # Project metadata and dependencies (managed by uv)
├── src/ # Reusable PyTorch source code
│ ├── dataset.py # Vocabulary builder and PyTorch Dataset class
│ ├── model.py # PyTorch Encoder and Decoder models
│ ├── inference.py # Greedy search and inference logic
│ └── preprocessing.py # Image transformations and utilities
├── notebooks/ # Jupyter Notebooks
│ └── image_captioning.ipynb # Main end-to-end training and inference notebook
└── dataset/ # Contains the captions and pre-computed features
This project is managed using uv, an extremely fast Python package and project manager.
# 1. Install uv (if you haven't already and are on macOS/Linux)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Add dependencies and create the environment
uv sync
# 3. Launch Jupyter Notebook
uv run jupyter notebook

Open the notebook at notebooks/image_captioning.ipynb after completing the setup steps above. The notebook walks through:
- Loading the precomputed Parquet features.
- Building a vocabulary from the training text.
- Training the LSTM Decoder using PyTorch.
- Saving weights and validating against validation/test sets with sample image outputs.
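The teacher-forced training step described above can be sketched roughly as follows (names and shapes are assumptions; the actual loop lives in the notebook):

```python
import torch
import torch.nn as nn

def train_step(lstm, embed, fc, optimizer, feats, captions, pad_id=0):
    """One teacher-forced step. feats: (B, embed_size) projected image
    features; captions: (B, T) ground-truth token ids. Illustrative only."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    # The image feature is the first LSTM input, followed by the embeddings
    # of the ground-truth words shifted right (teacher forcing).
    word_embeds = embed(captions[:, :-1])                          # (B, T-1, E)
    inputs = torch.cat([feats.unsqueeze(1), word_embeds], dim=1)   # (B, T, E)
    hiddens, _ = lstm(inputs)                                      # (B, T, H)
    logits = fc(hiddens)                                           # (B, T, V)
    # Predict every caption token, including the first word from the image step.
    loss = criterion(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```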
You can monitor the training progress, loss, and other metrics using TensorBoard:
uv run tensorboard --logdir runs --port 6006

Then open http://localhost:6006/ in your browser.
- Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan: Show and Tell: A Neural Image Caption Generator
- Andrej Karpathy: CS231n Winter 2016, Lecture 10: Recurrent Neural Networks, Image Captioning, LSTM
The model was trained and validated on the Flickr8k dataset with the following hyperparameters and environment:
- 💻 Hardware: Apple Silicon (MPS backend), selected via torch.device("mps")
- ⚙️ Hyperparameters: Embed Size 512, Hidden Size 512, LSTM Layers 2, Optimizer AdamW (learning rate 1e-3, weight decay 1e-4), Vocab Size 4956
- 📈 Convergence: over a 15-epoch run, validation loss fell from 3.378 to 2.555, at which point best_model.pth was saved
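Putting those settings together, instantiating the decoder and optimizer might look like this (a sketch; the real classes live in src/model.py and may differ in detail):

```python
import torch
import torch.nn as nn

# Device selection as described above: prefer Apple Silicon's MPS backend.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

class DecoderRNN(nn.Module):
    """LSTM decoder with the hyperparameters listed above."""
    def __init__(self, embed_size=512, hidden_size=512, vocab_size=4956, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

model = DecoderRNN().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
```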
Sample inference:
- Ground-truth caption: A large brown dog is jumping into the ocean .
- Model output: A dog is running through the water .