This repository contains a full end-to-end Image Captioning system implemented in PyTorch. It uses VGG19 to extract image features and an Encoder-Decoder architecture with an LSTM decoder to generate captions. By default, it trains on the Flickr8k dataset.
The model uses the classic "Show and Tell" architecture:
- Encoder: Projects the 4096-dimensional VGG19 image features down to the decoder's embedding size.
- Decoder: An LSTM that takes the projected image features as the initial input, followed by word embeddings of the true captions (teacher forcing) during training.
- Inference: Uses greedy search (picking the highest-probability word at each step) to construct the generated caption token-by-token.
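A minimal sketch of that greedy loop, assuming an LSTM decoder whose first input is the projected image feature (function and argument names here are illustrative; the real logic lives in src/inference.py):

```python
import torch

@torch.no_grad()
def greedy_decode(lstm, embed, fc, image_feat, end_id, max_len=20):
    # Feed the projected image feature as the first LSTM input,
    # then repeatedly feed back the embedding of the top-scoring word.
    states = None
    inputs = image_feat.unsqueeze(1)            # (1, 1, embed_size)
    caption = []
    for _ in range(max_len):
        hiddens, states = lstm(inputs, states)  # one LSTM step
        logits = fc(hiddens.squeeze(1))         # (1, vocab_size)
        token = logits.argmax(dim=1)            # greedy: highest-probability word
        if token.item() == end_id:
            break                               # stop at the end-of-sentence token
        caption.append(token.item())
        inputs = embed(token).unsqueeze(1)      # predicted word becomes next input
    return caption
```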
Image-Captioning/
├── pyproject.toml # Project metadata and dependencies (managed by uv)
├── src/ # Reusable PyTorch source code
│ ├── dataset.py # Vocabulary builder and PyTorch Dataset class
│ ├── model.py # PyTorch Encoder and Decoder models
│ ├── inference.py # Greedy search and inference logic
│ └── preprocessing.py # Image transformations and utilities
├── notebooks/ # Jupyter Notebooks
│ └── image_captioning.ipynb # Main end-to-end training and inference notebook
└── dataset/ # Contains the captions and pre-computed features
This project is managed using uv, an extremely fast Python package and project manager.
# 1. Install uv (if you haven't already and are on macOS/Linux)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Add dependencies and create the environment
uv sync
# 3. Launch Jupyter Notebook
uv run jupyter notebook

Open the notebook at notebooks/image_captioning.ipynb after completing the setup steps above. The notebook walks through:
- Loading the precomputed Parquet features.
- Building a vocabulary from the training text.
- Training the LSTM Decoder using PyTorch.
- Saving weights and validating against validation/test sets with sample image outputs.
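The teacher-forced training step described above can be sketched roughly as follows (names and shapes are assumptions; the actual loop lives in the notebook):

```python
import torch
import torch.nn as nn

def train_step(lstm, embed, fc, optimizer, feats, captions, pad_id=0):
    """One teacher-forced step. feats: (B, embed_size) projected image
    features; captions: (B, T) ground-truth token ids. Illustrative only."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    # The image feature is the first LSTM input, followed by the embeddings
    # of the ground-truth words shifted right (teacher forcing).
    word_embeds = embed(captions[:, :-1])                          # (B, T-1, E)
    inputs = torch.cat([feats.unsqueeze(1), word_embeds], dim=1)   # (B, T, E)
    hiddens, _ = lstm(inputs)                                      # (B, T, H)
    logits = fc(hiddens)                                           # (B, T, V)
    # Predict every caption token, including the first word from the image step.
    loss = criterion(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```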
You can monitor the training progress, loss, and other metrics using TensorBoard:
uv run tensorboard --logdir runs --port 6006

Then open http://localhost:6006/ in your browser.
- Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan: Show and Tell: A Neural Image Caption Generator
- Andrej Karpathy: CS231n Winter 2016, Lecture 10: Recurrent Neural Networks, Image Captioning, LSTM
The model was trained and validated on the Flickr8k dataset with the following hyperparameters and environment:
- 💻 Hardware: Apple Silicon (MPS backend), selected via torch.device("mps")
- ⚙️ Hyperparameters: Embed Size 512, Hidden Size 512, LSTM Layers 2, Optimizer AdamW (learning rate 1e-3, weight decay 1e-4), Vocab Size 4956
- 📈 Convergence: over a 15-epoch run, validation loss fell from 3.378 to 2.555, at which point best_model.pth was saved
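Putting those settings together, instantiating the decoder and optimizer might look like this (a sketch; the real classes live in src/model.py and may differ in detail):

```python
import torch
import torch.nn as nn

# Device selection as described above: prefer Apple Silicon's MPS backend.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

class DecoderRNN(nn.Module):
    """LSTM decoder with the hyperparameters listed above."""
    def __init__(self, embed_size=512, hidden_size=512, vocab_size=4956, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

model = DecoderRNN().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
```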
Sample inference:
- Ground-truth caption: A large brown dog is jumping into the ocean .
- Model output: A dog is running through the water .