This project builds a deep learning system that generates natural language captions for images by combining CNN-based visual feature extraction and RNN-based language modeling.
- Extracts image features using InceptionV3 (pretrained on ImageNet).
- Generates captions using a CNN + LSTM model.
- Trained and validated on a structured version of the Flickr8k dataset.
- Generates captions for test images and exports results to a CSV file.
The dataset used for this project (`Assignment 2 files.zip`) is hosted on Google Drive due to size limits.
Download from Google Drive
Keep the archive zipped: the first cell of the notebook unzips it.
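For reference, that unzip step boils down to something like the following (a minimal sketch; the notebook's actual cell may differ):

```python
import zipfile

# Extract the dataset archive into the working directory;
# this creates the "Assignment 2 files/" folder shown below.
with zipfile.ZipFile("Assignment 2 files.zip") as zf:
    zf.extractall(".")
```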
```
Assignment 2 files/
├── train/      # Training images
├── val/        # Validation images
├── test/       # Test images (no captions)
├── train.txt   # Training image-caption pairs
└── val.txt     # Validation image-caption pairs
```
Preprocessing
- Tokenizes and sequences captions.
- Adds `<start>` and `<end>` tokens.
- Computes `max_length` and the vocabulary size (see the sketch below).
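A minimal sketch of these steps with Keras' `Tokenizer` (the toy captions and the adjusted filter list are illustrative assumptions, not the notebook's exact code):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Toy captions; the real ones come from train.txt / val.txt.
captions = ["<start> a dog runs through the grass <end>",
            "<start> a man rides a bike <end>"]

# Default filters strip "<" and ">", so drop those two characters
# from the filter list to keep the special tokens intact.
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(captions)

sequences = tokenizer.texts_to_sequences(captions)
max_length = max(len(s) for s in sequences)
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0
```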
Feature Extraction
- Uses InceptionV3 to extract 2048-d image features.
- Features are cached using `.pkl` files for faster reuse (sketch below).
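A minimal sketch of the extract-and-cache idea, assuming average pooling on InceptionV3 to get the 2048-d vector; the file names and the single-image example are illustrative:

```python
import pickle
import numpy as np
import tensorflow as tf

# InceptionV3 without its classifier head; pooling="avg" yields a
# 2048-d vector per image.
cnn = tf.keras.applications.InceptionV3(include_top=False,
                                        weights="imagenet",
                                        pooling="avg")

def extract_feature(path):
    img = tf.keras.utils.load_img(path, target_size=(299, 299))
    x = tf.keras.utils.img_to_array(img)
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    return cnn.predict(x[np.newaxis], verbose=0)[0]  # shape (2048,)

# Cache features so repeated runs skip the CNN forward pass.
features = {"123456.jpg": extract_feature("train/123456.jpg")}
with open("train_features.pkl", "wb") as f:
    pickle.dump(features, f)
```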
Model Architecture
- Dense layer for image features.
- Embedding + LSTM for caption input.
- Combined output is passed through Dense layers to predict the next word (sketched below).
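A sketch of this merge-style architecture in the Keras functional API; the 256-unit layer sizes and the placeholder `max_length`/`vocab_size` values are assumptions, not the notebook's exact settings:

```python
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, add

max_length, vocab_size = 34, 8000  # placeholders; use the computed values

# Image branch: project the 2048-d InceptionV3 vector.
img_in = Input(shape=(2048,))
img_feat = Dense(256, activation="relu")(img_in)

# Text branch: embed the partial caption and run it through an LSTM.
cap_in = Input(shape=(max_length,))
cap_feat = LSTM(256)(Embedding(vocab_size, 256, mask_zero=True)(cap_in))

# Merge both branches and predict the next word over the vocabulary.
x = add([img_feat, cap_feat])
x = Dense(256, activation="relu")(x)
out = Dense(vocab_size, activation="softmax")(x)

model = Model(inputs=[img_in, cap_in], outputs=out)
```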
Training
- Uses `sparse_categorical_crossentropy` as the loss function.
- Includes EarlyStopping (`patience=15`) to prevent overfitting.
- Monitors both training and validation loss (example below).
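The corresponding compile/fit calls look roughly like this (a sketch: the `adam` optimizer and the batch/epoch settings are assumptions, and `X_img`, `X_cap`, `y` with their `Xv_*`/`yv` validation counterparts stand for the prepared training arrays):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Optimizer choice is illustrative; the loss matches the README.
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# Stop when validation loss plateaus; restore_best_weights is an assumption.
early_stop = EarlyStopping(monitor="val_loss", patience=15,
                           restore_best_weights=True)

# X_img: (N, 2048) image features; X_cap: (N, max_length) padded caption
# prefixes; y: (N,) next-word ids.
history = model.fit([X_img, X_cap], y,
                    validation_data=([Xv_img, Xv_cap], yv),
                    epochs=100, batch_size=64,
                    callbacks=[early_stop])
```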
Caption Generation
- Custom `generate_caption()` function builds captions word-by-word (sketch below).
- Results are saved to `submission.csv` in the format `image_id,caption`, e.g. `123456.jpg,A man is riding a bike.`
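A greedy decoder in the spirit of `generate_caption()`, plus the CSV export; it assumes the `tokenizer`, `model`, and `max_length` from the earlier sketches, and a hypothetical `test_features` dict mapping image names to cached feature vectors:

```python
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feat, max_length):
    """Greedily append the most likely next word until <end> or max_length."""
    text = "<start>"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([photo_feat[np.newaxis], seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "<end>":
            break
        text += " " + word
    return text.replace("<start>", "").strip()

# Export one row per test image in the submission format.
rows = [{"image_id": name,
         "caption": generate_caption(model, tokenizer, feat, max_length)}
        for name, feat in test_features.items()]
pd.DataFrame(rows).to_csv("submission.csv", index=False)
```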
Tech Stack
- Python 3
- TensorFlow / Keras
- InceptionV3
- NumPy, pandas
Built using TensorFlow and Google Colab.