A minimal CLIP reproduction in PyTorch.
This repository focuses on the core CLIP pipeline:
- text tokenizer
- image encoder
- text encoder
- contrastive loss
- training and retrieval evaluation
It is designed to be easy to read, easy to run, and easy to extend.
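To make the contrastive-loss component concrete, here is a minimal, framework-free sketch of a symmetric CLIP-style loss. It is not the code in src/loss.py: the function name `clip_contrastive_loss` and the temperature of 0.07 are assumptions chosen for illustration, and plain Python lists stand in for tensors so the arithmetic stays visible.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    Pair i (image i, text i) is the positive; all other pairs in the batch
    act as negatives. The temperature value is an assumption, not the repo's.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: row i should put its mass on column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should put its mass on row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Correctly paired embeddings give a loss near zero; swapped pairs give a large loss.
matched = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"matched={matched:.6f} swapped={swapped:.4f}")
```

Averaging the two directions keeps the loss symmetric between image-to-text and text-to-image retrieval, which is why both are evaluated on the same similarity matrix.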
- GPU: vGPU 48GB x1
- Recommended Python: 3.10+
- Recommended PyTorch stack: CUDA 12.8 wheels (cu128)
Create a clean environment first:
conda create -n mini_clip python=3.10 -y
conda activate mini_clip
pip install -r requirements.txt
If you are on a vGPU, use the CUDA 12.8 build of PyTorch pinned in requirements.txt.
- Download COCO 2017 images and captions:
# Example: download to data/raw/COCO2017/
mkdir -p data/raw/COCO2017
cd data/raw/COCO2017
# Train
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip train2017.zip
unzip annotations_trainval2017.zip
# Val
wget http://images.cocodataset.org/zips/val2017.zip
unzip val2017.zip
- Prepare JSON splits:
python prepare_coco2017.py
This generates data/train.json and data/val.json.
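As a rough illustration of what this preparation step involves, the sketch below maps a COCO captions annotation (as found in annotations/captions_train2017.json) to records with the image_path and text fields the training code expects. The helper name `coco_captions_to_pairs` is hypothetical, and prepare_coco2017.py may differ in its details.

```python
import json

def coco_captions_to_pairs(annotation, image_dir):
    """Turn a COCO captions annotation dict into image-text records.

    COCO stores image metadata and captions in separate lists that are
    linked through the image id, so the first step is an id -> file lookup.
    """
    id_to_file = {img["id"]: img["file_name"] for img in annotation["images"]}
    return [
        {
            "image_path": f"{image_dir}/{id_to_file[ann['image_id']]}",
            "text": ann["caption"].strip(),
        }
        for ann in annotation["annotations"]
    ]

# Tiny in-memory stand-in for the real annotation file (illustrative values only).
sample = {
    "images": [{"id": 9, "file_name": "000000000009.jpg"}],
    "annotations": [{"image_id": 9, "id": 1, "caption": "a man riding a bicycle "}],
}
pairs = coco_captions_to_pairs(sample, "raw/COCO2017/train2017")
print(json.dumps(pairs, indent=2))
```

Each COCO image usually has several captions, so this mapping produces one record per caption rather than one per image.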
- Download Flickr8k to data/raw/flickr8k/:
data/raw/flickr8k/
├── Images/ # image files
└── captions.txt # captions file
- Prepare JSON splits:
python prepare_flickr8k.py
This generates data/train.json, data/val.json, and data/test.json.
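One detail worth getting right in a three-way split is to split by image rather than by caption, so that captions of the same image never land in different splits. The sketch below shows that idea under several assumptions: captions.txt is a CSV with an `image,caption` header, the split fractions are 10% each for validation and test, and image paths are written relative to data/. The function name `split_flickr8k` is hypothetical; prepare_flickr8k.py may do this differently.

```python
import csv
import io
import random

def split_flickr8k(captions_file, val_frac=0.1, test_frac=0.1, seed=0):
    """Split caption rows into train/val/test, keeping each image in one split."""
    rows = list(csv.DictReader(captions_file))
    # Shuffle unique image names deterministically, then carve out val and test.
    images = sorted({row["image"] for row in rows})
    random.Random(seed).shuffle(images)
    n_val = int(len(images) * val_frac)
    n_test = int(len(images) * test_frac)
    split_of = {}
    for i, name in enumerate(images):
        if i < n_val:
            split_of[name] = "val"
        elif i < n_val + n_test:
            split_of[name] = "test"
        else:
            split_of[name] = "train"
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        splits[split_of[row["image"]]].append(
            {
                # Path prefix is an assumption modeled on the COCO example record.
                "image_path": f"raw/flickr8k/Images/{row['image']}",
                "text": row["caption"].strip(),
            }
        )
    return splits

# Synthetic captions file: 10 images with 2 captions each.
lines = ["image,caption"]
for i in range(10):
    lines.append(f"img{i}.jpg,first caption for image {i}")
    lines.append(f"img{i}.jpg,second caption for image {i}")
splits = split_flickr8k(io.StringIO("\n".join(lines)))
print({name: len(records) for name, records in splits.items()})
```

Fixing the seed keeps the splits reproducible across runs, which matters when comparing results written at different times.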
The training code expects JSON files with image-text pairs:
[
{
"image_path": "raw/COCO2017/train2017/000000000009.jpg",
"text": "a man riding a bicycle"
}
]

python train.py \
--data data/train.json \
--val-data data/val.json \
--epochs 10 \
--batch-size 32 \
--num-workers 4

Recommended starting values for a single RTX 5090 32GB:
- batch-size=32
- num-workers=4
- epochs=10
- lr=1e-4
Training saves the best checkpoint to checkpoints/miniclip.pt and the tokenizer vocabulary to checkpoints/tokenizer.json.
Retrieval evaluation is built into train.py and runs automatically after each epoch. You can also call evaluate_retrieval_top1() directly from train.py.
Results can be written to results/ for later comparison.
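For readers who want to see what a top-1 retrieval metric measures, here is a small stand-alone sketch. It is not the repository's evaluate_retrieval_top1(): the name `top1_retrieval_accuracy` is hypothetical, and it assumes that query i and candidate i form the ground-truth pair.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top1_retrieval_accuracy(query_embs, candidate_embs):
    """Fraction of queries whose most similar candidate shares their index."""
    queries = [normalize(v) for v in query_embs]
    candidates = [normalize(v) for v in candidate_embs]
    hits = 0
    for i, q in enumerate(queries):
        # Cosine similarity of this query against every candidate.
        sims = [sum(a * b for a, b in zip(q, c)) for c in candidates]
        best = max(range(len(sims)), key=sims.__getitem__)
        if best == i:
            hits += 1
    return hits / len(queries)

# Toy embeddings: the first two images match their texts, the third does not.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[1.0, 0.1], [0.1, 1.0], [1.0, -1.0]]
accuracy = top1_retrieval_accuracy(image_embs, text_embs)
print(f"image-to-text top-1: {accuracy:.3f}")
```

Swapping the two arguments gives the text-to-image direction, which is often reported alongside image-to-text.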
python demo.py \
--image CLIP.png \
--texts "a dog" "a cat" "a diagram" \
--ckpt checkpoints/miniclip.pt \
--tokenizer checkpoints/tokenizer.json

- Best Val Top-1 Accuracy: 62% (at epoch 17)
- Final Loss: 0.0332
- Training Curve: Loss converges steadily from ~0.27 to 0.03
Input image: CLIP.png
| Text | Similarity | Match |
|---|---|---|
| "a dog" | 0.312 | |
| "a cat" | 0.611 | ✅ |
| "a diagram" | 0.077 | |
The demo marks the text with the highest similarity score as the match; in the table above, that is "a cat" (0.611).
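The three scores in the table sum to 1.0, which is consistent with a softmax over scaled image-text similarities. Assuming that is how the demo scores its candidates, the ranking step can be sketched as follows. The function `rank_texts`, the logit scale of 100, and the input similarities are all illustrative assumptions, not values taken from demo.py or from the model.

```python
import math

def softmax(scores):
    """Numerically stable softmax that turns logits into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rank_texts(texts, cosine_sims, logit_scale=100.0):
    """Pair each text with its softmax score and sort from best to worst.

    logit_scale is an assumed value; a trained model would supply its own.
    """
    probs = softmax([logit_scale * s for s in cosine_sims])
    return sorted(zip(texts, probs), key=lambda pair: pair[1], reverse=True)

# Made-up cosine similarities for three placeholder captions (not model output).
ranked = rank_texts(["caption A", "caption B", "caption C"], [0.20, 0.25, 0.10])
for text, prob in ranked:
    print(f"{text}: {prob:.3f}")
```

Because softmax scores are relative to the candidate set, adding or removing a candidate text changes every score, so values from different runs are only comparable when the text list is the same.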
- This is an educational implementation, not the original OpenAI CLIP codebase.
- Use a real image-text dataset such as COCO Captions for better results.
- checkpoints/, results/, and raw datasets should stay out of git (see .gitignore).
Mini-CLIP/
├── README.md
├── requirements.txt
├── .gitignore
├── train.py # training loop + inline evaluation
├── demo.py # inference demo
├── prepare_coco2017.py # COCO2017 dataset preparation
├── prepare_flickr8k.py # Flickr8k dataset preparation
├── src/
│   ├── tokenizer.py # BPE tokenizer
│   ├── preprocess.py # data preprocessing
│   ├── image_encoder.py # ViT-based image encoder
│   ├── text_encoder.py # text encoder
│   ├── model.py # MiniCLIP model
│   ├── loss.py # contrastive loss
│   └── data.py # dataset & dataloader
├── data/ # (gitignored)
│   ├── train.json
│   ├── val.json
│   ├── test.json
│   └── raw/
│       ├── COCO2017/
│       │   ├── train2017/
│       │   ├── val2017/
│       │   └── annotations/
│       └── flickr8k/
│           ├── Images/
│           └── captions.txt
├── checkpoints/ # (gitignored)
└── results/ # (gitignored)