Original Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
This project explores the application of Transformer-based architectures to image classification tasks. Unlike the traditional Convolutional Neural Networks (CNNs) commonly used in image classification, this project utilizes Transformer models adapted to handle image data.
The goal is to investigate the effectiveness of Vision Transformer (ViT) models in image classification by leveraging self-attention mechanisms to capture global dependencies and relationships within images. This project aims to demonstrate the potential of Transformers in understanding and classifying visual content without relying on convolutional operations.
The expected dataset directory structure is as follows:
data
├── Class_1
│ ├── img_1-1.jpg
│ ├── img_1-2.jpg
│ └── ...
├── Class_2
│ ├── img_2-1.jpg
│ ├── img_2-2.jpg
│ └── ...
├── Class_3
│ ├── img_3-1.jpg
│ ├── img_3-2.jpg
│ └── ...
└── ...
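The one-subfolder-per-class layout above can be turned into integer labels with a few lines of standard-library Python. This is an illustrative sketch, not code from the repository; the function name `build_class_index` and the `.jpg`-only glob are assumptions.

```python
from pathlib import Path

def build_class_index(data_root):
    """Map each class subfolder name to an integer label and
    collect (image_path, label) pairs, mirroring the layout
    above (one subfolder per class)."""
    root = Path(data_root)
    class_names = sorted(d.name for d in root.iterdir() if d.is_dir())
    class_to_idx = {name: i for i, name in enumerate(class_names)}
    samples = [
        (path, class_to_idx[d.name])
        for d in root.iterdir() if d.is_dir()
        for path in sorted(d.glob("*.jpg"))
    ]
    return class_to_idx, samples
```

Sorting the class names makes the label assignment deterministic across runs, which matters when the mapping must match between training and inference.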
The core architecture utilized in this project for image classification tasks is the Vision Transformer (ViT). Unlike traditional Convolutional Neural Networks (CNNs), the ViT model employs a transformer-based architecture, originally proposed for sequence-to-sequence tasks in natural language processing, adapted to handle image data.
- Patch Embeddings: Images are divided into fixed-size patches, which are then linearly embedded to generate sequence inputs for the transformer.
- Positional Encodings: Positional encodings are added to the patch embeddings to provide spatial information about the patches' locations within the image.
- Transformer Encoder: The ViT model consists of multiple transformer encoder layers that process the sequence of patch embeddings using self-attention mechanisms to capture global dependencies.
- Classification Head: A standard linear classification head is appended on top of the transformer encoder to predict the image's class label.
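The first two steps above (patch extraction, linear embedding, positional encoding) can be sketched in NumPy. This is purely illustrative: it assumes a 224x224 RGB input, and the projection matrix and positional encodings are randomly initialized here, whereas in the real model both are learned.

```python
import numpy as np

# Assumed dimensions: 224x224 RGB image, 16x16 patches, hidden size 64.
image_size, patch_size, channels, hidden_size = 224, 16, 3, 64
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196

rng = np.random.default_rng(0)
image = rng.standard_normal((image_size, image_size, channels))

# 1) Split the image into non-overlapping patches and flatten each one.
patches = image.reshape(image_size // patch_size, patch_size,
                        image_size // patch_size, patch_size, channels)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(num_patches, -1)

# 2) Linearly project each flattened patch (16*16*3 = 768 values)
#    down to the hidden size.
projection = rng.standard_normal((patch_size * patch_size * channels,
                                  hidden_size))
embeddings = patches @ projection              # shape (196, 64)

# 3) Add positional encodings so the encoder sees patch order.
positions = rng.standard_normal((num_patches, hidden_size))
sequence = embeddings + positions              # transformer encoder input
```

The resulting `(196, 64)` sequence is what the transformer encoder layers consume with self-attention.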
The following hyperparameters are provided for reference only, for the case of 3 classes with 1000+ images in total:
- Batch Size: 1280
- Epochs: 80
- Learning Rate: 0.001
- Patch Size: 16
- Hidden Size: 64
- Number of Hidden Layers: 2
- Number of Attention Heads: 3
- Intermediate Size: 256
- Dropout Probability (Hidden): 0.04
- Dropout Probability (Attention): 0.12
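A quick sanity check of the shapes these hyperparameters imply. The 224x224 input size is an assumption (the standard ViT input resolution); the project's actual image size may differ.

```python
# Shape arithmetic implied by the hyperparameters above,
# assuming 224x224 RGB input images.
image_size, patch_size = 224, 16
hidden_size, intermediate_size = 64, 256

patches_per_side = image_size // patch_size   # 14
sequence_length = patches_per_side ** 2       # 196 patch tokens per image
patch_dim = patch_size * patch_size * 3       # 768 raw values per RGB patch

# Each encoder layer's feed-forward block expands the hidden size to the
# intermediate size and back again: 64 -> 256 -> 64.
```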
The training process involves using the provided dataset split into training and testing sets. The model is trained for 80 epochs using the AdamW optimizer with a learning rate of 0.001. Training progress is monitored via training and testing loss, alongside accuracy metrics.
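For reference, the AdamW optimizer used here differs from plain Adam in that weight decay is decoupled from the adaptive gradient step. A minimal sketch of a single parameter update with the learning rate above (the betas, epsilon, and weight decay shown are AdamW's common defaults, not values confirmed by the project):

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=0.001,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update: Adam's moment estimates drive the gradient
    step, while weight decay is applied directly to the parameters
    instead of being folded into the gradient."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps)
                          + weight_decay * param)  # decoupled decay
    return param, m, v
```

In practice `torch.optim.AdamW` handles this; the sketch only makes the update rule explicit.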
To replicate the experiment:
- Clone the repository:
  git clone https://github.com/YapWH1208/Image-Classification.git
- Prepare the data: ensure the image dataset is placed in the appropriate directory (e.g., /data).
- Run training:
  python train.py
The trained Vision Transformer (ViT) model performs well on the image classification task, demonstrating its effectiveness at understanding and categorizing visual content.
- Accuracy: Achieved an overall accuracy of 80.5% on the test dataset, showcasing the model's ability to correctly classify images.
This project is licensed under the Apache License 2.0.