This repository accompanies the paper "Exploring Algorithmic Design Choices for Low Latency CNN Deployment", presented at HIPC 2024.
DOI: 10.1109/HiPC62374.2024.00017
It provides:
- SYCL-based implementations of five convolution algorithms using Intel oneAPI DPC++.
- Three CNN models (VGG16, ResNet101, InceptionV4) in which all Conv2d layers are replaced with custom SYCL-based implementations via ctypes.
- Benchmarks and execution-time tracking for each convolution layer.
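The ctypes bridge works roughly as follows — a minimal sketch for illustration; the exported symbol name (`conv2d`) and its argument list here are assumptions, not the repository's actual interface, which is defined in the kernel `.cpp` files:

```python
import ctypes
import os

LIB_PATH = "./smm_conv.so"  # built from SMM.cpp; see the compilation section

def load_conv_library(path):
    """Load a compiled SYCL kernel library, if it has been built."""
    if not os.path.exists(path):
        return None
    lib = ctypes.CDLL(path)
    # Hypothetical exported symbol: three float* buffers (input, weight,
    # output) followed by integer shape parameters. The real entry point
    # and argument order come from the corresponding .cpp file.
    lib.conv2d.argtypes = [ctypes.POINTER(ctypes.c_float)] * 3 + [ctypes.c_int] * 7
    lib.conv2d.restype = None
    return lib

lib = load_conv_library(LIB_PATH)
print("loaded" if lib is not None else "library not built yet")
```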
| Algorithm | Description | Shared Library |
|---|---|---|
| smm | Scalar Matrix Multiplication | smm_conv.so |
| kn2row | Kernel to Row flattening | kn2row.so |
| im2col | Image to Column flattening | im2col.so |
| direct | Naive nested-loop convolution | direct.so |
| depthwise | Depthwise separable convolution | depthwise.so |
Create and activate a virtual environment (optional but recommended):

python3 -m venv sycl_env
source sycl_env/bin/activate

Then install the Python dependencies:

pip install -r requirements.txt

This installs torch, torchvision, and other required packages.
requirements.txt includes:
torch>=2.4.0
torchvision>=0.15.0
Before compiling the SYCL kernels with oneAPI for the CUDA backend, make sure the following modules or system libraries are available:
module load SYCL/2024.0.1.46
module load CUDA/12.1.1
module load cuDNN/8.9.2.26-CUDA-12.1.1

These provide the Intel DPC++ compiler, CUDA libraries, and cuDNN support for GPU execution.
Each algorithm is written in its own .cpp file. You can compile any one of them using:
cd sycl_kernels
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda \
-Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_70 \
SMM.cpp -o ../smm_conv.so -fPIC -shared -lm
sm_70 is the architecture for the NVIDIA V100. The output is a .so shared library.
Repeat for: Direct.cpp, Kn2row.cpp, Im2col.cpp, Depthwise.cpp
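All five kernels build with the same flags, so the repetition can be scripted. The loop below is a dry run that only prints each command (remove the `echo` to actually compile); it is run from the repository root, and the output library names follow the table above:

```shell
# Print the compile command for each kernel source (dry run).
# Remove "echo" to actually invoke icpx.
for pair in SMM:smm_conv Kn2row:kn2row Im2col:im2col Direct:direct Depthwise:depthwise; do
  src="${pair%%:*}"
  out="${pair##*:}"
  echo icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda \
    -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_70 \
    "sycl_kernels/${src}.cpp" -o "${out}.so" -fPIC -shared -lm
done
```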
All nn.Conv2d layers are replaced by a ConvLayer that internally calls SYCL .so libraries via ctypes.
The following models are implemented and support all five convolution algorithms:
- ✅ VGG16
- ✅ ResNet101
- ✅ InceptionV4
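Because the SYCL kernels write into buffers that the Python side hands over, the wrapper has to know the output spatial size up front. This is the standard Conv2d arithmetic (a sketch for illustration; whether ConvLayer pre-allocates exactly this way is an assumption):

```python
def conv2d_output_shape(h, w, kernel, stride=1, padding=0):
    """Standard Conv2d output-size arithmetic (dilation omitted for brevity)."""
    out_h = (h + 2 * padding - kernel) // stride + 1
    out_w = (w + 2 * padding - kernel) // stride + 1
    return out_h, out_w

# A 3x3 kernel with padding 1 and stride 1 preserves spatial size:
print(conv2d_output_shape(224, 224, kernel=3, stride=1, padding=1))  # (224, 224)
```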
You can choose the algorithm via a command-line argument:
algo = "smm" # Scalar Matrix Multiplication
algo = "kn2row" # Kernel to Row flattening
algo = "im2col" # Image to Column flattening
algo = "direct" # Naive nested-loop
algo = "depthwise" # Per-channel depthwise

| Model | Script File |
|---|---|
| VGG16 | models/vgg16_sycl.py |
| ResNet101 | models/resnet101_sycl.py |
| InceptionV4 | models/inceptionv4_sycl.py |
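Algorithm selection of this kind is typically handled with an argparse choices list; a sketch of how each script might parse the flag (the scripts' actual argument handling may differ in details):

```python
import argparse

ALGOS = ["smm", "kn2row", "im2col", "direct", "depthwise"]

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run a CNN with a SYCL conv backend")
    parser.add_argument("--algo", choices=ALGOS, default="smm",
                        help="convolution algorithm / shared library to use")
    return parser.parse_args(argv)

args = parse_args(["--algo", "im2col"])
print(args.algo)  # im2col
```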
Set the --algo flag to select the convolution method:
python models/vgg16_sycl.py --algo smm
python models/resnet101_sycl.py --algo im2col
python models/inceptionv4_sycl.py --algo kn2row

You can also run the .out binaries directly (for kernel testing only):

./SMM.out

These log timing results to a CSV file such as:

smm_result.csv
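The per-layer timings in such CSV files can be aggregated with a few lines of Python. A sketch — the column names (`layer`, `time_ms`) are assumptions here, not the repository's actual CSV schema:

```python
import csv
import io

# Example rows in the assumed "layer,time_ms" format.
sample = io.StringIO("layer,time_ms\nconv1,2.31\nconv2,5.07\nconv3,4.62\n")

def total_time_ms(fh):
    """Sum the time_ms column across all convolution layers."""
    return sum(float(row["time_ms"]) for row in csv.DictReader(fh))

print(f"{total_time_ms(sample):.2f} ms")  # 12.00 ms
```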
.
├── sycl_kernels/ # 🔧 SYCL kernel implementations
│ ├── SMM.cpp # Scalar Matrix Multiplication
│ ├── Kn2row.cpp # Kernel to Row
│ ├── Im2col.cpp # Image to Column
│ ├── Direct.cpp # Naive loop
│ ├── Depthwise.cpp # Depthwise convolution
│ └── README.md # Kernel compilation guide
│
├── models/ # 🧠 CNN model implementations
│ ├── vgg16_sycl.py # VGG16 with dynamic SYCL conv
│ ├── resnet101_sycl.py # ResNet101 with dynamic SYCL conv
│ ├── inceptionv4_sycl.py # InceptionV4 with dynamic SYCL conv
│ └── README.md # Usage instructions
│
├── requirements.txt # Python dependencies
├── README.md # You are here
If you use this work, please cite:
"Exploring Algorithmic Design Choices for Low Latency CNN Deployment"
Changxin Li, Sanmukh Kuppannagari — Case Western Reserve University
@INPROCEEDINGS{10884187,
author={Li, Changxin and Kuppannagari, Sanmukh},
booktitle={2024 IEEE 31st International Conference on High Performance Computing, Data, and Analytics (HiPC)},
title={Exploring Algorithmic Design Choices for Low Latency CNN Deployment},
year={2024},
pages={78-88},
doi={10.1109/HiPC62374.2024.00017}
}