IPDPS'26 Artifact — Achieving Low Latency Inference on High Resolution Images by Exploiting Sparsity in Vision Transformers
This repository provides the artifact for our IPDPS 2026 paper on constraint-driven, block-wise tile scheduling for sparse attention on GPUs.
The artifact implements an end-to-end workflow that:
- constructs sparse attention graphs (adjacency matrices),
- applies structural reordering (HC / IC / RCM),
- extracts nearly-dense computation blocks,
- profiles tile-level latency/resources,
- solves an ILP (Gurobi) to generate schedules,
- benchmarks kernels and reproduces paper figures/tables.
The main idea is to convert irregular sparse attention into structured computation blocks that can be efficiently scheduled with hardware-aware optimization.
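As a rough sketch of that conversion, assuming a simple density threshold (the real extraction logic lives in `src/extract_blocks.py` and is more involved), a toy sliding-window mask can be tiled into near-dense block coordinates like this:

```python
import numpy as np

def extract_dense_blocks(mask, block, density_thresh=0.5):
    """Tile a binary attention mask into block x block tiles and keep
    the coordinates of tiles whose density exceeds the threshold."""
    n = mask.shape[0]
    kept = []
    for i in range(0, n, block):
        for j in range(0, n, block):
            tile = mask[i:i + block, j:j + block]
            if tile.mean() >= density_thresh:
                kept.append((i // block, j // block))
    return kept

# Toy sliding-window mask: each of 16 tokens attends to +/- 2 neighbors.
n, w = 16, 2
mask = np.zeros((n, n), dtype=np.float32)
for i in range(n):
    mask[i, max(0, i - w):min(n, i + w + 1)] = 1.0

blocks = extract_dense_blocks(mask, block=4)
print(blocks)  # -> [(0, 0), (1, 1), (2, 2), (3, 3)]: only near-diagonal tiles survive
```

The surviving tile coordinates are what the scheduler later treats as dense units of work.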
```
ipdps-26-vit/
├── README.md
├── requirements.txt
├── environment.yml
│
├── models/
│   ├── DynamicVit/
│   ├── RegionVit/
│   └── VisionLongformer/
│
└── src/
    ├── reorder/
    │   ├── hc.py
    │   ├── ic.py
    │   └── rcm.py
    │
    ├── gen_adj.py
    ├── extract_blocks.py
    ├── solve_ilp_gurobi.py
    └── run_bench.py
```
We recommend running on an Ampere-class GPU (e.g., NVIDIA A100 80GB). Most scripts will run on smaller GPUs by reducing problem sizes in config files, but full-scale experiments are best reproduced on A100.
```
conda env create -f environment.yml
conda activate ipdps26-vit
```

If you cannot use conda:

```
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

FlashAttention note: depending on your system, you may need:

```
pip install flash-attn --no-build-isolation
```
The ILP scheduler uses Gurobi (via gurobipy). Make sure your license is available before running ILP:
- install Gurobi and `gurobipy`
- configure your license (e.g., an academic license)
- verify that the license works on your machine
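A minimal sketch for the verification step: constructing an (empty) model is enough to trigger license activation, so this one-off check fails fast if `gurobipy` or the license is missing.

```python
# Minimal license sanity check; no real ILP is solved here.
try:
    import gurobipy as gp
    gp.Model("license_check")  # raises GurobiError if the license is missing/expired
    gurobi_ok = True
    print("Gurobi license OK")
except Exception as exc:       # ImportError, or GurobiError on license problems
    gurobi_ok = False
    print("Gurobi not usable:", exc)
```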
- Python 3.9+
- CUDA 12.x
- PyTorch 2.x
- SciPy / NumPy
- Gurobi 10.x (for ILP solving)
- NVIDIA GPU (A100 recommended)
The full reproduction runs the complete optimization pipeline and regenerates the main results reported in the paper. Because the stages are intentionally decoupled, the pipeline is executed sequentially as individual commands.
Recommended hardware: NVIDIA A100 (80GB).
For smaller GPUs, reduce input sizes in your configuration or model settings.
```
T1: Adjacency Construction
        ↓
T2: Reordering + Block Extraction
        ↓
T3: Tile Profiling
        ↓
T4: ILP Scheduling
        ↓
T5: Benchmarking
        ↓
T6: Analysis / Plotting
```
Each stage can be executed independently. Intermediate outputs are stored under results/.
Generate sparse attention adjacency matrices from model masks.
```
python src/gen_adj.py \
  --model VisionLongformer \
  --output results/graphs/A.npz
```

Output:

```
results/graphs/A.npz
```
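For illustration, a toy sliding-window adjacency can be built and round-tripped through a SciPy `.npz` container (assuming, as the extension suggests, `A.npz` holds a SciPy sparse matrix; `A_toy.npz` is a hypothetical filename):

```python
import numpy as np
import scipy.sparse as sp

# Toy sliding-window adjacency: each of 8 tokens attends to +/- 1 neighbor,
# the kind of mask gen_adj.py derives from a model's attention pattern.
n, w = 8, 1
rows, cols = [], []
for i in range(n):
    for j in range(max(0, i - w), min(n, i + w + 1)):
        rows.append(i)
        cols.append(j)
A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

sp.save_npz("A_toy.npz", A)   # SciPy's .npz container for sparse matrices
A_back = sp.load_npz("A_toy.npz")
print(A_back.nnz)             # -> 22 nonzeros for n=8, w=1
```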
Apply structural reordering and extract nearly-dense computation blocks.
IC:

```
python src/reorder/ic.py \
  --input results/graphs/A.npz \
  --output results/graphs/A_ic.npz
```

HC:

```
python src/reorder/hc.py \
  --input results/graphs/A.npz \
  --output results/graphs/A_hc.npz
```

RCM:

```
python src/reorder/rcm.py \
  --input results/graphs/A.npz \
  --output results/graphs/A_rcm.npz
```

Block extraction:

```
python src/extract_blocks.py \
  --input results/graphs/A_ic.npz \
  --output results/blocks/blocks_ic.json
```

Supported reordering methods:

- `rcm` — Reverse Cuthill–McKee (SciPy baseline)
- `hc` — hierarchical clustering reordering
- `ic` — iterative clustering reordering
- `none` — disable reordering
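The `rcm` baseline comes directly from SciPy; a small sketch showing how it shrinks the bandwidth of a scrambled banded graph:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(A):
    """Maximum |i - j| over nonzeros: the quantity RCM tries to shrink."""
    coo = A.tocoo()
    return int(np.abs(coo.row - coo.col).max())

# Scramble a tridiagonal (path-graph) matrix with a random permutation.
n = 32
band = sp.diags([np.ones(n - 1), np.ones(n), np.ones(n - 1)], [-1, 0, 1]).tocsr()
p = np.random.default_rng(0).permutation(n)
scrambled = band[p][:, p]

# RCM recovers a low-bandwidth ordering from the scrambled structure.
perm = reverse_cuthill_mckee(scrambled, symmetric_mode=True)
reordered = scrambled[perm][:, perm]
print(bandwidth(scrambled), "->", bandwidth(reordered))
```

After reordering, nonzeros cluster near the diagonal, which is what makes nearly-dense block extraction effective.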
Profile candidate tile sizes and gather latency/resource statistics for scheduling.
```
python src/run_bench.py \
  --profile_only \
  --blocks results/blocks/blocks_ic.json \
  --output results/profiling/profile_ic.json
```

Output:

```
results/profiling/profile_ic.json
```
Solve the hardware-aware tile scheduling problem.
```
python src/solve_ilp_gurobi.py \
  --blocks results/blocks/blocks_ic.json \
  --profile results/profiling/profile_ic.json \
  --output results/schedules/schedule_ic.json
```

Execute kernels according to the generated schedule and collect runtime statistics.

```
python src/run_bench.py \
  --schedule results/schedules/schedule_ic.json \
  --output results/logs/bench_ic.csv
```

The benchmarking stage produces the raw logs and CSV files containing latency, throughput, and resource-utilization statistics.
Results are generated automatically during benchmarking (stage T5); re-run `src/run_bench.py` with the same schedule if you need fresh logs. Place representative figures in:

```
figures/
```
- Benchmarking uses warmup iterations before timing.
- Multiple runs are averaged for stable measurements.
- FlashAttention kernels may introduce minor non-determinism.
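A minimal CPU-only sketch of the warmup-then-average timing pattern described above (the actual harness in `src/run_bench.py` times GPU kernels, which this sketch does not attempt):

```python
import time
import statistics

def bench(fn, warmup=5, iters=20):
    """Run `fn` a few times untimed to warm caches/JIT, then time `iters`
    runs and report mean and stdev in milliseconds."""
    for _ in range(warmup):
        fn()
    times_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(times_ms), statistics.stdev(times_ms)

mean_ms, std_ms = bench(lambda: sum(i * i for i in range(10_000)))
print(f"{mean_ms:.3f} ms +/- {std_ms:.3f} ms")
```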
If you use this artifact, please cite our paper (BibTeX to be added after publication).
Changxin Li, Sanmukh Kuppannagari — Case Western Reserve University