IPDPS'26 Artifact — Achieving Low Latency Inference on High Resolution Images by Exploiting Sparsity in Vision Transformers

This repository provides the artifact for our IPDPS 2026 paper on constraint-driven, block-wise tile scheduling for sparse attention on GPUs.

The artifact implements an end-to-end workflow that:

constructs sparse attention graphs (adjacency matrices),
applies structural reordering (HC / IC / RCM),
extracts nearly-dense computation blocks,
profiles tile-level latency/resources,
solves an ILP (Gurobi) to generate schedules,
benchmarks kernels and reproduces paper figures/tables.

The main idea is to convert irregular sparse attention into structured computation blocks that can be efficiently scheduled with hardware-aware optimization.

Repository Structure

ipdps-26-vit/
├── README.md
├── requirements.txt
├── environment.yml
│
├── models/
│   ├── DynamicVit/
│   ├── RegionVit/
│   └── VisionLongformer/
│
└── src/
    ├── reorder/
    │   ├── hc.py
    │   ├── ic.py
    │   └── rcm.py
    │
    ├── gen_adj.py
    ├── extract_blocks.py
    ├── solve_ilp_gurobi.py
    └── run_bench.py

Setup

We recommend an Ampere-class GPU (e.g., NVIDIA A100 80GB). Most scripts also run on smaller GPUs if you reduce the problem sizes in the config files, but the full-scale experiments are best reproduced on an A100.

Environment (Conda)

conda env create -f environment.yml
conda activate ipdps26-vit

Environment (Pip)

If you cannot use conda:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

FlashAttention note: depending on your system, you may need to install flash-attn separately:
pip install flash-attn --no-build-isolation

Gurobi License

The ILP scheduler uses Gurobi (via gurobipy). Make sure a valid license is available before running the ILP stage (T4):

install Gurobi and gurobipy
configure your license (e.g., an academic license)
verify that the license works on your machine
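One way to verify the license is to build and solve a trivial model, since creating a Gurobi environment triggers the license check. The sketch below is not part of the artifact; the helper name `check_gurobi_license` is hypothetical, and it assumes a gurobipy version in which `Env` and `Model` can be used as context managers.

```python
def check_gurobi_license():
    """Return (ok, message) describing whether a tiny Gurobi model can be solved."""
    try:
        import gurobipy as gp
    except ImportError as exc:
        return False, f"gurobipy is not installed: {exc}"
    try:
        # Creating an environment forces Gurobi to look up a license.
        with gp.Env() as env, gp.Model(env=env) as model:
            x = model.addVar(vtype=gp.GRB.BINARY, name="x")
            model.setObjective(x, gp.GRB.MAXIMIZE)
            model.optimize()
            return True, f"license OK, objective = {model.ObjVal}"
    except gp.GurobiError as exc:
        return False, f"Gurobi reported a problem: {exc}"


if __name__ == "__main__":
    ok, message = check_gurobi_license()
    print("PASS" if ok else "FAIL", "-", message)
```

If this reports a failure, fix the license setup before attempting T4; the other stages do not appear to need Gurobi.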

Requirements

Python 3.9+
CUDA 12.x
PyTorch 2.x
SciPy / NumPy
Gurobi 10.x (for ILP solving)
NVIDIA GPU (A100 recommended)
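A quick preflight check can confirm that the interpreter and packages listed above are visible before starting a long run. This is a minimal sketch using only the standard library; the import names (`torch`, `numpy`, `scipy`, `gurobipy`) are assumptions mapped from the requirements list, and the helper `preflight` is hypothetical. It only checks that the modules can be found, not their versions.

```python
import importlib.util
import sys

# Import names assumed from the requirements list; adjust if your setup differs.
REQUIRED_MODULES = ("torch", "numpy", "scipy", "gurobipy")
MIN_PYTHON = (3, 9)


def preflight(modules=REQUIRED_MODULES, min_python=MIN_PYTHON):
    """Return a dict mapping each check name to True (found) or False (missing)."""
    report = {"python": sys.version_info[:2] >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for check, ok in preflight().items():
        print(f"{check:10s} {'ok' if ok else 'MISSING'}")
```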

Usage

Full Paper Reproduction

The full reproduction runs the complete optimization pipeline and regenerates the main results reported in the paper.
Since each stage is intentionally decoupled, the pipeline is executed sequentially using individual commands.

Recommended hardware: NVIDIA A100 (80GB).
For smaller GPUs, reduce input sizes in your configuration or model settings.

Summary of Workflow

T1: Adjacency Construction
      ↓
T2: Reordering + Block Extraction
      ↓
T3: Tile Profiling
      ↓
T4: ILP Scheduling
      ↓
T5: Benchmarking
      ↓
T6: Analysis / Plotting

Stage-by-Stage Workflow (T1–T6)

Each stage can be executed independently. Intermediate outputs are stored under results/.
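Because the stages are separate commands, a small driver can chain them. The sketch below assembles the T1 to T5 commands exactly as they appear in this README and runs them in order. The function names are hypothetical, and the output file names for `hc` and `rcm` in the later stages (for example `blocks_hc.json`) are an assumption that follows the `ic` naming pattern used in the examples.

```python
import subprocess
import sys


def build_pipeline(method="ic", model="VisionLongformer", python=sys.executable):
    """Return the T1 to T5 commands from this README as argv lists."""
    if method not in {"ic", "hc", "rcm"}:
        raise ValueError(f"unsupported reordering method: {method}")
    graph = "results/graphs/A.npz"
    reordered = f"results/graphs/A_{method}.npz"
    blocks = f"results/blocks/blocks_{method}.json"
    profile = f"results/profiling/profile_{method}.json"
    schedule = f"results/schedules/schedule_{method}.json"
    log = f"results/logs/bench_{method}.csv"
    return [
        [python, "src/gen_adj.py", "--model", model, "--output", graph],
        [python, f"src/reorder/{method}.py", "--input", graph, "--output", reordered],
        [python, "src/extract_blocks.py", "--input", reordered, "--output", blocks],
        [python, "src/run_bench.py", "--profile_only", "--blocks", blocks, "--output", profile],
        [python, "src/solve_ilp_gurobi.py", "--blocks", blocks, "--profile", profile,
         "--output", schedule],
        [python, "src/run_bench.py", "--schedule", schedule, "--output", log],
    ]


def run_pipeline(method="ic", dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for argv in build_pipeline(method):
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing stage


if __name__ == "__main__":
    run_pipeline("ic", dry_run=True)
```

Running with `dry_run=True` only prints the commands, which is a cheap way to check paths before committing GPU time.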

T1 — Adjacency Construction

Generate sparse attention adjacency matrices from model masks.

python src/gen_adj.py \
    --model VisionLongformer \
    --output results/graphs/A.npz

Output:

results/graphs/A.npz
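To make the notion of an attention adjacency matrix concrete, here is a toy sketch that builds one for a sliding-window mask on a 2D patch grid with optional global tokens. It illustrates the kind of structure T1 produces; it is not the code in gen_adj.py, the function and its parameters are hypothetical, and it does not write the A.npz format, which this README does not document.

```python
def sliding_window_adjacency(height, width, radius=1, num_global=0):
    """Build a 0/1 adjacency matrix (list of lists) for a toy sparse attention mask.

    Tokens 0..num_global-1 are global tokens; the remaining tokens are patches
    on a height x width grid in row-major order.
    """
    n = num_global + height * width
    adj = [[0] * n for _ in range(n)]
    # Global tokens attend to, and are attended by, every token.
    for g in range(num_global):
        for t in range(n):
            adj[g][t] = adj[t][g] = 1
    # Each patch attends to patches within `radius` rows and columns of itself.
    for r in range(height):
        for c in range(width):
            i = num_global + r * width + c
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        adj[i][num_global + rr * width + cc] = 1
    return adj


if __name__ == "__main__":
    for row in sliding_window_adjacency(3, 3, radius=1):
        print("".join("#" if v else "." for v in row))
```

On a 3 x 3 grid with `radius=1` the matrix has 49 nonzeros out of 81 entries, and the nonzeros form several diagonal bands rather than one contiguous block, which is what the reordering step in T2 is meant to improve.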

T2 — Graph Reordering + Block Extraction

Apply structural reordering and extract nearly-dense computation blocks.

Reordering (choose one)

IC

python src/reorder/ic.py \
    --input results/graphs/A.npz \
    --output results/graphs/A_ic.npz

HC

python src/reorder/hc.py \
    --input results/graphs/A.npz \
    --output results/graphs/A_hc.npz

RCM

python src/reorder/rcm.py \
    --input results/graphs/A.npz \
    --output results/graphs/A_rcm.npz
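For intuition about what RCM reordering does, the sketch below is a dependency-free Reverse Cuthill-McKee on a small symmetric matrix, together with a bandwidth measure. SciPy provides `scipy.sparse.csgraph.reverse_cuthill_mckee`; whether rcm.py calls that function is an assumption, and this pure-Python version is for illustration only.

```python
from collections import deque


def reverse_cuthill_mckee(adj):
    """Return an RCM ordering for a symmetric 0/1 matrix given as a list of lists."""
    n = len(adj)
    neighbors = [[j for j in range(n) if adj[i][j] and j != i] for i in range(n)]
    degree = [len(nb) for nb in neighbors]
    visited = [False] * n
    order = []
    # Cover every connected component, starting each from its lowest-degree node.
    for start in sorted(range(n), key=lambda v: degree[v]):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Visit unvisited neighbors in order of increasing degree.
            for w in sorted(neighbors[v], key=lambda u: degree[u]):
                if not visited[w]:
                    visited[w] = True
                    queue.append(w)
    return order[::-1]  # reversing the Cuthill-McKee order gives RCM


def bandwidth(adj, order=None):
    """Largest |i - j| over nonzero entries after applying the permutation `order`."""
    n = len(adj)
    order = list(range(n)) if order is None else order
    pos = {v: k for k, v in enumerate(order)}
    return max(
        (abs(pos[i] - pos[j]) for i in range(n) for j in range(n) if adj[i][j]),
        default=0,
    )
```

For a path graph whose nodes are labeled out of order (edges 0-3, 3-1, 1-4, 4-2), the bandwidth drops from 3 to 1 after reordering: the nonzeros move next to the diagonal, where dense blocks are easier to find.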

Block Extraction

python src/extract_blocks.py \
    --input results/graphs/A_ic.npz \
    --output results/blocks/blocks_ic.json
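A simple way to picture "nearly-dense" blocks is to cut the reordered matrix into fixed-size tiles and keep those whose fill ratio exceeds a threshold. The sketch below does exactly that. It is an assumption about the general idea, not the algorithm in extract_blocks.py; the function name, the fixed tile grid, the `min_density` threshold, and the returned tuple layout are all hypothetical.

```python
def dense_tiles(adj, tile=2, min_density=0.75):
    """Return (row0, col0, density) for each tile whose fill ratio is >= min_density."""
    n = len(adj)
    blocks = []
    for r0 in range(0, n, tile):
        for c0 in range(0, n, tile):
            rows = range(r0, min(r0 + tile, n))
            cols = range(c0, min(c0 + tile, n))
            nnz = sum(1 for i in rows for j in cols if adj[i][j])
            density = nnz / (len(rows) * len(cols))
            if density >= min_density:
                blocks.append((r0, c0, density))
    return blocks


if __name__ == "__main__":
    reordered = [
        [1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 1],
        [0, 0, 1, 0],
    ]
    print(dense_tiles(reordered))  # [(0, 0, 1.0), (2, 2, 0.75)]
```

The stray nonzero at row 1, column 3 falls into a tile with density 0.25 and is dropped by this sketch; a real extractor has to handle such leftover entries in some other way.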

Supported reordering methods:

rcm — Reverse Cuthill–McKee (SciPy baseline)
hc — hierarchical clustering reordering
ic — iterative clustering reordering
none — disable reordering

T3 — Tile Profiling

Profile candidate tile sizes and gather latency/resource statistics for scheduling.

python src/run_bench.py \
    --profile_only \
    --blocks results/blocks/blocks_ic.json \
    --output results/profiling/profile_ic.json

Output:

results/profiling/profile_ic.json

T4 — ILP Scheduling (Gurobi)

Solve the hardware-aware tile scheduling problem.

python src/solve_ilp_gurobi.py \
    --blocks results/blocks/blocks_ic.json \
    --profile results/profiling/profile_ic.json \
    --output results/schedules/schedule_ic.json
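The scheduling problem can be pictured as choosing one tile configuration per block so that total latency is minimized under a resource limit. The sketch below solves a toy instance by exhaustive search instead of an ILP, so it needs no Gurobi license. The formulation (summed latency, one summed resource budget), the function name, and the profile layout are assumptions made for illustration; the constraints actually used by solve_ilp_gurobi.py are described in the paper, not here.

```python
from itertools import product


def best_schedule(profile, budget):
    """Pick one tile option per block, minimizing total latency within a budget.

    profile: {block_id: [(option_name, latency, resource), ...]}
    Returns (total_latency, {block_id: option_name}), or None if infeasible.
    """
    block_ids = sorted(profile)
    best = None
    # Enumerate every combination of one option per block (toy sizes only).
    for choice in product(*(profile[b] for b in block_ids)):
        latency = sum(opt[1] for opt in choice)
        resource = sum(opt[2] for opt in choice)
        if resource <= budget and (best is None or latency < best[0]):
            best = (latency, {b: opt[0] for b, opt in zip(block_ids, choice)})
    return best


if __name__ == "__main__":
    # Made-up numbers: larger tiles are faster but use more of the resource.
    toy_profile = {
        "b0": [("16x16", 4, 1), ("32x32", 2, 3)],
        "b1": [("16x16", 5, 1), ("32x32", 2, 3)],
    }
    print(best_schedule(toy_profile, budget=4))
```

Exhaustive search grows exponentially with the number of blocks, which is why a solver such as Gurobi is used for realistic problem sizes.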

T5 — Benchmarking

Execute kernels according to the generated schedule and collect runtime statistics.

python src/run_bench.py \
    --schedule results/schedules/schedule_ic.json \
    --output results/logs/bench_ic.csv

T6 — Analysis and Figure Generation

No separate analysis command is needed: the benchmarking stage (T5) directly produces the raw logs and CSV files containing latency, throughput, and resource utilization statistics. The command below is the same one used in T5.

# results are generated automatically during benchmarking
python src/run_bench.py \
    --schedule results/schedules/schedule_ic.json \
    --output results/logs/bench_ic.csv
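Once a benchmark CSV exists, summary statistics can be computed with the standard library. The sketch below is an assumption-laden example: the column name `latency_ms` and the sample rows are made up, since this README does not document the layout of bench_ic.csv. Adjust the column name to match the real file.

```python
import csv
import io
import statistics


def summarize(csv_text, column="latency_ms"):
    """Return count, mean, median, and sample standard deviation of one CSV column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


# Made-up sample data; the real bench_ic.csv layout may differ.
SAMPLE = "run,latency_ms\n0,10.0\n1,12.0\n2,14.0\n"

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To use it on a real log, read the file contents first, for example `summarize(open("results/logs/bench_ic.csv").read(), column=...)` with the actual column name.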

Example Results

Place representative figures in:

figures/

End-to-End Speedup

Reordered Sparse Attention Heatmap

Reproducibility Notes

Benchmarking uses warmup iterations before timing.
Multiple runs are averaged for stable measurements.
FlashAttention kernels may introduce minor non-determinism.
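The warmup-then-average pattern described above can be sketched as follows. This is a generic CPU wall-clock harness, not the timing code in run_bench.py, and the helper name `time_callable` is hypothetical. For GPU kernels, launches are asynchronous, so the device must be synchronized (for example with `torch.cuda.synchronize()`) before each clock read, or CUDA events should be used instead.

```python
import statistics
import time


def time_callable(fn, warmup=3, repeats=10):
    """Run fn() `warmup` times untimed, then return `repeats` timings in milliseconds."""
    for _ in range(warmup):
        fn()  # warmup runs absorb one-time costs such as caching and lazy init
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        # For GPU work, synchronize the device here before reading the clock.
        samples.append((time.perf_counter() - start) * 1e3)
    return samples


if __name__ == "__main__":
    samples = time_callable(lambda: sum(range(100_000)))
    print(f"mean {statistics.mean(samples):.3f} ms, "
          f"stdev {statistics.stdev(samples):.3f} ms over {len(samples)} runs")
```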

Citation

If you use this artifact, please cite our paper (BibTeX to be added after publication).

Contact

Changxin Li, Sanmukh Kuppannagari — Case Western Reserve University
