HPC & Deep Learning Systems Researcher
Optimizing the "plumbing" of AI — from kernels to clusters.
I research low-level optimization for Deep Learning workloads, focusing on bridging the gap between high-level PyTorch APIs and hardware reality. My work involves:
- Kernel Optimization: Writing custom OpenAI Triton kernels that outperform PyTorch eager execution (fused attention, softmax).
- Quantization: Implementing 4-bit/INT8 inference pipelines (AWQ/GPTQ) for deploying 7B+ models on consumer GPUs.
- Distributed Systems: Analyzing NCCL communication primitives and distributed training bottlenecks (DDP/FSDP).
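To make the quantization work concrete: the sketch below shows plain per-tensor absmax INT8 weight quantization in NumPy. It is a deliberately minimal illustration, not AWQ or GPTQ themselves (those add activation-aware scaling and Hessian-guided rounding on top of this basic scheme); the function names are mine, not from any repo here.

```python
import numpy as np

def quantize_int8(w):
    # Per-tensor absmax quantization: the scale maps the largest
    # magnitude weight onto the symmetric INT8 range [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an FP32 approximation of the original weights.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)
# Round-to-nearest bounds the reconstruction error by half a scale step.
err = np.abs(dequantize(q, scale) - w).max()
```

Real 4-bit pipelines additionally quantize in small groups (e.g. 128 weights per scale) to keep that error bound tight across outlier channels.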
| Domain | Tools & Frameworks |
|---|---|
| HPC & Kernels | OpenAI Triton · CUDA (Concepts) · NVIDIA Nsight Compute · TensorRT |
| Deep Learning | PyTorch · HuggingFace (Transformers/PEFT) · AutoGPTQ · ONNX Runtime |
| Infrastructure | Docker · Linux (Kernel/eBPF) · Bash · Slurm |
| Core | Python (AsyncIO) · C++ · PostgreSQL · NumPy |
- high-performance-deep-learning: My primary research repo containing custom Triton kernels, quantization benchmarks, and distributed system simulations.
- Neuro-Hedge: A vectorized Monte Carlo simulation engine for Reinforcement Learning.
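Neuro-Hedge's internals aren't reproduced here, but the core idea of a vectorized Monte Carlo engine can be sketched generically: simulate all paths at once as one big array instead of looping per path. The geometric Brownian motion model and function name below are illustrative assumptions, not the repo's actual API.

```python
import numpy as np

def simulate_gbm(s0, mu, sigma, n_paths, n_steps, dt, seed=0):
    # Vectorized geometric Brownian motion: every path advances
    # together via a single (n_paths, n_steps) array of normal draws,
    # so there is no Python-level loop over paths or time steps.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_paths, n_steps))
    log_inc = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    return s0 * np.exp(np.cumsum(log_inc, axis=1))

# 10,000 one-year paths at daily resolution, in one array operation.
paths = simulate_gbm(100.0, 0.05, 0.2, n_paths=10_000, n_steps=252, dt=1 / 252)
```

The same array layout drops in cleanly as a batched RL environment: each row is one rollout, and a policy can be evaluated against all of them per step.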
