How to write fast Number Theoretic Transform kernels — from math to metal.
Start with the notebooks in order: they walk from DFT→NTT foundations through CKKS-RNS motivation, reference implementations of Cooley-Tukey and Stockham, modular arithmetic tricks (Barrett/Montgomery/Shoup), Harvey's lazy-reduction butterflies, and finally benchmarking methodology with roofline analysis.
The Python reference package ntt_edu under src/ is the single source of truth
for correctness; every optimized kernel (C++ scalar, SIMD, GPU, Tenstorrent) is
validated against it.
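The validation idea above can be sketched as a parity check between a slow, obviously correct transform and a faster one. This is a minimal illustration, not the ntt_edu API: the function names `naive_ntt` and `fast_ntt` and the toy parameters (q = 17, n = 8, omega = 2, where 2 has multiplicative order 8 modulo 17) are assumptions chosen for the example.

```python
import random


def naive_ntt(a, omega, q):
    """O(n^2) reference: A[k] = sum_j a[j] * omega^(j*k) mod q."""
    n = len(a)
    return [
        sum(a[j] * pow(omega, j * k, q) for j in range(n)) % q
        for k in range(n)
    ]


def bit_reverse_permute(a):
    """Return a copy of `a` with indices bit-reversed (len(a) a power of two)."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [0] * n
    for i in range(n):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = a[i]
    return out


def fast_ntt(a, omega, q):
    """Iterative radix-2 decimation-in-time Cooley-Tukey NTT.

    Assumes len(a) is a power of two and omega is a primitive
    len(a)-th root of unity modulo q.
    """
    n = len(a)
    a = bit_reverse_permute(a)
    length = 2
    while length <= n:
        half = length // 2
        w_len = pow(omega, n // length, q)  # primitive length-th root of unity
        for start in range(0, n, length):
            w = 1
            for j in range(half):
                u = a[start + j]
                v = a[start + j + half] * w % q
                a[start + j] = (u + v) % q          # butterfly: upper output
                a[start + j + half] = (u - v) % q   # butterfly: lower output
                w = w * w_len % q
        length *= 2
    return a


# Hypothetical toy parameters for illustration only.
Q, N, OMEGA = 17, 8, 2

rng = random.Random(0)  # fixed seed so the check is reproducible
for _ in range(100):
    x = [rng.randrange(Q) for _ in range(N)]
    assert fast_ntt(x, OMEGA, Q) == naive_ntt(x, OMEGA, Q)
```

The same pattern scales to the optimized kernels: keep the reference simple enough to audit by eye, and compare outputs element by element on random inputs.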
    uv sync
    uv run jupyter lab

- PLAN.md: live design artifact (read this first)
- CLAUDE.md: protocol for AI coding sessions across machines
- STYLE.md: coding and notebook conventions
- notebooks/: numbered educational notebooks
- src/ntt_edu/: Python reference implementations
- src/ntt_kernels/: C++ optimized kernels (Phase B+)
- tests/: correctness and parity tests
- doc/: numbered design docs, plus discussions/ for decision logs
- bench-results/: JSONL measurements and roofline plots