Merged
88 commits
b3594f2
[SPIR-V] Sparse adstack heap (Phases A-C): lazy LCA-claim row id
duburcqa Apr 30, 2026
3177b4b
[SPIR-V][Runtime] Sparse adstack heap (Phase D-1, D-2): per-task coun…
duburcqa Apr 30, 2026
9aad0d8
[SPIR-V] Sparse adstack heap (Stage 1, IR pattern match for static bo…
duburcqa Apr 30, 2026
279c257
[SPIR-V] Sparse adstack heap (Stage 1.5, split per-heap-kind row claim)
duburcqa Apr 30, 2026
502bc89
[SPIR-V] Sparse adstack heap (Stage 1.3, generic bound-reducer comput…
duburcqa Apr 30, 2026
8efec24
[Runtime] Sparse adstack heap (Stage 1.4 scaffold): bound-reducer pip…
duburcqa Apr 30, 2026
36d658f
[Runtime] Sparse adstack heap (Stage 1.4): dispatch reducer per task …
duburcqa Apr 30, 2026
6828d76
[SPIR-V] Sparse adstack heap: hoist matched_var alloca above OpBranch…
duburcqa Apr 30, 2026
1733dea
[Test] Sparse adstack heap (Stage 1.6): pin grad correctness on ndarr…
duburcqa Apr 30, 2026
dbb27c6
[SPIR-V][Runtime] Sparse adstack heap: defense-in-depth bounds check …
duburcqa Apr 30, 2026
bf5eb73
[SPIR-V][Runtime] Sparse adstack heap: bound reducer reads SNode-back…
duburcqa Apr 30, 2026
1b90015
[Test] Sparse adstack heap: pin SNode-backed bound_expr grad correctn…
duburcqa Apr 30, 2026
43a01ad
[WIP] Sparse adstack heap: shared static analysis + LLVM split-heap i…
duburcqa Apr 30, 2026
ccb6f91
[LLVM] Sparse adstack heap: enable shared static analysis on LLVM cod…
duburcqa Apr 30, 2026
fb7a98f
[LLVM] Sparse adstack heap: ensure helpers for split float / int heap…
duburcqa Apr 30, 2026
474b91f
[LLVM] Sparse adstack heap: emit lazy float-heap row claim helper at …
duburcqa Apr 30, 2026
cff3c54
[LLVM] Sparse adstack heap: per-kernel row counter
duburcqa Apr 30, 2026
96b726c
[LLVM] Sparse adstack heap: activate the lazy LCA-block float-heap ro…
duburcqa Apr 30, 2026
ba93287
[LLVM] Sparse adstack heap: route emit_ad_stack_single_slot_ptr throu…
duburcqa Apr 30, 2026
ce23bc6
[LLVM] Sparse adstack heap: route every push / pop / load-top / load-…
duburcqa Apr 30, 2026
5c5dd9a
[LLVM] Sparse adstack heap: activate the lazy float-heap path in visi…
duburcqa Apr 30, 2026
783a99b
[LLVM] Sparse adstack heap: host-side ndarray bound_expr reducer wire…
duburcqa Apr 30, 2026
a2fcade
[LLVM] Sparse adstack heap: split float heap allocation
duburcqa Apr 30, 2026
133c2b2
[LLVM] Sparse adstack heap: per-arch device-side reducer for CUDA / A…
duburcqa Apr 30, 2026
1c0d15f
[LLVM] Sparse adstack heap: reflow comments in llvm_runtime_executor.…
duburcqa Apr 30, 2026
88bae23
[DEBUG] Sparse adstack heap: print [ADSTACK-FHEAP] / [ADSTACK-FHEAP-L…
duburcqa Apr 30, 2026
814b56b
[LLVM] Sparse adstack heap: SNode-backed gate capture
duburcqa Apr 30, 2026
e620aaf
[LLVM] Sparse adstack heap: post-reducer float-heap sizing
duburcqa Apr 30, 2026
2ea77d7
[LLVM] Sparse adstack heap: unconditional split routing
duburcqa Apr 30, 2026
e0454ac
[LLVM] Sparse adstack heap: drop unused legacy combined-heap allocation
duburcqa Apr 30, 2026
ab3ae23
[Lang] Sparse adstack heap: drop the [ADSTACK-FHEAP] / [ADSTACK-HEAP-…
duburcqa Apr 30, 2026
5e1be77
[Lang] Sparse adstack heap: address PR review fixes
duburcqa Apr 30, 2026
90cdb1c
[Test] Sparse adstack heap: extend the bound_expr ndarray gate test t…
duburcqa Apr 30, 2026
1cd2389
[Lang] Sparse adstack heap: handle SNode-backed bound_expr on the LLV…
duburcqa Apr 30, 2026
ab6960a
[Lang] Sparse adstack heap: speculative defense-in-depth and predicat…
duburcqa Apr 30, 2026
d4c547d
[Test] Sparse adstack heap: parametrize the memory-savings end-to-end…
duburcqa Apr 30, 2026
1425a4c
[Lang] Sparse adstack heap: fix per-task bound-reducer length on the …
duburcqa Apr 30, 2026
1114d99
[Test] Sparse adstack heap: drop the resource-budget meta-commentary …
duburcqa Apr 30, 2026
f0759e0
[Lang] Sparse adstack heap: floor (not ceiling) division when computi…
duburcqa Apr 30, 2026
3595c3f
[Lang] Sparse adstack heap: address remaining bot-flagged review issues
duburcqa Apr 30, 2026
9b73e9e
[Test] Sparse adstack heap: clarify the test_adstack_static_bound_exp…
duburcqa Apr 30, 2026
8bdcaf5
[Test] Sparse adstack heap: switch test_adstack_static_bound_expr_non…
duburcqa Apr 30, 2026
3abb018
[Lang] Sparse adstack heap: extend the LLVM CPU / CUDA / AMDGPU bound…
duburcqa Apr 30, 2026
2b73144
[Test] Sparse adstack heap: pin the LLVM CUDA / AMDGPU dispatch-cap f…
duburcqa Apr 30, 2026
99b5743
[Test] Sparse adstack heap: pin the LLVM CPU host-reducer SNode arm o…
duburcqa Apr 30, 2026
f4ef8ab
[Lang] Sparse adstack heap: bound the snode_resolver tree-id scan wit…
duburcqa Apr 30, 2026
524e6ac
[Test] Sparse adstack heap: pin the LLVM CUDA / AMDGPU device sizer p…
duburcqa Apr 30, 2026
5c2bbf4
[Lang] Sparse adstack heap: extend the SPIR-V bound-reducer dispatch …
duburcqa Apr 30, 2026
198287c
[Test] Sparse adstack heap: pin the SPIR-V bound-reducer f64 gating-f…
duburcqa Apr 30, 2026
8855942
[Lang] Sparse adstack heap: hoist the AdStackBoundRowCapacity buffer …
duburcqa Apr 30, 2026
bef9771
[Lang] Sparse adstack heap: switch the LLVM CUDA / AMDGPU launchers' …
duburcqa Apr 30, 2026
0d3d639
[Test] Sparse adstack heap: pin the SPIR-V launcher's resolve_length …
duburcqa May 1, 2026
085f2ef
[Lang] Sparse adstack heap: restore deleted explanatory comments flag…
duburcqa May 1, 2026
0ef750f
[Lang] Sparse adstack heap: persistent QD_DEBUG_ADSTACK heap-bind pri…
duburcqa May 1, 2026
1c11bd5
[Lang] Sparse adstack heap: skip eager-path tasks in synchronize() la…
duburcqa May 1, 2026
12e0ba0
[Test] Sparse adstack heap: pin eager-task last_observed_rows skip vi…
duburcqa May 1, 2026
dbcdf40
[Lang] Sparse adstack heap: mirror stride_int_bytes in the LLVM devic…
duburcqa May 1, 2026
845e168
[Lang] Sparse adstack heap: walk LLVM declaration-order SNode offsets…
duburcqa May 1, 2026
6f8de95
[Test] Sparse adstack heap: pin the multi-leaf dense SNode gate offse…
duburcqa May 1, 2026
c9f44f0
[Lang] Sparse adstack heap: validate gate index loop matches first it…
duburcqa May 1, 2026
f94d7db
[Test] Sparse adstack heap: drop unjustified arch restrictions on PR-…
duburcqa May 1, 2026
3b24178
[Docs] Sparse adstack heap: tighten autodiff.md num_threads cap row +…
duburcqa May 1, 2026
e002f45
[Lang] Sparse adstack heap: hard-error the SPIR-V tertiary heap-sizin…
duburcqa May 1, 2026
36848d8
[Lang] Sparse adstack heap: close the symmetric-form, SNode-arm and S…
duburcqa May 1, 2026
c8c7274
[Lang] Sparse adstack heap: extend the LLVM codegen adstack-alloca pr…
duburcqa May 1, 2026
7b919e0
[Test] Sparse adstack heap: rewrite PR-added test docstrings with end…
duburcqa May 1, 2026
6800252
[Lang] Sparse adstack heap: reflow PR-added comments to fill 120 cols…
duburcqa May 1, 2026
82002b3
[Lang] Sparse adstack heap: restore the align_up_8 alignment-rational…
duburcqa May 1, 2026
e8ef1ec
[Lang] Sparse adstack heap: gate the per-launch publish work in CPU /…
duburcqa May 1, 2026
871464f
[Lang] Sparse adstack heap: gate the bound_expr capture on a conserva…
duburcqa May 1, 2026
f5c1fcf
[Lang] Sparse adstack heap: rewrite PR-added comments flagged by the …
duburcqa May 1, 2026
27be3ab
[Test] Sparse adstack heap: add a C++ unit test for build_adstack_bou…
duburcqa May 1, 2026
45ca699
[Lang] Sparse adstack heap: extract the lazy-claim / bound-reducer / …
duburcqa May 1, 2026
fe68eed
[Doc] Sparse adstack heap: fix the 'raise it' direction in the ad_sta…
duburcqa May 1, 2026
beaa49a
[Lang] Sparse adstack heap: validate the SNode access-path's index ex…
duburcqa May 1, 2026
9933b62
[Lang] Sparse adstack heap: declare the per_thread_stride_float_bytes…
duburcqa May 1, 2026
0ccdc94
[Lang] Sparse adstack heap: serialize ad_stack_sparse_threshold_bytes…
duburcqa May 1, 2026
d37f2ae
[Doc] Sparse adstack heap: rewrite ad_stack_experimental_enabled draw…
duburcqa May 1, 2026
bbc0191
[Lang] Sparse adstack heap: refresh two stale codegen_llvm.cpp commen…
duburcqa May 1, 2026
76acc1b
[Doc] Sparse adstack heap: spell out that enabling adstack globally i…
duburcqa May 1, 2026
47fba3a
[Doc] Sparse adstack heap: add a Recommendation note to autodiff.md t…
duburcqa May 1, 2026
24ed143
[Doc] Sparse adstack heap: soften the recommendation prefix from 'we …
duburcqa May 1, 2026
2c8d275
[Lang] Sparse adstack heap: address bot review - cache-key gap, debug…
duburcqa May 1, 2026
1c0011d
[Lang] Sparse adstack heap: drop over-strict SNode validation, cap SP…
duburcqa May 1, 2026
f3da2f8
Mark unit test as xfail.
duburcqa May 1, 2026
aa1e214
[Lang] Sparse adstack heap: address bot review (dead code, strict-ali…
duburcqa May 1, 2026
e951423
[Doc] Sparse adstack heap: document SNode-gate compound-index limitat…
duburcqa May 1, 2026
845bd82
[Doc] Sparse adstack heap: prefix qd.field compound-index gate limita…
duburcqa May 1, 2026
18 changes: 12 additions & 6 deletions docs/source/user_guide/autodiff.md
@@ -4,7 +4,9 @@ Automatic differentiation (autodiff) computes the exact gradient of a kernel's o

**Note.** Throughout this page, the *primal* is the value a kernel computes in its normal forward pass (the field value, the loss, whatever the kernel writes); the *adjoint* (or *gradient*) is the derivative of the final scalar output (typically a loss) with respect to that primal value, stored in the `.grad` field next to the primal.

Quadrants implements autodiff at compile time: when `.grad()` is requested, the compiler emits a companion kernel that runs on the same backend as the forward one and writes gradients into the primal fields' `.grad` companions. There is no Python-side tape, no per-op dispatch overhead, and no dependency on an external AD framework. Forward mode and reverse mode are available on every backend Quadrants targets: x64 / arm64 CPU, CUDA, AMDGPU, Metal, and Vulkan. Reverse-mode AD through dynamic loops (described further down) is currently behind an opt-in `ad_stack_experimental_enabled=True` flag.
Quadrants implements autodiff at compile time: when `.grad()` is requested, the compiler emits a companion kernel that runs on the same backend as the forward one and writes gradients into the primal fields' `.grad` companions. There is no Python-side tape, no per-op dispatch overhead, and no dependency on an external AD framework. Forward mode and reverse mode are available on every backend Quadrants targets: x64 / arm64 CPU, CUDA, AMDGPU, Metal, and Vulkan.

**Recommendation.** Reverse-mode AD through dynamic loops (described further down) is currently gated behind an opt-in `ad_stack_experimental_enabled=True` flag at `qd.init`. If you are using autodiff at all, we recommend enabling this flag as it is required for any reverse-mode kernel with a dynamic loop carrying a non-linear primal, and free for every other kernel. See [the cost breakdown](./init_options.md#ad_stack_experimental_enabled) for details.

Three mechanisms are supported:

@@ -291,11 +293,10 @@ The on-device sizer relies on two common hardware features (64-bit integer arith

#### Manual override

`qd.init()` exposes a single escape hatch:

- `ad_stack_size=N` (default `0`, meaning "let the sizer decide"): forces every adstack in the program to exactly `N` slots and bypasses the sizer entirely.
`qd.init()` exposes two escape hatches:

Leave it at `0` in day-to-day use. Setting it to a positive `N` is meant for stress tests or for working around a suspected sizer bug; it defeats the per-launch-exact sizing, so every dispatch allocates the full `N` slots whether the kernel actually needs them or not.
- `ad_stack_size=N` (default `0`): forces every adstack to exactly `N` slots and bypasses the sizer. Leave at `0` in day-to-day use; positive `N` is for stress tests or working around a suspected sizer bug.
- `ad_stack_sparse_threshold_bytes=B` (default `100 MiB`): cutoff below which the gate-passing-count sizing of [Memory footprint](#memory-footprint) is skipped in favour of the eager `dispatched_threads * stride` heap. The sparse path saves memory but pays a per-launch reducer dispatch; below `B` of conservative heap, that overhead outweighs the savings. Set to `0` to always use the sparse path; lower it if the default still skips kernels you want shrunk.

#### Memory footprint

@@ -311,11 +312,13 @@ where each quantity means:

| Quantity | What it is |
| --- | --- |
| `num_threads` | Threads the kernel actually dispatches. On CPU: the thread-pool size, typically tens. On GPU: the full ndrange. |
| `num_threads` | Concurrent thread slots, regardless of logical ndrange. CPU: thread-pool size (~tens). GPU adstack-bearing kernels: capped at 65536 on all backends (131072 on SPIR-V range-for, i.e. `for i in range(N):`), tightened to the actual flat product when the iteration bound is compile-time known. Forward-only kernels keep the full ndrange. |
| `stack_size` | Per-launch capacity resolved by the sizer. Varies between launches - if an ndarray-bounded loop iterates 16 times at one dispatch and 1024 at another, `stack_size` tracks each. |
| `bytes_per_slot` | Depends on `T` and on the backend (see table below). |
| `num_buffers` | Number of adstacks the kernel allocates - one per loop-carried variable plus one per dependent branch flag (see [One adstack per variable](#one-adstack-per-variable)). |
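As a worked example, the worst-case footprint formula above multiplies out directly; the numbers below are illustrative, not drawn from any real kernel:

```python
def adstack_footprint_bytes(num_threads, stack_size, bytes_per_slot, num_buffers):
    """Worst-case adstack heap size, per the formula above."""
    return num_threads * stack_size * bytes_per_slot * num_buffers

# A GPU kernel capped at 65536 concurrent thread slots, a loop the sizer
# resolves to 1024 slots, 16-byte float slots (primal + adjoint), and
# 3 adstacks (two loop-carried variables plus one branch flag):
total = adstack_footprint_bytes(65536, 1024, 16, 3)
print(total // (1024 * 1024))  # → 3072 (MiB)
```

Three GiB from one kernel is exactly the kind of blow-up the gate-passing-count shrinking below is designed to avoid.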

Kernels of the shape `for i in range(...): if field[i] cmp literal: <adstack work>` (a runtime gate directly above the adstack-using body, comparing one field entry to a constant) shrink further: the compiler counts gate-passing iterations at launch time and sizes the float adstack to that count instead of `num_threads * stack_size`. A workload whose gate matches 5% of iterations pays 5% of the float-adstack cost; the float heap grows on demand if a later launch matches more. Integer / boolean adstacks stay at `num_threads * stack_size` - their pushes fire unconditionally for control-flow replay. The shrinking is exact only when the gate's per-axis index is a bare loop variable (`field[i]`, `field[I, J, K]`); see [What can go wrong](#what-can-go-wrong) for a known limitation on `qd.field`-backed gates indexed by compound expressions.
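A minimal sketch of that launch-time shrinking — `float_heap_slots` is an illustrative name, not the engine's real symbol:

```python
def float_heap_slots(gate_values, threshold, stack_size):
    """Size the float adstack to the gate-passing count instead of the
    worst-case num_threads * stack_size, as described above."""
    # The per-launch reducer pre-step: count iterations whose gate passes.
    passing = sum(1 for v in gate_values if v > threshold)
    return passing * stack_size

# 5% of 10_000 iterations pass the gate -> the float heap pays 5%:
gates = [1.0 if i % 20 == 0 else 0.0 for i in range(10_000)]
print(float_heap_slots(gates, 0.5, 64))  # → 32000 (vs. 640000 worst case)
```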

Every adstack slot always stores a *primal* value - the forward-pass value the reverse pass pops to recover the chain-rule step. Floating-point adstacks additionally store an *adjoint* slot where the reverse pass accumulates chain-rule contributions. Integer / boolean adstacks do not need an adjoint slot.

Platform-specific notes:
@@ -351,6 +354,9 @@ A large `ndrange` combined with several loop-carried variables multiplies quickl
- pass `ad_stack_size=N` to `qd.init()` with `N` large enough to cover the real push count (bypasses the sizer).
- **Out-of-memory before the kernel even runs.** A reverse pass through many loop-carried variables at a large ndrange can ask the runtime for more adstack memory than the device can physically back, even when the sizer's number is correct. Surfaces as an allocator OOM at launch time. Remedies are the ones listed under *Avoiding OOM on GPU* above: fewer loop-carried variables, a smaller ndrange, manual checkpointing, or more device-memory headroom.
- **Loop bounds backed by a mutated ndarray.** A reverse-mode kernel with `for i in range(n[j])` requires `n[j]` to hold the same value at the forward call and at `.grad()`. If anything writes to `n[j]` between those two points - the differentiable kernel itself, or any other kernel call - the computed gradient may come out wrong, sometimes as an `Adstack overflow` exception at `qd.sync()`, sometimes silently. The safe rule: populate loop-bound ndarrays before the forward call and leave them untouched until `.grad()` returns. The reason for that is Quadrants' adstack sizer design: it reads the loop bound separately at each dispatch, which includes forward and backward calls. Tape-based eager AD like [PyTorch's autograd](https://pytorch.org/docs/stable/notes/autograd.html) is not affected, since the trip count is recorded as the forward runs and reused at backward time.
- :warning: **Gate on a `qd.field` indexed by an expression that is not a plain loop variable.** A reverse-mode kernel of the shape `for i in range(n): if field[i % K] > eps: <adstack work>` (or any gate whose index is not a plain loop variable - `field[2 * i]`, `field[42]`, `field[other_field[i]]`) may produce silently wrong gradients. Workarounds:
- raise `ad_stack_sparse_threshold_bytes` in `qd.init()` past the kernel's conservative-heap byte size;
- use a `qd.ndarray` for the gating field instead of a `qd.field`.

## Performance characteristics

8 changes: 8 additions & 0 deletions docs/source/user_guide/debug.md
@@ -119,3 +119,11 @@ QD_DUMP_IR=1 QD_OFFLINE_CACHE=0 python my_script.py
```

Compiled kernels will be written to `/tmp/ir` by default. Use `QD_DEBUG_DUMP_PATH=` to redirect to a custom directory.

### Tracing adstack heap allocations

```bash
QD_DEBUG_ADSTACK=1 python my_script.py
```

Prints one line per task per kernel launch describing each adstack heap binding: task name, heap kind (float or int), sizing source (per-task reducer count or dispatched-threads worst case), per-thread stride, and resulting allocation in bytes. Useful for pinning which task drives the peak when an adstack-bearing kernel hits an OOM and the remedies in [Avoiding OOM on GPU](./autodiff.md#avoiding-oom-on-gpu) do not point at an obvious culprit.
21 changes: 21 additions & 0 deletions docs/source/user_guide/init_options.md
@@ -43,6 +43,27 @@ Whether to enable IEEE-relaxed floating-point optimizations (FMA fusion, no NaN

Number of host threads used when compiling kernels. Default `4`. Raise on machines with many idle cores compiling many kernels back-to-back; lower (or set to `1`) on memory-pressure-bound systems where concurrent LLVM compilations thrash.

## Reverse-mode autodiff

See [Autodiff](./autodiff.md) for the reverse-mode pipeline overview.

### `ad_stack_experimental_enabled`

Enables the dynamic-loop reverse-mode pipeline (the *adstack*). Default `False`. Required when a reverse-mode kernel has a runtime-bounded loop carrying a non-linear primal; without it, such kernels either compile-error or produce silently-wrong gradients depending on the loop shape. See [Autodiff with dynamic loops](./autodiff.md#autodiff-with-dynamic-loops) for the rules. Adstack-on is safe even when not strictly needed, but it does come with a few drawbacks:

- **Memory.** The reverse pass replays each iteration of the dynamic loop, so the adstack stores per-iteration intermediate values for every thread. See [Memory footprint](./autodiff.md#memory-footprint) for the exact formula and the knobs that shrink it (`ad_stack_size`, `ad_stack_sparse_threshold_bytes`).
- **Per-launch overhead.** Every backward kernel launch incurs a small fixed CPU-to-GPU data transfer. Kernels whose dynamic loop is gated by a sparse predicate (e.g. `for i in range(n): if active[i] > 0: ...`) additionally run a fast GPU pre-step that counts how many threads pass the gate so that the adstack can be tightly sized instead of upper-bounded by worst case.

*Note.* These drawbacks affect only reverse-mode kernels that actually use the adstack; forward-only kernels and reverse-mode kernels without a dynamic non-linear inner loop pay nothing extra. In other words, enabling adstack globally is effectively free except for kernels that need it anyway!

### `ad_stack_size`

Forces every adstack in the program to exactly `N` slots and bypasses the launch-time sizer. Default `0`, meaning "let the sizer decide" (the recommended setting for day-to-day use). Setting a positive `N` is meant for stress tests or working around a suspected sizer bug; it defeats the per-launch-exact sizing so every dispatch allocates the full `N` slots whether or not the kernel actually needs them. Has no effect when `ad_stack_experimental_enabled=False`.

### `ad_stack_sparse_threshold_bytes`

Cutoff (in bytes) below which the gate-passing-count sizing path described in [Memory footprint](./autodiff.md#memory-footprint) is skipped in favour of the eager `dispatched_threads * stride` heap. Default `100 MiB`. The sparse path saves memory on kernels of the shape `for i in range(...): if field[i] cmp literal: <adstack work>` but pays a per-launch reducer dispatch; below the threshold that overhead outweighs the savings. Set to `0` to always use the sparse path; lower it if the default still skips kernels you want shrunk. No effect when `ad_stack_experimental_enabled=False` or when the kernel has no such gate.
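The decision this knob drives can be sketched as a pure function; the name and shape are illustrative, not the engine's internals:

```python
def use_sparse_sizing(dispatched_threads, stride_bytes, threshold_bytes=100 * 2**20):
    """Sparse gate-counting sizing only pays off once the conservative
    (eager) heap would exceed the threshold; below it, the per-launch
    reducer dispatch costs more than the memory it saves."""
    conservative_heap = dispatched_threads * stride_bytes
    return conservative_heap >= threshold_bytes

# 65536 threads at a 128-byte stride -> 8 MiB conservative heap: eager path.
print(use_sparse_sizing(65536, 128))      # → False
# threshold_bytes=0 forces the sparse path for every adstack-bearing kernel.
print(use_sparse_sizing(65536, 128, 0))   # → True
```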

## Debugging

See [Debug mode](./debug.md) for runnable examples and a typical develop / benchmark workflow.
2 changes: 2 additions & 0 deletions quadrants/analysis/offline_cache_util.cpp
@@ -61,7 +61,9 @@ static std::vector<std::uint8_t> get_offline_cache_key_of_compile_config(const C
serializer(config.saturating_grid_dim);
serializer(config.cpu_max_num_threads);
}
serializer(config.ad_stack_experimental_enabled);
serializer(config.ad_stack_size);
serializer(config.ad_stack_sparse_threshold_bytes);
serializer(config.random_seed);
serializer(config.make_mesh_block_local);
serializer(config.optimize_mesh_reordered_mapping);