Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 86fb79c378
…wn-link-check stops failing
… the AllocaStmt site so they dominate every sibling push / load-top / acc-adjoint block
…lag, symmetric d2h grad gate, IRBuilder::cast skips OpBitcast(T,T), compute_dense_snode_strides refuses multi-child parents, kMaxNodes host-side guard, restore mismatched-lengths fusion regression, scrub remaining external-project references
…me/gfx/adstack_sizer_launch.cpp so runtime.cpp stays focused on the main-kernel record/submit flow
…reter switch as a no-op so -Wswitch stays silent; the LLVM encoder always host-folds FieldLoad leaves, so this arm is unreachable but keeps the enum coverage exhaustive
…iness and widen Supported loop shapes table from the test suite
…k bytecode offsets up to 256B for VUID-02999-safe storage-buffer descriptor bindings, bump kMaxVars to kAdStackSizeExprDeviceMaxBoundVars (32) and expose kAdStackSizerMaxPendingFrames so the host encoder hard-errors before the shader's fixed-size pending-frame stack can OOB, clamp pending_end_arr to begin+(1<<24) so MaxOverRange pathological ranges silently break matching the LLVM interpreter and host evaluator
…AD | WRITE) so a kernel that only non-atomic-writes .grad does not leave allocator garbage on device for d2h to blit back, and update kExternalTensorRead comment in adstack_size_expr_device.h to match the (idx, stride) pair layout the encoder actually emits
… alongside the SPIR-V-scope tests: parametrize the mismatched-lengths test with a peak-past-shape-b case that exercises the cross-ndarray independent-loops fusion path, and add a fld[i]-arr[i] test that guards against the fusion synthesizing a mixed FieldLoad+ExternalTensorRead body the LLVM-GPU encoder cannot lower
…ack sizer dispatch via track_physical_buffer, mirroring what the main-kernel dispatch already does at runtime.cpp's pre-dispatch block; without the useResource: hint the Apple7 GPU family (M1) returns zero for the sizer shader's PSB load of the ndarray pointer through the kernel arg buffer and every MaxOverRange-over-ExternalTensorRead collapses to zero, tripping an Adstack overflow at the next qd.sync(); non-Metal backends no-op the base track_physical_buffer and pay nothing
> **Problem.** Reverse-mode AD through a dynamic loop (one whose trip count is not known at compile time) needs to recover the primal value at each iteration when walking the loop backwards. Without that, the chain-rule steps read a stale value and the gradients come out silently wrong. Static-unrolled (`qd.static(range(...))`) loops are not affected because every iteration becomes its own inlined block at compile time.
> **How Quadrants does it.** Quadrants provides a dedicated compiler pipeline for this, called the *adstack* (short for "(a)uto(d)iff (stack)"). It allocates a per-variable stack alongside each primal that is updated inside the loop. The forward pass pushes one entry per iteration. The reverse pass walks the stack from top down: at each reverse iteration it reads the current top entry as many times as that iteration's chain-rule contributions need (one `read` per downstream use of the primal), then pops the entry once and steps to the iteration underneath. Peeking-and-reusing rather than popping per use matters because the same primal often feeds several chain-rule terms in one iteration - e.g. a `v` that appears in both a `sin(v)` and a subsequent `v * w` needs to be visible to both adjoint terms before it is discarded. Enabling adstack costs extra per-thread memory and compile time, but some kernels need it.
"from top down: at each reverse " => "from top down. At each reverse "
"(one read per downstream use of the primal)"
=> move to a new sentence after the current one, (so we don't need to pusht he current sentence onto the stack whilst reading, then read the parentheses, then pop the ucrrent setnecne back off the stack in our mind, which makes things harder to read I feel)
"Peeking-and-reusing rather than popping per use matters because the same primal often feeds several chain-rule terms in one iteration - e.g. a v that appears in both a sin(v) and a subsequent v * w needs to be visible to both adjoint terms before it is discarded. " => This seems like an implementation detail, not needed by the end-user? => "under the hood" style section
"The flag is compile-time, so it must be set before the offending kernel compiles. " => this seems superfluous? Remove, or move to 'under the hood' (you are justifying your impelmetnation design I feel, rather than telling the user what they need to know in order to use the feature).
> **Workflow.** Enable the pipeline at init time and keep using the normal reverse-mode workflow: `qd.init(..., ad_stack_experimental_enabled=True)`. The flag is compile-time, so it must be set before the offending kernel compiles. Any mainstream modern GPU works on the SPIR-V side (Apple Silicon for Metal, any recent discrete GPU for Vulkan); on the odd legacy device that is missing the hardware features the sizer relies on, the launch errors out with a clear message and you can fall back to the LLVM runtime (CPU / CUDA / AMDGPU).
"Any mainstream modern GPU works on the SPIR-V side (Apple Silicon for Metal, any recent discrete GPU for Vulkan); on the odd legacy device that is missing the hardware features the sizer relies on, the launch errors out with a clear message and you can fall back to the LLVM runtime (CPU / CUDA / AMDGPU)." => this seems like details best left for later? (maybe some 'compatibility' section or similar?)
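A minimal end-to-end sketch of the workflow quoted above (hedged: the `ad_stack_experimental_enabled` flag name comes from the doc; the `quadrants` import alias, field constructors, and kernel body are illustrative assumptions, not a confirmed API surface):

```python
# Minimal sketch of the workflow above; field constructors and kernel body
# are illustrative assumptions, only the init flag comes from the doc.
import quadrants as qd

qd.init(ad_stack_experimental_enabled=True)  # must run before the kernel compiles

x = qd.field(dtype=qd.f32, shape=(), needs_grad=True)
y = qd.field(dtype=qd.f32, shape=(), needs_grad=True)
n = qd.field(dtype=qd.i32, shape=())

@qd.kernel
def compute():
    v = x[None]
    for _ in range(n[None]):      # dynamic trip count: needs the adstack
        v = qd.sin(v)
    y[None] = v

n[None] = 10
x[None] = 0.5
compute()
y.grad[None] = 1.0
compute.grad()                    # normal reverse-mode workflow, unchanged
print(x.grad[None])
```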
> **Note.** Running with adstack enabled when it is not strictly needed is safe, but not the other way around. Running without it when it is needed raises a `QuadrantsCompilationError` in most cases: the autodiff pass rejects a non-static range that would otherwise lose its primal. A few edge-case loop shapes still slip past that rejection and produce silently-wrong gradients; these are tracked and fixed in the autodiff pass as they surface, so if you see wrong-but-non-zero gradients through a dynamic loop with adstack disabled, turn it on and rerun as a sanity check.
"A few edge-case loop shapes still slip past that rejection and produce silently-wrong gradients; these are tracked and fixed in the autodiff pass as they surface" => does this mean the gradients will in fact be correct? or will be silently wrong?
"if you see wrong-but-non-zero gradients through a dynamic loop " => how to detect this?
> how to detect this
There is no way to detect this I'm afraid. Just running some external finite-difference check, or precisely that: turning it on and rerunning as a sanity check. We can just say: "Checking that the value of the gradient is unaffected when turning adstack on is the best way to determine whether adstack is necessary."
"Checking that the value of the gradient is unaffected when turning adstack on is the best way to determine whether adstack is necessary." Yup, if that's what we have, let's run with that. Maybe loosen "the best way" to "is a reasonable way"
> Reverse-mode AD walks the forward kernel in reverse and applies the chain rule at every op. The chain-rule factor at each op is that op's derivative with respect to its input. For *non-linear* ops (`sin`, `cos`, `exp`, `sqrt`, `tanh`, `pow`, ...) that derivative depends on the input's primal, so the reverse pass needs the primal value that was there on the forward pass. For *linear* ops (addition, subtraction, multiplication by a constant) the derivative is itself a constant and no primal is needed. In a dynamic loop the forward pass writes a different primal at each iteration, so the reverse pass cannot simply re-read the latest value - it needs one per iteration. adstack provides exactly that: a per-iteration stash of the primal.
Nice and clear to me 🙌
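A plain-Python illustration of the quoted paragraph (not the Quadrants API): reversing a dynamic loop with only the final, stale primal gives a wrong gradient, while a per-iteration stash - what the adstack stores - gives the right one.

```python
# Illustration of the paragraph above: reversing a dynamic loop with only
# the final (stale) primal is wrong; a per-iteration stash is right.
import math

def grad_with_stash(x, n):
    primals, v = [], x
    for _ in range(n):
        primals.append(v)        # forward pass stashes each primal
        v = math.sin(v)          # non-linear op: derivative needs its input
    dv = 1.0
    for p in reversed(primals):
        dv *= math.cos(p)        # correct chain-rule factor per iteration
    return dv

def grad_stale(x, n):
    v = x
    for _ in range(n):
        v = math.sin(v)
    return math.cos(v) ** n      # re-reads only the latest value: wrong

print(grad_with_stash(0.7, 4), grad_stale(0.7, 4))  # the two values differ
```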
> ### Supported loop shapes
>
> *You do not need to read this section to use reverse-mode AD. If a kernel exceeds its adstack capacity, Quadrants raises a Python exception at the next `qd.sync()` whose message recommends bumping `default_ad_stack_size` - that is usually enough. Read on only if you hit that overflow, want to understand why, or want to cap the memory footprint explicitly.*
>
> The compiler recognises a set of bound shapes:
I think we should just delete this section, or move it to some highly advanced section.
Users will just expect stuff to work I feel, and anything else is a bug.
What's an example of something that does NOT work currently?
> What's an example of something that does NOT work currently?
Actually I would prefer to keep it. We only support a small subset of all the craziness a user may want to do. Here are examples that are not supported:
```python
@qd.kernel
def my_kernel(a):
    for i in range(sqrt(a)):  # any non-linear transformation, plus min/max
        [...]

@qd.kernel
def my_kernel(a, b):
    for i in range(a):
        for j in range(b):
            for k in range(a[i], b[j]):  # non-matching running variables + indices
                [...]

@qd.kernel
def my_kernel(a):
    for i in range(a.shape[0]):
        while a[i] < 10:  # bound that cannot be determined without running the kernel
            a[i] = a[i] + 1
```
Alright, so move to a reference section. And state something like "Note that whilst we support many common loop constructs, many constructs which are allowed in the absence of adstack are not currently handled. See Appendix A for supported loop constructs."
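For contrast with the unsupported shapes listed above, here is a sketch of bounds the thread describes the sizer grammar as covering (hedged: field constructors and names are illustrative, only the loop-bound shapes come from the discussion):

```python
# Bound shapes described elsewhere in this thread as in-grammar (sketch;
# field constructors and names are illustrative).
import quadrants as qd

qd.init(ad_stack_experimental_enabled=True)
n = qd.field(dtype=qd.i32, shape=())
x = qd.field(dtype=qd.f32, shape=(64,), needs_grad=True)

@qd.kernel
def supported():
    for i in range(64):               # static bound: no adstack needed
        for _ in range(n[None]):      # scalar-field-load bound: in-grammar
            x[i] = qd.sin(x[i])
```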
> **Tuning.** Two `qd.init()` knobs control adstack sizing, both measured in slots per adstack (not bytes):
>
> | Bound shape | Example |
this table seems like something to go in a reference section.
"The compiler recognises a set of bound shapes (see 'appendix 123' for authorative list). Any loop shape outside this set is rejected at compile time. The error names the offending source line; typical fixes are to restructure the loop into one of the shapes in the appendix, or to file a bug. "
(but see above comment that we should just delete this seciton altogether, or move the entire section to highly advanced seciton)
Where would you like to put this appendix? Another markdown page?
There was a problem hiding this comment.
At the end of the same markdown page.
> ### Under the hood (advanced)
>
> **Sizing rule.** A `K`-iteration dynamic loop consumes `K + 2` slots in each of its adstacks: one slot per forward iteration, plus two setup slots (one for the initial adjoint, one for the primal's starting value). `default_ad_stack_size` is a per-stack slot count, so size it at the worst-case trip count of the deepest unprovable dynamic loop in the program plus 2.
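A worked footprint example from the `K + 2` rule quoted above (hedged: slot width and thread count are illustrative assumptions, not fixed by the doc):

```python
# Worked example of the "K + 2 slots" rule quoted above. Slot width and
# thread count are illustrative assumptions.
worst_trip_count = 1000                 # deepest unprovable dynamic loop
slots = worst_trip_count + 2            # one per iteration + two setup slots
bytes_per_slot = 8                      # e.g. f32 primal + f32 adjoint
num_threads = 1 << 20                   # parallel-for grid
per_stack_bytes = slots * bytes_per_slot
print(per_stack_bytes * num_threads / 2**30, "GiB per adstack across threads")
```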
> *This section is a glimpse of how the adstack is sized and laid out in memory. You never need to read it to use reverse-mode AD - skip it unless you are curious, debugging an overflow or OOM, or planning to extend the list of supported loop shapes.*
"You never need to read it to use reverse-mode AD " => "never" seems a very strong claim. I'm pretty skeptical of this claim given the constraints on memory, loop structure etc.
I prefer the earlier phrasing. I don't remember what that was.
Oh here we go:
"You do not need to read this section to use reverse-mode AD. Skip past it unless you hit an overflow error on SPIR-V, an out-of-memory error on GPU, or a compile error pointing at an adstack alloca."
so:
- remove the first sentence, superfluous
- remove 'never'
- break second sentence into two
- basically just use the original paragraph I pasted above :)
> | `num_buffers` | Number of adstacks the kernel allocates - one per loop-carried variable plus one per dependent branch flag (see [One adstack per variable](#one-adstack-per-variable)). |
>
> On SPIR-V backends (Metal / Vulkan) the slot layout is trimmed: adstacks whose `T` is an integer type (`i32`, `i64`, ...) only store the primal because the reverse pass does not accumulate integer adjoints, and per-thread on-chip memory is more constrained than on LLVM. So `bytes_per_slot = sizeof(T)` for integer `T` and `bytes_per_slot = 2 * sizeof(T)` for floating-point `T`. SPIR-V has no defined layout for `OpTypeBool`, so booleans are widened to i32 at storage time:
>
> Every adstack slot always stores a *primal* value - the forward-pass value the reverse pass pops to recover the chain-rule step. Floating-point adstacks additionally store an *adjoint* slot where the reverse pass accumulates chain-rule contributions. Integer / boolean adstacks do not need an adjoint slot, but LLVM backends still carry one for codegen uniformity. SPIR-V backends trim it. SPIR-V also widens `bool` to `i32` at storage time, because SPIR-V has no defined layout for `OpTypeBool`.
"but LLVM backends still carry one for codegen uniformity. SPIR-V backends trim it. SPIR-V also widens bool to i32 at storage time, because SPIR-V has no defined layout for OpTypeBool." => I would break this platform-specific stuff into a separate paragraph somehow.
"Platform-specific notes:
- even though integer/boolean adstacks do not need an adjoint slot, LLVM backends still carry one for codegen uniformity
- SPIR-V stores bools using 4 bytes (32 bits)"
"Therefore some examples of the size of adstack slots on each platform are as follows:"
> - Reverse-mode AD does not propagate gradients through integer casts or non-real operations. No error is raised; the gradient simply stops at the cast and silently reads as zero upstream. Cast to `qd.f32` / `qd.f64` before the differentiable section.
> - Backward passes on non-trivial kernels run noticeably slower than the corresponding forward pass, sometimes by an order of magnitude on SPIR-V.
> - **Loop bounds read from a writable ndarray are unsafe.** If a reverse-mode kernel has a loop whose iteration count comes from `n[j]`, the sizer snapshots `n[j]` at backward-dispatch entry. Any scenario that makes that snapshot diverge from the value the kernel body ends up executing against drives the adstack undersized and the gradient silently comes out zero or wrong. Two patterns to avoid: (a) the same kernel writes `n[j]` before the loop reads it (`for i in range(n[j])` after `n[j] = something`); (b) across kernels, where `kernel_A` writes `n` and `kernel_B.grad()` reads it as a loop bound - the per-launch numpy-ndarray upload can overwrite the device copy with host-side state that differs from what `kernel_A` left behind, and for `qd.ndarray` a missing host-side barrier between the two launches has the same effect. Quadrants does not detect either pattern. Workaround: treat ndarrays the reverse-mode kernel reads as loop bounds as read-only within and across the relevant kernel chain - compute iteration counts once, store them in a dedicated scalar field or ndarray that is never written during the backward path, and iterate over that.
> - **When a dynamic-loop reverse kernel fails.** In normal use, enabling the adstack pipeline and running a reverse-mode kernel through a dynamic loop should just work. Two rare situations can make it fail, and it is worth knowing which one you are in:
"In normal use, enabling the adstack pipeline and running a reverse-mode kernel through a dynamic loop should just work. Two rare situations can make it fail, and it is worth knowing which one you are in:" => this is superfluous I feel.
Are these really limitations? It seems like saying "a limitation of numpy tensors is that they cannot be larger than available physical memory."? 🤔
These should probably go in some "Gotchas" or "What can go wrong" or similar kind of section. I'm not sure they are limitations?
Well, the thing is that it can eat memory pretty fast, and what is being allocated and why is not immediately obvious. When you allocate a gigantic numpy array you will notice it, here it is hidden.
I would argue that a 'limitation' is some issue with our implementation, rather than a fundamental characteristic of the underlying maths etc.
As suggested, let's have a section on things that can go wrong, that users can consult.
Analogy:
doc 1:
"Limitations:
- this toaster does not work when it is not plugged into the power"
doc 2:
"Why doesn't my toaster work?
- check: is it plugged into the power?"
> - split the inner section into its own `@qd.kernel`;
> - pass `ad_stack_size=N` to `qd.init()` with `N` large enough to cover the real push count (bypasses the sizer).
> - *The kernel is legitimately too deep for the hardware* - surfaces as an out-of-memory error from the allocator, before the kernel even starts running. A reverse pass through many loop-carried variables at a large ndrange can ask the runtime for more adstack memory than the device can physically back, even when the sizer's number is correct. Remedies are the ones listed under *Avoiding OOM on GPU* above: fewer loop-carried variables, a smaller ndrange, manual checkpointing, or more device-memory headroom.
> - **Compile time scales with loop nesting.** The adstack pipeline trades compile time for generality. Kernels with many loop-carried variables, nested dynamic loops, or large inner-loop bodies produce visibly slow compile times - seconds stretching into minutes, and on SPIR-V backends sometimes into the territory where the driver's shader compiler gives up. Budget compile time accordingly when migrating existing reverse-mode AD workloads.
is this a limitation? Seems more like some kind of performance characteristic?
Where do you want to mention this kind of information if not in limitations?
"Performance characteristics"?
"Performance"?
> - **Backward passes are slower than forward passes.** In particular, SPIR-V backward passes can be an order of magnitude slower than the forward pass.
this isn't really a limitation I feel. In general, in the convnet world at least, backward passes are mathematically twice as long as forward passes. That's not really a limitation. It just comes out of the maths.
Maybe just limit to "SPIR-V backward passes can be an order of magnitude slower than the forward pass" (maybe say why?)
> - **Integer casts stop gradients.** Reverse-mode AD does not propagate gradients through integer casts or non-real operations: integers have no meaningful derivative, so the gradient reads as zero upstream of the cast - no error, just silently-zero gradients. Rules of thumb:
is this really a limitation? This is just a characteristic of differentiation I think? You cannot pass a gradient through a discretization step. (at least, not without using approximations, such as Gumbel-softmax etc).
> - keep differentiable variables in `qd.f32` / `qd.f64` through the full forward chain;
> - casting *to* a float is safe - the downstream float section remains differentiable, and the cast itself contributes a unit factor to the chain rule;
> - casting *back to* an integer stops the gradient at that point, so only do it at integer-indexing sites, after any arithmetic whose gradient you need.
> - **Loop bounds that read an ndarray written to by the same kernel are unsafe.** If a reverse-mode kernel has a loop whose iteration count comes from `n[j]` and the same kernel also writes to `n[j]` before that loop runs (`for i in range(n[j])` after `n[j] = something`), the computed gradient may silently come out zero or wrong. Quadrants does not currently detect this. Workaround: put iteration counts in a separate ndarray (or scalar field) the kernel only reads, or split the write and the differentiable loop into two `@qd.kernel` calls. The same caveat applies across kernels - a kernel that writes an ndarray read as a loop bound by a later `.grad()` call needs the host-side contents of that ndarray to already hold the value the backward kernel will execute against.
it's not clear to me from skim-reading whether this is a limitation of our implementation, or just a fundamental characteristic that falls out of the maths. I'm guessing the former. Can you clarify in reply to this comment please.
I don't know exactly. What I know for sure is that I found it very unexpected when I discovered this limitation and that it is not something uncommon. We need this in Genesis typically. I would be surprised if such a limitation exists in Torch, otherwise more people would compute completely wrong gradients x) The biggest issue is that it fails silently.
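A sketch of the hazard discussed in this thread next to the workaround the doc recommends - keep reverse-mode loop bounds in storage that is never written on the backward path (hedged: field constructors and names are illustrative):

```python
# Sketch of the mutated-loop-bound hazard and the doc's workaround.
# Field constructors and names are illustrative.
import quadrants as qd

qd.init(ad_stack_experimental_enabled=True)
n = qd.field(dtype=qd.i32, shape=(4,))
trip = qd.field(dtype=qd.i32, shape=(4,))   # read-only iteration counts
x = qd.field(dtype=qd.f32, shape=(4,), needs_grad=True)

@qd.kernel
def unsafe(j: qd.i32):
    n[j] = n[j] + 1                  # write...
    for _ in range(n[j]):            # ...then read as a bound: the sizer
        x[j] = qd.sin(x[j])          #    snapshot can diverge, gradient wrong

@qd.kernel
def safe(j: qd.i32):
    for _ in range(trip[j]):         # bound lives in a never-written field
        x[j] = qd.sin(x[j])
```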
…section: split "Known limitations" into "What can go wrong" + "Performance characteristics", move the supported-loop-shapes table to Appendix A with a scoped title, relocate "Integer casts stop gradients" to the main reverse-mode section as a property of differentiation, rewrite the mutated-bound bullet against the verified sizer-reads-twice mechanism with a Torch contrast, strip OpTypeBool/alloca/compile-time-synthesized jargon, anchor SPIR-V GPU compatibility to late-2018 hardware; add test_adstack_sizer_trip_count_qd_ndarray_mutated_by_separate_kernel as a strict xfail pinning the cross-kernel pattern verified failing on cpu/metal/vulkan
```cpp
int32_t compute_max_mor_nesting(const SerializedSizeExpr &expr) {
  std::vector<int32_t> depth(expr.nodes.size(), -1);
  std::function<int32_t(int32_t)> visit = [&](int32_t i) -> int32_t {
    if (i < 0 || static_cast<std::size_t>(i) >= expr.nodes.size())
      return 0;
    if (depth[i] >= 0)
      return depth[i];
    const auto &n = expr.nodes[i];
    int32_t child_max = 0;
    for (int32_t c : {n.operand_a, n.operand_b, n.body_node_idx}) {
      if (c >= 0)
        child_max = std::max(child_max, visit(c));
    }
    int32_t self = static_cast<SizeExpr::Kind>(n.kind) == SizeExpr::Kind::MaxOverRange ? 1 : 0;
    depth[i] = self + child_max;
    return depth[i];
  };
  int32_t max_depth = 0;
  for (std::size_t i = 0; i < expr.nodes.size(); ++i) {
    max_depth = std::max(max_depth, visit(static_cast<int32_t>(i)));
  }
  return max_depth;
}
```
🟡 compute_max_mor_nesting at adstack_size_expr_eval.cpp:330-352 over-counts MOR depth by walking operand_a and operand_b as if they accumulate alongside body_node_idx, but in case_mor of the SPIR-V sizer shader (adstack_sizer_shader.cpp:706-767) begin and end subtrees evaluate fully BEFORE the outer MOR pushes its body frame, so they only contribute transient depth. The correct recurrence is max(visit(operand_a), visit(operand_b), self + visit(body_node_idx)); the current self + max(...) is sound but causes spurious QD_ERROR rejections at the kAdStackSizerMaxPendingFrames=16 cap for kernels whose true peak fits but whose begin/end MOR chains push the apparent depth over. Nit severity: over-conservative bound, not a memory-safety issue.
Extended reasoning...
What the bug is
compute_max_mor_nesting at quadrants/program/adstack_size_expr_eval.cpp:330-352 walks the post-order tree and, for any MaxOverRange node, returns:
```cpp
int32_t self = static_cast<SizeExpr::Kind>(n.kind) == SizeExpr::Kind::MaxOverRange ? 1 : 0;
depth[i] = self + child_max;  // child_max = max(visit(op_a), visit(op_b), visit(body))
```

i.e. `self + max(visit(operand_a), visit(operand_b), visit(body_node_idx))`. The encoder hard-errors at `encode_bytecode_common` when this depth exceeds `spirv::kAdStackSizerMaxPendingFrames = 16` (adstack_size_expr_eval.cpp:711-716).
Why the recurrence does not match the runtime
The SPIR-V sizer shader processes nodes iteratively in post-order. In case_mor (quadrants/codegen/spirv/adstack_sizer_shader.cpp:706-767):
- Lines 709-710 read `begin_i64 = load_values_at(op_a_i32)` and `end_i64 = load_values_at(op_b_i32)` -- already-cached values from `values_arr[]`. Because the bytecode is post-order, the `op_a` and `op_b` subtrees were fully processed BEFORE control reached this MOR node: any nested `MaxOverRange` inside them executed its own `case_mor` (push), walked its body, and hit `done_lbl` (pop), restoring `sp` to whatever it was before that subtree began.
- Lines 743-756 push exactly ONE pending frame for this MOR (the parent), then redirect `current` to `body_start`. The body chain runs under that pushed frame.
So during evaluation of operand_a / operand_b, the outer MOR is NOT yet on the pending stack. Only the body chain runs with the outer frame pushed. The actual peak pending-frame depth is:
peak(MOR) = max(peak(operand_a), peak(operand_b), 1 + peak(body))
NOT 1 + max(peak(operand_a), peak(operand_b), peak(body)) as the helper computes.
Step-by-step counterexample
Take `MOR_outer(operand_a = MOR(MOR(MOR(plain))), operand_b = plain, body = plain)`:
- Linear walk processes the 3-deep nested MOR chain on the `operand_a` side first. Each nested MOR runs `case_mor`: push frame (sp += 1), walk body, hit `done_lbl` (pop, sp -= 1). The chain reaches a transient peak of `sp = 3` while inside the innermost body, then unwinds back to `sp = 0`.
- After `op_a` finishes, `op_b` (a plain leaf) writes to `values[]` without touching the pending stack. `sp` stays at 0.
- Control reaches `MOR_outer`'s `case_mor`. It pushes one frame (sp 0->1) and walks its body (a plain leaf), then pops (sp 1->0).
True peak across the whole evaluation = 3 (attained inside the begin chain, with MOR_outer not yet on the stack).
compute_max_mor_nesting(MOR_outer) returns: 1 + max(visit(begin), visit(end), visit(body)) = 1 + max(3, 0, 0) = 4.
The over-count is exactly +1 per outer MOR whose max(begin_depth, end_depth) > 1 + body_depth. With kAdStackSizerMaxPendingFrames = 16, a kernel whose true peak depth is 16 but whose begin/end MOR chains accumulate to 16+ via this recurrence is rejected at host encoding time even though it would run correctly.
Why existing code does not catch it
The cap check is the only consumer of compute_max_mor_nesting. It treats the helper's output as a worst-case at-runtime depth and rejects trees above 16. There is no runtime cross-check against the actual peak sp in the shader, so the discrepancy surfaces only as a false-positive compile error when the helper's result crosses 16 but the real peak does not. Since the inequality is always helper >= true_peak, the cap of 16 still safely prevents OOB on the shader-side pending_*_arr[16].
Impact
Severity is nit because the over-count is sound -- it only ever rejects kernels that would actually run correctly. Realistic reverse-mode kernels have shallow MOR nesting (verifiers note <=4 distinct bound vars, ~10 tree depth observed); the trigger requires deep MOR chains (>1 nesting) inside begin/end of an outer MOR, which is uncommon today but can compound through nested wrapping in larger kernels. No memory-safety hazard, no UB, no under-counting. The only user-visible effect is a spurious MaxOverRange nesting depth ... exceeds the sizer shader's ... compile error on a kernel that would otherwise fit.
Fix
One-line change at line 339-344:

```cpp
const auto &n = expr.nodes[i];
int32_t a_depth = (n.operand_a >= 0) ? visit(n.operand_a) : 0;
int32_t b_depth = (n.operand_b >= 0) ? visit(n.operand_b) : 0;
int32_t body_depth = (n.body_node_idx >= 0) ? visit(n.body_node_idx) : 0;
int32_t self = static_cast<SizeExpr::Kind>(n.kind) == SizeExpr::Kind::MaxOverRange ? 1 : 0;
depth[i] = std::max({a_depth, b_depth, self + body_depth});
```

This matches the runtime peak exactly: begin and end subtrees contribute their own peaks (taken individually since they are sequential, not stacked), and the body contributes the outer MOR's pushed frame plus the body chain's own depth. The encoder's hard-error guard then fires only when the actual runtime peak would exceed 16.
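A quick plain-Python check of the proposed recurrence on the counterexample above (hedged: this models the review's description of the pending-frame stack, not the actual shader; node encoding is made up for the sketch):

```python
# Model of the proposed peak recurrence on the review's counterexample.
# node = (kind, operand_a, operand_b, body) with kind "mor" or "leaf".
def true_peak(node):
    kind, a, b, body = node
    sp = 0
    # begin/end subtrees evaluate fully before the outer MOR pushes a frame
    for child in (a, b):
        if child is not None:
            sp = max(sp, true_peak(child))
    if kind == "mor":
        sp = max(sp, 1 + (true_peak(body) if body else 0))
    return sp

leaf = ("leaf", None, None, None)
mor = lambda a=None, b=None, body=None: ("mor", a, b, body)
deep = mor(body=mor(body=mor(body=leaf)))   # 3-deep chain in operand_a
outer = mor(a=deep, b=leaf, body=leaf)
print(true_peak(outer))   # 3, where self + max(...) would report 4
```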
```cpp
if (autodiff_mode == AutodiffMode::kReverse && ad_use_stack) {
  irpass::determine_ad_stack_size(ir, config);
  print("Autodiff stack size determined");
}
```
🔴 scalarize drops AdStackAllocaStmt::size_expr (and stack_id) when lowering tensor-typed adstacks to scalar ones. Because this PR moves determine_ad_stack_size from offload_to_executable into compile_to_offloads (lines 160-163), the pre-pass now runs BEFORE scalarize and populates size_expr/max_size=1 on the tensor adstack; scalarize.cpp:719 then creates fresh scalar AdStackAllocaStmts via make_unique(element_type, stmt->max_size) which only copies max_size, leaving size_expr=nullptr on every scalar child. Any reverse-mode kernel with a Vector/Matrix loop-carried variable inside a field- or ndarray-bounded dynamic loop overflows on the first push and surfaces as 'Adstack overflow' at qd.sync(). Fix is one line in scalarize.cpp:719+: copy stmt->size_expr (a shared_ptr; safe to share since SizeExpr is documented immutable post-pre-pass) into each scalar child.
Extended reasoning...
What the bug is
The PR moves determine_ad_stack_size from offload_to_executable (after scalarize) to the end of compile_to_offloads at lines 160-163 (before scalarize). auto_diff.cpp:752 creates tensor-typed AdStackAllocaStmts for Vector/Matrix loop-carried variables (dtype = alloc->ret_type.ptr_removed()). The new pre-pass position means determine_ad_stack_size now sees these tensor-typed adstacks and, for non-const symbolic bounds, populates size_expr and seeds max_size = 1 as a placeholder (determine_ad_stack_size.cpp:1342-1344).
Then scalarize.cpp:712-728 runs (gated on config.real_matrix_scalarize, default true) and creates fresh scalar adstacks:
```cpp
auto scalar_ad_stack = std::make_unique<AdStackAllocaStmt>(element_type, stmt->max_size);
```

The constructor at statements.h:1611 only initializes dt and max_size; size_expr (line 1606) and stack_id (line 1609) keep their default values of nullptr and -1. Each scalar child carries max_size=1 and size_expr=nullptr, then the original tensor adstack is erased (line 727).
Step-by-step proof
Concrete trigger:
```python
@qd.kernel
def k():
    for i in x:
        v = qd.Vector([0.0, 0.0, 0.0])
        for _ in range(n[None]):  # field-bounded dynamic loop
            v = qd.sin(v) + 0.1
        y[i] = v
```

- `auto_diff.cpp:752` makes `AdStackAllocaStmt(TensorType<3 x f32>, 0)` for `v`.
- Bellman-Ford leaves `max_size=0` (positive loop). `determine_ad_stack_size` (now at compile_to_offloads.cpp:161) walks pushes, builds `size_expr = MaxOverRange(... FieldLoad(n))`, sets the `max_size=1` placeholder per the comment at determine_ad_stack_size.cpp:1338-1340.
- `scalarize` runs in `offload_to_executable` at line 302. `visit(AdStackAllocaStmt)` creates 3 scalar adstacks with `make_unique<AdStackAllocaStmt>(element_type, 1)` - `size_expr=nullptr` on each.
- Codegen pre-scan in `init_offloaded_task_function` (codegen_llvm.cpp:1769-1782) assigns `stack_id` and pushes `alloca->size_expr ? alloca->size_expr->serialize() : SerializedSizeExpr{}` - empty for every scalar child. `AdStackAllocaInfo::max_size_compile_time = 1`.
- The codegen assertion `stmt->max_size > 0 || stmt->size_expr` passes (1 > 0). Stack id assigned, heap stride sized to 1 slot.
- Runtime `publish_adstack_metadata` (llvm_runtime_executor.cpp:671) sees `size_exprs[i].nodes.empty()`, falls back to `max_size_compile_time=1`. Per-thread heap holds 1 slot per scalar adstack.
- First push to `v`'s scalar adstack overflows the per-thread slice. Surfaces as `Adstack overflow` at the next `qd.sync()`.
Pre-PR, determine_ad_stack_size ran AFTER scalarize (in offload_to_executable, called at line 287 of the pre-PR file), so scalar adstacks created by scalarize had their max_size resolved by the analyzer directly. The new ordering inverts this: the symbolic bound is computed on the soon-to-be-deleted tensor adstack, then dropped on the floor.
Existing tensor-adstack tests in tests/python/test_adstack.py do not catch this: every Vector/Matrix reverse-mode test either uses ad_stack_size=N (the global override that bakes max_size=N into the alloca and bypasses the pre-pass entirely — the gather predicate at determine_ad_stack_size.cpp filters max_size==0) or uses a static iteration count (where Bellman-Ford resolves max_size directly to a constant > 1, which scalarize correctly copies).
Impact
Silent under-sizing of the per-thread adstack heap on a realistic kernel shape (Vector/Matrix loop-carried variables with field/ndarray-bounded dynamic loops). Surfaces as a runtime overflow at qd.sync() on every backend, with the misleading message that points at restructuring the loop or extending the grammar even though the structural pre-pass DID resolve it correctly. The grammar gap message is wrong for this case — the bound is fine, the propagation is broken.
Fix
One-line change in scalarize.cpp:719+:
```cpp
auto scalar_ad_stack = std::make_unique<AdStackAllocaStmt>(element_type, stmt->max_size);
if (stmt->size_expr) {
  scalar_ad_stack->size_expr = stmt->size_expr;
}
scalar_ad_stack->ret_type = element_type;
scalar_ad_stack->ret_type.set_is_pointer(true);
```

SizeExpr is documented as immutable post-pre-pass (PR description's Side-effect audit row on Stmt clone / serialization), so the shared_ptr copy is cheap and safe to share across scalarized children. The original tensor adstack is being erased anyway, so the parent's shared_ptr ownership transfers naturally.
An alternative would be to re-run determine_ad_stack_size after scalarize (the deleted call site at offload_to_executable.cpp:288), but option (a) is smaller and matches the PR's stated immutability invariant.
Nice doc! Thanks! Checklist:
=> ok to merge
Runtime-evaluated adstack sizing on LLVM backends
TL;DR
Before this PR, the inner `range(n[None])` forced the CFG analyzer into its `default_ad_stack_size=256` fallback, so `n[None] = 512` overflowed at the next `qd.sync()`. After this PR the compiler captures the loop bound as a `SizeExpr`, the launcher evaluates it against `n[None]` at launch time, and the backing heap is grown to exactly `512 + 2` slots.
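The TL;DR (and "Kernels like the one above" in the next section) refers to a kernel whose code block did not survive extraction. A plausible reconstruction (hedged: names, field constructors, and the kernel body are illustrative; only `range(n[None])`, `default_ad_stack_size=256`, and `n[None] = 512` come from the text):

```python
# Reconstructed sketch of the kernel the TL;DR references (illustrative).
import quadrants as qd

qd.init(default_ad_stack_size=256, ad_stack_experimental_enabled=True)
n = qd.field(dtype=qd.i32, shape=())
x = qd.field(dtype=qd.f32, shape=(), needs_grad=True)
y = qd.field(dtype=qd.f32, shape=(), needs_grad=True)

@qd.kernel
def compute():
    v = x[None]
    for _ in range(n[None]):   # the inner range(n[None]) from the TL;DR
        v = qd.sin(v)
    y[None] = v

n[None] = 512                  # pre-PR: overflows the 256-slot fallback
compute()
y.grad[None] = 1.0
compute.grad()                 # post-PR: heap grown to exactly 512 + 2 slots
```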
Why

Reverse-mode AD through a dynamic loop requires an adstack whose depth equals the forward trip count plus two setup pushes. Until now Quadrants had to resolve that depth at compile time, so every loop bound that involved a field load (`n_bodies`, `n_dofs`, an ndarray shape, an enclosing parallel-for index) fell back to the user-configured `default_ad_stack_size`. For realistic reverse-mode kernels this fallback fails in both directions: an `ndrange(n_bodies, n_dofs)` easily needs >256 slots, triggering a runtime `QuadrantsAssertionError: Adstack overflow`, while a globally bumped default multiplies GPU memory by the number of threads.

The traditional workaround - tune `default_ad_stack_size` per program - is not tunable enough: you cannot set a different default for every adstack, and bumping it globally multiplies GPU memory by the number of threads. This PR replaces the static fallback with a symbolic upper-bound expression captured at compile time and reduced per launch against the live field / ndarray state, so each adstack gets exactly the depth that dispatch needs.

Entry point
`LlvmRuntimeExecutor::publish_adstack_metadata(const AdStackSizingInfo &ad_stack, std::size_t num_threads, LaunchContextBuilder *ctx, void *device_runtime_context_ptr = nullptr)` at quadrants/runtime/llvm/llvm_runtime_executor.cpp. Called from cpu/cuda/amdgpu/kernel_launcher.cpp right before every task's dispatch. It:

- reads the task's `size_exprs` vector (post-order `SerializedSizeExpr` trees, one per `AdStackAllocaStmt`);
- evaluates each tree in adstack_size_expr_eval.cpp, reading scalar field values through `SNodeRwAccessorsBank` and launch-context ndarray shapes / scalar reads through the `LaunchContextBuilder`. The AMDGPU path threads the launcher's device-side `RuntimeContext` staging copy (kernel_launcher.cpp:185-186) through `device_runtime_context_ptr` so HIP kernels that resolve `ExternalTensorRead` leaves against ndarray pointers do not fault on a host pointer; CUDA relies on UVA and passes `nullptr`; CPU ignores the argument;
- publishes the results into `LLVMRuntime::adstack_{per_thread_stride,offsets,max_sizes}` via `runtime_get_adstack_metadata_field_ptrs` so generated code reads them at every push / load-top site;
- grows the backing heap with `ensure_adstack_heap(per_thread_stride * dispatched_threads)`.

Shapes the grammar covers are sized exactly. Shapes outside the grammar (e.g. a `range(field[i])` / `range(ndarray[i])` indexed by a loop-carried variable that the structural walker does not fold) fall back to `default_ad_stack_size` and emit a `QD_WARN` naming the source location so the user can either restructure the loop or extend the grammar.

Mechanism end-to-end
1. Compile-time SizeExpr capture

quadrants/transforms/determine_ad_stack_size.cpp runs a Bellman-Ford-first, structural-walker-second pre-pass that attempts to express each `AdStackAllocaStmt`'s max depth as a `SizeExpr` tree. Grammar covered:

| IR pattern at the loop bound | SizeExpr constructor |
| --- | --- |
| compile-time constant | `SizeExpr::make_const` |
| `GlobalLoadStmt` of a scalar `i32`/`i64` SNode | `make_field_load` with the SNode pointer + constant index path |
| `ExternalTensorShapeAlongAxisStmt` | `make_ndarray_shape(arg_id, axis)` |
| external tensor (ndarray) element read | `make_ndarray_read(arg_id, index_path)` |
| `LoopIndexStmt` of an enclosing bounded range-for / parallel-for | `make_loop_index_max` (bounded by the loop's `end` or a `max` over every push site) |
| `AdStackLoadTopStmt` of a stash adstack | `make_stack_load_top_max` (bounded by a `max` over every push site) |
add,sub,mul(const * anything viaAdd-replication),max.expr_sub's MaxOverRange fusion caps the fused iteration range atmin(shape_a, shape_b)(expressed in-grammar asa + b - max(a, b)) so the fused body cannot read the shorter ndarray past its buffer end when a caller passes mismatched-length ndarrays. Anything else falls back todefault_ad_stack_sizeand emits theQD_WARNnoted above.SizeExprobjects are immutable after the pre-pass writes them and are serialised as post-orderSerializedSizeExpr::nodesso offline-cache hits evaluate the same way as fresh compiles.2. Codegen reads stride / offset / max-size from runtime metadata
2. Codegen reads stride / offset / max-size from runtime metadata

| Site | Emitted code |
| --- | --- |
| codegen_llvm.cpp, `AdStackAllocaStmt` | `heap + linear_tid * runtime->adstack_per_thread_stride + runtime->adstack_offsets[stack_id]` (all three loaded at runtime) |
| codegen_llvm.cpp, `AdStackPushStmt` | `stack_push(runtime, stack, runtime->adstack_max_sizes[stack_id], element_size)` (cap is a runtime load, not an IR immediate) |
| codegen_llvm.cpp, `init_offloaded_task_function` pre-scan | assigns `AdStackAllocaStmt::stack_id` in declaration order; `ad_stack_offsets_` became `std::vector<std::size_t>` indexed by `stack_id` |

The heap base address is unchanged; only the per-thread stride and per-alloca offset math was moved from codegen-time immediates to runtime loads, so kernels compiled with the old flow will recompile but do not break semantics.
3. Host evaluator
adstack_size_expr_eval.{h,cpp} walks the flat post-order tree:
- `FieldLoad` leaves go through `SNodeRwAccessorsBank` to read the current scalar value on the host side.
- Ndarray leaves read through the `LaunchContextBuilder`, which captures both the pointer and the shape on kernel launch.
- `LoopIndexMax` and `StackLoadTopMax` fold to whichever launch-time bound they were captured against.
- All arithmetic is performed in `int64_t`.

The evaluator runs strictly on the host, before the kernel dispatches, so there is no device-side evaluator to keep in sync.
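A minimal model of that host evaluator (hedged: plain Python, not the real `SerializedSizeExpr` layout; node encoding and helper names are made up for the sketch):

```python
# Minimal model of the host evaluator above: a flat post-order node list
# reduced with a value stack (sketch; encoding is illustrative).
def eval_size_expr(nodes, leaf_value):
    # nodes: list of ("const", c) | ("leaf", key) | ("add"|"sub"|"mul"|"max",)
    stack = []
    for node in nodes:
        if node[0] == "const":
            stack.append(node[1])
        elif node[0] == "leaf":
            stack.append(leaf_value(node[1]))     # field / ndarray / bound read
        else:
            b, a = stack.pop(), stack.pop()
            op = node[0]
            stack.append(a + b if op == "add" else a - b if op == "sub"
                         else a * b if op == "mul" else max(a, b))
    return stack[-1]

# max(n, 8) + 2 in post-order:
nodes = [("leaf", "n"), ("const", 8), ("max",), ("const", 2), ("add",)]
print(eval_size_expr(nodes, {"n": 512}.__getitem__))   # -> 514
```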
4. Per-kernel metadata bridge
`runtime_get_adstack_metadata_field_ptrs(LLVMRuntime *)` in runtime.cpp returns pointers to `adstack_{per_thread_stride,offsets,max_sizes}`. The host writes into these fields once per dispatch. On CPU the fields are read directly from `LLVMRuntime`; on CUDA / AMDGPU the bridge triggers a `memcpy_host_to_device` so the generated code's subsequent loads see the right values. The memcpy is gated on the task actually carrying an adstack: kernels with no reverse-mode adstacks skip `publish_adstack_metadata` entirely and pay no per-launch cost.
Per-backend coverage matrix

| Backend launcher | Outside-grammar behaviour | Metadata publish |
| --- | --- | --- |
| cpu/kernel_launcher.cpp | `default_ad_stack_size` fallback + `QD_WARN` | direct field writes (no `memcpy_host_to_device`) |
| cuda/kernel_launcher.cpp | same | `memcpy_host_to_device` |
| amdgpu/kernel_launcher.cpp | same | `memcpy_host_to_device` (host pointer routed through the launcher's device-side `RuntimeContext` staging copy) |

Tests
tests/python/test_adstack.py:
- `test_adstack_field_load_bounded_loop_evaluated_per_launch` - pins the core guarantee: a kernel with `for _ in range(n[None])` bound by an i32 scalar field. Launches at `n_iter in (1, 20, 50)` and asserts the heap resized correctly (no overflow at 50, no over-allocation at 1), with the decorator set to `default_ad_stack_size=2` to prove the knob is bypassed when the grammar resolves the bound symbolically.
- `test_adstack_near_capacity` - rewritten to exercise both sides of the old `default_ad_stack_size=32` boundary at `n_iter in (30, 100)` without any knob override. Arch-restricted to `qd.cpu` until the SPIR-V sizer PR lands.
- `test_adstack_structural_pre_pass_fuses_sub_of_max_over_range_with_matching_shape_ends` / `..._with_mismatched_shape_ends` - pin the `expr_sub` MaxOverRange fusion for both the strict-equality and the cross-ndarray same-axis paths.
- `test_adstack_fuses_sub_of_max_over_range_with_mismatched_lengths_is_safe` - pins the `min(shape_a, shape_b)` end cap so the cross-ndarray fusion stays in-bounds when the two ndarrays have different lengths at launch time (pre-fix the fused body read the shorter ndarray past its buffer end).
- `test_adstack_inner_range_bounded_by_ndarray_read_at_outer_index` / `test_adstack_inner_range_bounded_by_multidim_ndarray_read` / `test_adstack_ext_tensor_read_indexed_by_stashed_outer_loop_var` - cover the `ExternalTensorRead` grammar leaves, including multi-axis stride handling and the stash-then-reload pattern.
Side-effect audit
quadrants/analysis/offline_cache_util.cppserialisesSerializedSizeExpr::nodesalongside the existingsize_in_bytesperAdStackAllocaStmtdefault_ad_stack_size.statements.h:AdStackAllocaStmt::size_expris astd::shared_ptr<SizeExpr>cloned via the defaultQD_DEFINE_ACCEPT_AND_CLONESizeExpris immutable post pre-pass; shared pointer is safe to share across clonesir/ir.cppAdStackAllocaStmt::reprprintsstack_idand a shortSizeExprsummaryAdStackAllocaStmt::stack_id == -1falls through to the olddefault_ad_stack_size-backed pathSizeExprstill behave as beforeSizeExpris stored onAdStackAllocaStmt, not emitted as IR operands, so it does not participate in CSEadstack_{per_thread_stride,offsets,max_sizes}initialised to zero / null inruntime_initialize; kernels with no adstacks read nothing from them and skippublish_adstack_metadataentirelymemcpy_host_to_deviceexpr_sub's cross-ndarray fusion caps the fused end atmin(shape_a, shape_b)Runtime-evaluated SizeExpr for SPIR-V adstack sizing
TL;DR
On LLVM backends (#550) the reverse-mode adstack for `v` is already sized per-launch from the evaluated `SizeExpr` tree. This PR extends the same behaviour to the Metal and Vulkan backends: `compute.grad(a)` now sizes the per-thread heap to exactly `n = a[i]` slots per outer iteration on SPIR-V too, where before it either baked the compile-time fallback into the shader or hit a `max_size > 0` codegen assert on dynamic bounds.

Why
#550 landed `SizeExpr`-at-launch-time sizing for LLVM and seeded a `default_ad_stack_size` compile-time fallback into the SPIR-V immediates so the existing SPIR-V codegen kept asserting cleanly. That seed is safe but wasteful: the SPIR-V shader's per-thread heap slice still has to be provisioned at the worst-case `default_ad_stack_size` for every alloca that was not statically const-foldable, even when the live field values would give a much tighter bound at launch. Kernels like the one above allocate `n_outer_iters * default_ad_stack_size * 8` bytes per loop-carried variable on the device; at `default_ad_stack_size=256` with a few dozen allocas that is tens of MB per thread on Metal - crossing the driver's `maxBufferLength` on anything larger than a toy workload.
AdStackAllocaStmt::size_expris captured duringdetermine_ad_stack_size,OffloadedTask::ad_stack.allocas[i].size_exprpersists through the offline cache, andLlvmRuntimeExecutor::publish_adstack_metadataevaluates each tree right before each dispatch. What was missing on SPIR-V was the final per-launch metadata hand-off to the shader. This PR adds it: one newBufferType::AdStackMetadataSSBO, one newTaskAttributes::ad_stackstruct carrying per-alloca metadata plus serializedSizeExpr, oneGfxRuntime::launch_kernelprelude that evaluates the trees and writes the buffer, and codegen rewrites everyAdStackAllocaStmt/Push/LoadTop/AccAdjointsite to read its stride / offset / max_size from that buffer instead of emitting compile-time immediates.This PR also tightens the grammar-unresolved path on both backends: where #550 falls back to
default_ad_stack_sizewith aQD_WARN, this PR flips the fallback to aQD_ERRORso any adstack outside the grammar is a compile-time error naming the source location. Shapes inside the grammar are sized exactly per launch; shapes outside the grammar are rejected; thead_stack_size=Nescape hatch inqd.init()is retained for stress tests and for bypassing the sizer when debugging.Entry point
The adstack pre-pass `determine_ad_stack_size` already runs for all backends (gated on `autodiff_mode == kReverse && ad_use_stack`) and populates `AdStackAllocaStmt::size_expr`; this PR only wires SPIR-V codegen and the `GfxRuntime` launcher to honour that metadata. The relevant files:
- quadrants/codegen/spirv/kernel_utils.h - adds `TaskAttributes::AdStackSizingAttribs` / `AdStackAllocaAttribs` (replaces the scalar `ad_stack_heap_per_thread_stride_{float,int}` fields) and `BufferType::AdStackMetadata`.
- quadrants/codegen/spirv/spirv_codegen.cpp - serialises each alloca's `size_expr` into `task_attribs_.ad_stack.allocas`; emits runtime loads for stride / offset / max_size via `ensure_ad_stack_metadata_loaded`.
- quadrants/codegen/spirv/adstack_sizer_shader.cpp/.h - the on-device compute shader that evaluates each task's `SizeExpr` bytecode tree and writes per-task stride / offset / max_size into the `AdStackMetadata` buffer.
- quadrants/runtime/gfx/runtime.cpp::`GfxRuntime::launch_kernel` - evaluates every task's `SizeExpr` tree upfront (before the cmdlist opens, to serialise against any reentrant `SNodeRwAccessorsBank` reader-kernel launches), then uploads per-task metadata to the `AdStackMetadata` buffer and binds it alongside the existing heap buffers.
- quadrants/runtime/gfx/runtime.h::`GfxRuntime::Params::program_impl` - propagates the `ProgramImpl` back-reference so `evaluate_adstack_size_expr(...)` can reach `ProgramImpl::program` (same pattern the LLVM side uses).
1. Codegen: serialise per-alloca metadata into
TaskAttributescodegen/spirv/kernel_utils.hAdStackAllocaAttribs { heap_kind, offset_in_elems_compile_time, max_size_compile_time, SerializedSizeExpr size_expr }codegen/spirv/kernel_utils.hAdStackSizingAttribs { per_thread_stride_{float,int}_compile_time, allocas[] }(replaces the two scalarad_stack_heap_per_thread_stride_*fields)codegen/spirv/spirv_codegen.cpp::visit(AdStackAllocaStmt)attribs.size_expr = stmt->size_expr->serialize()when the pre-pass populated itThe pre-scan that sums per-type strides stays put; it just seeds the compile-time fallback values that the host uses when
size_expris empty (offline-cache-hit path without a serialised tree, i.e. pre-PR offline caches).2. Codegen: runtime loads at every adstack site
invoc_id * strideatget_ad_stack_heap_thread_base_{float,int}OpIMul(invoc_id, uint_immediate(stride))OpIMul(invoc_id, OpLoad(metadata_buffer[0 or 1]))offsetatad_stack_heap_{float,int}_ptruint_immediate(heap_primal_offset)OpLoad(metadata_buffer[2 + 2*stack_id])(cached per alloca oninfo.offset_val)max_sizeatAdStackPushStmtandad_stack_top_indexuint_immediate(stmt->max_size)/uint_immediate(max_size - 1)OpLoad(metadata_buffer[2 + 2*stack_id + 1])(cached oninfo.max_size_val) + oneOpISubfor themax - 1capuint_immediate(primal + max_size)OpIAdd(info.offset_val, info.max_size_val)- one derived add instead of a second buffer loadEach
AdStackSpirv::infocaches its SSA ids foroffset_val/max_size_val/adjoint_offset_valon first use and reuses them at every push / load / load_top_adj / acc_adjoint site, so each alloca pays one metadata read regardless of how many push/pop sites it has in the shader body.3. Sizer shader capability gate + scope-state safety
build_adstack_sizer_spirvhard-errors through the empty-return path unless the device advertisesspirv_has_physical_storage_buffer,spirv_has_int64,spirv_has_int8, andspirv_has_int16. Theemit_psb_load_i64per-dtype switch callsir.i{8,16}_type()unconditionally - without those capabilities the accessors return default-constructedSType(id=0)andspirv-valrejects the binary - so gating on all four surfaces a clear "legacy device missing a required hardware feature" diagnostic at launcher time instead of an invalid-bytecode rejection at pipeline creation.GfxRuntime::launch_kernelmirrors the gate so the launcher's error message and the shader builder's return agree.The shader body also clears
The shader body also clears `scope[pending_var_id]` to zero on every `MaxOverRange` pop (`done_lbl`). `scope_arr` is a single function-storage alloca zero-initialised once at `main()` entry, but `var_id_counter` resets per alloca, so different stacks in the same task reuse `scope[0]`, `scope[1]`, and so on. Without the pop-path clear, the *next* stack's outer linear pre-order walk reads the previous stack's terminal bound value as an `ExternalTensorRead` index into a potentially smaller ndarray - out of bounds on Metal (hung command buffer) or silently zero on Vulkan with `robustBufferAccess`. The clear preserves the "index 0 is always valid for any non-empty ndarray" invariant the outer walk relies on.
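A host-side sketch of the pop path. The scope array size, the accumulator bookkeeping, and all signatures are simplifying assumptions; only the `scope[pending_var_id] = 0` clear at `done_lbl` is the point:

```cpp
#include <cstdint>

// Host-side model of the shader's function-storage scope array.
constexpr int kMaxVars = 32;

struct SizerScope {
  int64_t scope[kMaxVars] = {};  // zero-initialised once at main() entry
};

// MaxOverRange pop path (the shader's done_lbl): fold the finished
// frame's result into the accumulator, then zero the slot so the next
// stack's pre-order walk (which reuses scope[0], scope[1], ...) never
// sees this stack's terminal bound value as a live index.
void max_over_range_pop(SizerScope &st, int pending_var_id,
                        int64_t frame_result, int64_t &pending_max_accum) {
  if (frame_result > pending_max_accum)
    pending_max_accum = frame_result;
  st.scope[pending_var_id] = 0;
}
```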
**4. Runtime: upfront metadata evaluation + buffer upload**

`GfxRuntime::launch_kernel` evaluates every task's `SizeExpr` trees before `ensure_current_cmdlist()` opens the per-launch cmdlist. This is the load-bearing ordering change: the evaluator can call `SNodeRwAccessorsBank::read_int(...)` to resolve a field-load-bounded expression, which itself launches a reader kernel through `GfxRuntime::launch_kernel` - recursing into this same function with the outer cmdlist half-built would stomp `current_cmdlist_`. Upfront evaluation keeps the reader kernels serialised against any in-flight cmdlist through the normal submit/synchronize path.
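The ordering, sketched with stand-in types. Everything here is illustrative except the evaluate-before-`ensure_current_cmdlist()` sequencing described above; the exact upload point is an assumption:

```cpp
#include <cstdint>
#include <vector>

struct Task {
  std::vector<uint32_t> metadata_words;  // stride / offset / max_size per alloca
  int metadata_buffer = -1;
};

std::vector<uint32_t> evaluate_adstack_size_expr(const Task &) {
  // May call SNodeRwAccessorsBank::read_int, which launches a reader
  // kernel and re-enters launch(); safe only while no cmdlist is open.
  return {};
}
int upload_metadata(const std::vector<uint32_t> &) { return 0; }  // map_range + memcpy + unmap
int ensure_current_cmdlist() { return 0; }
void record_and_submit(int /*cmdlist*/, std::vector<Task> &) {}

void launch(std::vector<Task> &tasks) {
  // 1) Reentrancy-safe phase: evaluate every SizeExpr tree first.
  for (auto &t : tasks)
    t.metadata_words = evaluate_adstack_size_expr(t);
  // 2) One AdStackMetadata buffer per task (no pooling), so a sibling
  //    task's host memcpy cannot overwrite this task's values.
  for (auto &t : tasks)
    t.metadata_buffer = upload_metadata(t.metadata_words);
  // 3) Only now open the per-launch cmdlist and record the dispatches.
  const int cmdlist = ensure_current_cmdlist();
  record_and_submit(cmdlist, tasks);
}
```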
The metadata buffer is allocated per task (not pooled across the launch) so a sibling task's host memcpy cannot race-overwrite this task's values between cmdlist record and cmdlist submit. Each per-task buffer is retired into `ctx_buffers_` so it stays alive for the full submit/sync window. Layout, in u32 words (matching the `OpLoad` indices in the step-2 table):

- word `0` / `1` - per-thread heap stride for the float / int heap
- word `2 + 2*stack_id` - this alloca's heap offset, in elements
- word `2 + 2*stack_id + 1` - this alloca's `max_size` capacity
device_->map_range(...)memcpy; unmap before bind. The shader reads via the sameBufferType::AdStackMetadataaccess-chain plumbing as every other StorageBuffer in the task's bind set.5. Program <-> runtime plumbing
**5. Program <-> runtime plumbing**

`GfxRuntime::Params` gains a `ProgramImpl *program_impl` field, set by `MetalProgramImpl::materialize_runtime` and `VulkanProgramImpl::materialize_runtime` to `this`. The runtime uses it to reach `program_impl_->program` for `evaluate_adstack_size_expr(...)`. `Program::Program` already sets `program_impl_->program = this` post-`materialize_runtime`, so the back-reference is non-null by the time any `launch_kernel` runs.

### Per-backend coverage matrix
The SPIR-V backends also continue to honour the `max_size_compile_time` fallback on the offline-cache-hit path where the symbolic `SizeExpr` tree was not serialised (empty `size_expr.nodes`); the runtime falls back to it verbatim, clamping to `>= 1` like the LLVM counterpart does.
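A sketch of that per-alloca fallback selection, with the evaluator stubbed out (types reuse the illustrative step-1 sketch):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct SerializedSizeExpr {
  std::vector<uint32_t> nodes;  // empty on a pre-PR offline-cache hit
};

uint32_t evaluate_size_expr(const SerializedSizeExpr &) {
  return 0;  // stub for the real host/device evaluator
}

// Prefer the symbolic tree; otherwise take the compile-time value
// clamped to >= 1, matching the LLVM side.
uint32_t resolve_max_size(const SerializedSizeExpr &size_expr,
                          uint32_t max_size_compile_time) {
  if (!size_expr.nodes.empty())
    return evaluate_size_expr(size_expr);
  return std::max<uint32_t>(1u, max_size_compile_time);
}
```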
### Tests

#### New tests
- `test_adstack_bounded_inner_loop_sized_by_structural_prepass` (`tests/python/test_adstack.py`): pins the SPIR-V structural pre-pass correctness - a kernel whose inner range-for trip count is a constant product that the pre-pass must fold while the CFG analyzer would fall back to a pessimistic default. Arch-restricted to Metal / Vulkan because the LLVM backend routes the same kernel through the symbolic-tree path instead.
- `test_adstack_spirv_metadata_per_task_buffer` (`tests/python/test_adstack.py`, `cfg_optimization=False`): pins the per-task `AdStackMetadata` buffer fix - with a shared buffer the second offload's host memcpy overwrites the first offload's metadata before submit, causing an `Adstack overflow (offending stack_id=0)` at `qd.sync()` even though each stack's sizer-computed bound is correct. `cfg_optimization=False` is load-bearing; it keeps the two sibling offloads separate so the record-then-execute race surfaces.
- `AdStackSizerShader.GateReturnsEmptyWhenRequiredCapIsMissing` (`tests/cpp/codegen/adstack_sizer_shader_test.cpp`): pins the capability gate - dropping any one of PSB, Int64, Int8, Int16 from a synthetic caps config flips `build_adstack_sizer_spirv` to the empty-return path.
- `AdStackSizerShader.DumpBinary` (`tests/cpp/codegen/adstack_sizer_shader_test.cpp`): sanity-only smoke test that dumps the compiled binary to `/tmp/adstack_sizer.spv` for local `spirv-val` / `spirv-dis` use.

#### Widened arch coverage
Existing tests that exercised the runtime-evaluated adstack only on LLVM now run on SPIR-V too (`test_adstack_field_load_bounded_loop_evaluated_per_launch`, `test_adstack_inner_range_bounded_by_ndarray_read_at_outer_index`, `test_adstack_ext_tensor_read_indexed_by_stashed_outer_loop_var`, `test_adstack_structural_pre_pass_fuses_sub_of_max_over_range_with_matching_shape_ends`, `test_adstack_structural_pre_pass_fuses_sub_of_max_over_range_with_mismatched_shape_ends`), by dropping the `arch=[qd.cpu, ...]` restriction that scoped them to the LLVM sizer path.

#### Known coverage gaps
There is no dedicated Python regression test for the `scope[pending_var_id]` pop-path clear. The failure mode is hardware-dependent: on Metal / MoltenVK, and on Vulkan with `robustBufferAccess`, the spurious pre-pass out-of-bounds read is either caught silently or returns zero, both of which `pending_max_accum` (which starts at 0 at every `MaxOverRange` push) absorbs without surfacing. A test that fails deterministically would need a Vulkan device without robust buffer access or a simulated evaluator. The fix is mechanism-verified through code review and covered indirectly by the full adstack suite running through the sizer.
**Full suite:** `pytest tests/python/test_adstack.py -n 8` -> 712 passed, 7 xfailed on macOS arm64 (CPU + Metal + Vulkan / SwiftShader); wider backend coverage runs through the Linux AMDGPU box.

### Side-effect audit
- Serialisation: `TaskAttributes::QD_IO_DEF(..., ad_stack)` via the new `AdStackSizingAttribs::QD_IO_DEF(per_thread_stride_{float,int}_compile_time, allocas)` and `AdStackAllocaAttribs::QD_IO_DEF(heap_kind, offset_in_elems_compile_time, max_size_compile_time, size_expr)`.
- Offline cache: the serialised layout follows the `TaskAttributes` change. Pre-PR caches without the new fields are invalidated on first load and recompiled.
- Binding layout: `TaskCodegen::binding_head_` unchanged; `AdStackMetadata` is appended lazily via `get_buffer_value(...)` like every other buffer kind, so existing bindings (Args, Rets, Root, ExtArr, AdStackOverflow, AdStackHeap*) keep their binding indices.
- Non-adstack kernels: `attribs.ad_stack.allocas.empty()` short-circuits both the runtime evaluation loop and the metadata buffer allocation; codegen never calls `get_ad_stack_metadata_buffer()` in that case, so no `BufferType::AdStackMetadata` entry is added to `buffer_binds`.
- `ensure_current_cmdlist()` in `launch_kernel`: `SNodeRwAccessorsBank::read_int` reader-kernel launches now go through the normal submit/sync path; no partially-built cmdlist to stomp.
- `max_size` immediates and compile-time strides: the pre-scan still computes `ad_stack_heap_per_thread_stride_{float,int}_` as a compile-time sum and carries them into `AdStackSizingAttribs::per_thread_stride_{float,int}_compile_time`; the runtime uses those when every alloca's `size_expr.nodes.empty()` (pre-PR offline cache hit).
- `invoc_id * stride`: `get_ad_stack_heap_thread_base_{float,int}` still eagerly emits at the first alloca site, just with an `OpLoad` of the metadata stride instead of a `uint_immediate`.
- Buffer lifetime: each task allocates its own `AdStackMetadata` buffer and retires the previous one into `ctx_buffers_`.
- `scope_arr` reuse: `done_lbl` clears `scope[pending_var_id]` to zero.
- `adstack_sizer_shader.cpp` comments use ASCII arrows (`->`) only.