
[AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr #543

Merged
duburcqa merged 37 commits into main from
duburcqa/adstack_spirv_runtime_size
Apr 25, 2026

Conversation

@duburcqa duburcqa (Contributor) commented Apr 22, 2026

Runtime-evaluated adstack sizing on LLVM backends

Makes reverse-mode adstack capacity automatic on CPU / CUDA / AMDGPU: the compiler captures a symbolic upper-bound expression per adstack alloca and the host launcher resizes the per-thread heap just before each dispatch. No user-facing API change; default_ad_stack_size / ad_stack_size remain accepted in qd.init() and the runtime falls back to default_ad_stack_size (with a QD_WARN) when an adstack's enclosing loop shape is outside the SizeExpr grammar. Shapes inside the grammar are sized exactly at launch time.

TL;DR

import quadrants as qd

qd.init(arch=qd.cpu, ad_stack_experimental_enabled=True)

n = qd.field(qd.i32, shape=())
x = qd.field(qd.f32, shape=16, needs_grad=True)
y = qd.field(qd.f32, shape=(), needs_grad=True)

@qd.kernel
def compute():
    for i in x:
        v = x[i]
        for _ in range(n[None]):           # trip count known only at launch time
            v = v * 0.95 + 0.01
        y[None] += v

n[None] = 512
compute()
y.grad[None] = 1.0
compute.grad()                              # no QuadrantsAssertionError: adstack sized to 512 + 2

Before this PR, the inner range(n[None]) forced the CFG analyzer into its default_ad_stack_size=256 fallback, so n[None] = 512 overflowed at the next qd.sync(). After this PR the compiler captures the loop bound as a SizeExpr, the launcher evaluates it against n[None] at launch time, and the backing heap is grown to exactly 512 + 2 slots.

Why

Reverse-mode AD through a dynamic loop requires an adstack whose depth equals the forward trip count plus two setup pushes. Until now Quadrants had to resolve that depth at compile time, so every loop bound that involved a field load (n_bodies, n_dofs, an ndarray shape, an enclosing parallel-for index) fell back to the user-configured default_ad_stack_size. For realistic reverse-mode kernels this is both:

  • Wrong-too-small: a 14-dof mechanism stepping through an ndrange(n_bodies, n_dofs) easily needs >256 slots, triggering a runtime QuadrantsAssertionError: Adstack overflow.
  • Wrong-too-big: bumping the default to cover worst case inflates every adstack in the program, including the ones the compiler could prove were tiny, and scales linearly with the dispatched ndrange on GPU.

The traditional workaround - tuning default_ad_stack_size per program - is too coarse: it cannot set a different value for every adstack, and bumping it globally multiplies GPU memory by the number of threads. This PR replaces the static fallback with a symbolic upper-bound expression captured at compile time and evaluated per launch against the live field / ndarray state, so each adstack gets exactly the depth that dispatch needs.

Entry point

LlvmRuntimeExecutor::publish_adstack_metadata(const AdStackSizingInfo &ad_stack, std::size_t num_threads, LaunchContextBuilder *ctx, void *device_runtime_context_ptr = nullptr) at quadrants/runtime/llvm/llvm_runtime_executor.cpp. Called from cpu/cuda/amdgpu/kernel_launcher.cpp right before every task's dispatch. It:

  1. Walks the task's size_exprs vector (post-order SerializedSizeExpr trees, one per AdStackAllocaStmt).
  2. Evaluates each tree via adstack_size_expr_eval.cpp, reading scalar field values through SNodeRwAccessorsBank and launch-context ndarray shapes / scalar reads through the LaunchContextBuilder. The AMDGPU path threads the launcher's device-side RuntimeContext staging copy (kernel_launcher.cpp:185-186) through device_runtime_context_ptr so HIP kernels that resolve ExternalTensorRead leaves against ndarray pointers do not fault on a host pointer; CUDA relies on UVA and passes nullptr; CPU ignores the argument.
  3. Computes per-thread stride, per-alloca offsets, and per-alloca max-sizes.
  4. Writes these into LLVMRuntime::adstack_{per_thread_stride,offsets,max_sizes} via runtime_get_adstack_metadata_field_ptrs so generated code reads them at every push / load-top site.
  5. Grows the per-thread heap backing buffer via ensure_adstack_heap(per_thread_stride * dispatched_threads).

Shapes the grammar covers are sized exactly. Shapes outside the grammar (e.g. a range(field[i]) / range(ndarray[i]) indexed by a loop-carried variable that the structural walker does not fold) fall back to default_ad_stack_size and emit a QD_WARN naming the source location so the user can either restructure the loop or extend the grammar.
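
For concreteness, here is a hypothetical kernel shape that falls outside the grammar: the index `j` below is loop-carried arithmetic the structural walker does not fold, so the bound `bounds[j]` cannot be captured as a SizeExpr. Field names are illustrative, not from the PR.

```python
import quadrants as qd

qd.init(arch=qd.cpu, ad_stack_experimental_enabled=True, default_ad_stack_size=64)

bounds = qd.field(qd.i32, shape=8)
x = qd.field(qd.f32, shape=8, needs_grad=True)
y = qd.field(qd.f32, shape=(), needs_grad=True)

@qd.kernel
def compute():
    for i in x:
        v = x[i]
        j = (i * 3) % 5                # loop-carried arithmetic: not a grammar leaf
        for _ in range(bounds[j]):     # field load at a non-foldable index
            v = v * 0.95 + 0.01
        y[None] += v

compute()  # compiles, emits the QD_WARN, and sizes this adstack at default_ad_stack_size
```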

Mechanism end-to-end

1. Compile-time SizeExpr capture

quadrants/transforms/determine_ad_stack_size.cpp runs a Bellman-Ford-first, structural-walker-second pre-pass that attempts to express each AdStackAllocaStmt's max depth as a SizeExpr tree. Grammar covered:

| Leaf | Source | Lowering |
| --- | --- | --- |
| Const | integer literal | SizeExpr::make_const |
| FieldLoad | GlobalLoadStmt of a scalar i32 / i64 SNode | make_field_load with the SNode pointer + constant index path |
| NdarrayShape | ExternalTensorShapeAlongAxisStmt | make_ndarray_shape(arg_id, axis) |
| NdarrayRead | scalar read from an integer ndarray | make_ndarray_read(arg_id, index_path) |
| LoopIndexMax | LoopIndexStmt of an enclosing bounded range-for / parallel-for | make_loop_index_max (bounded by the loop's end or a max over every push site) |
| StackLoadTopMax | AdStackLoadTopStmt of a stash adstack | make_stack_load_top_max (bounded by a max over every push site) |

Inner operators: add, sub, mul (const * anything via Add-replication), max. expr_sub's MaxOverRange fusion caps the fused iteration range at min(shape_a, shape_b) (expressed in-grammar as a + b - max(a, b)) so the fused body cannot read the shorter ndarray past its buffer end when a caller passes mismatched-length ndarrays. Anything else falls back to default_ad_stack_size and emits the QD_WARN noted above. SizeExpr objects are immutable after the pre-pass writes them and are serialised as post-order SerializedSizeExpr::nodes so offline-cache hits evaluate the same way as fresh compiles.
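
As an illustration of the serialised form (a Python stand-in, not the actual C++ encoding), the TL;DR kernel's range(n[None]) bound plus the two setup pushes would flatten to a three-node post-order tree:

```python
# Hypothetical model of SerializedSizeExpr::nodes for "n[None] + 2".
# A post-order walk pushes leaf values and pops operands for operators,
# so an offline-cache hit replays exactly the same evaluation as a fresh compile.
nodes = [
    ("FieldLoad", {"snode": "n", "index_path": ()}),  # read n[None] at launch
    ("Const",     {"value": 2}),                      # the two setup pushes
    ("Add",       {}),                                # pop both, push the sum
]
```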

2. Codegen reads stride / offset / max-size from runtime metadata

| File | Change |
| --- | --- |
| codegen_llvm.cpp | AdStackAllocaStmt base addr = heap + linear_tid * runtime->adstack_per_thread_stride + runtime->adstack_offsets[stack_id] (all three loaded at runtime) |
| codegen_llvm.cpp | AdStackPushStmt stack_push(runtime, stack, runtime->adstack_max_sizes[stack_id], element_size) (cap is a runtime load, not an IR immediate) |
| init_offloaded_task_function | pre-scan assigns AdStackAllocaStmt::stack_id in declaration order; ad_stack_offsets_ became std::vector<std::size_t> indexed by stack_id |

The heap base address is unchanged; only the per-thread stride and per-alloca offset math moved from codegen-time immediates to runtime loads, so kernels compiled under the old flow recompile but keep the same semantics.

Note: default_ad_stack_size and ad_stack_size are ignored on LLVM for adstacks whose bound is inside the grammar. They remain in qd.init() for SPIR-V backwards compatibility and for the warn-and-fallback path where the pre-pass cannot resolve the loop shape symbolically.

3. Host evaluator

adstack_size_expr_eval.{h,cpp} walks the flat post-order tree:

  • Field loads go through the existing SNodeRwAccessorsBank to read the current scalar value on the host side.
  • Ndarray shape / scalar reads go through LaunchContextBuilder which captures both the pointer and the shape on kernel launch.
  • LoopIndexMax and StackLoadTopMax fold to whichever launch-time bound they were captured against.
  • Arithmetic is saturating int64_t.

The evaluator runs strictly on the host, before the kernel dispatches, so there is no device-side evaluator to keep in sync.
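
A minimal Python model of that walk, assuming the node shapes sketched earlier (the real evaluator is the C++ in adstack_size_expr_eval.cpp; LoopIndexMax / StackLoadTopMax are omitted here because they fold to the bound they were captured against, and Python's unbounded ints have to emulate the saturating int64_t arithmetic explicitly):

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def saturate(v: int) -> int:
    # Mimic the C++ evaluator's saturating int64_t arithmetic.
    return max(INT64_MIN, min(INT64_MAX, v))

def eval_size_expr(nodes, read_field, read_ndarray, ndarray_shape):
    # Fold the flat post-order tree with an explicit operand stack.
    stack = []
    for kind, args in nodes:
        if kind == "Const":
            stack.append(args["value"])
        elif kind == "FieldLoad":
            stack.append(read_field(args["snode"], args["index_path"]))
        elif kind == "NdarrayShape":
            stack.append(ndarray_shape(args["arg_id"], args["axis"]))
        elif kind == "NdarrayRead":
            stack.append(read_ndarray(args["arg_id"], args["index_path"]))
        elif kind in ("Add", "Sub", "Mul", "Max"):
            b, a = stack.pop(), stack.pop()
            result = {"Add": a + b, "Sub": a - b, "Mul": a * b, "Max": max(a, b)}[kind]
            stack.append(saturate(result))
        else:
            raise ValueError(f"unsupported SizeExpr node: {kind}")
    return stack.pop()
```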

4. Per-kernel metadata bridge

runtime_get_adstack_metadata_field_ptrs(LLVMRuntime *) in runtime.cpp returns pointers to adstack_{per_thread_stride,offsets,max_sizes}. The host writes into these fields once per dispatch. On CPU the fields are read directly from LLVMRuntime; on CUDA / AMDGPU the bridge triggers a memcpy_host_to_device so the generated code's subsequent loads see the right values. The memcpy is gated on the task actually carrying an adstack: kernels with no reverse-mode adstacks skip publish_adstack_metadata entirely and pay no per-launch cost.
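
To make steps 3-5 of the entry point concrete, here is a simplified sketch of the per-dispatch packing the host performs before that write (the float/int heap split and byte alignment are glossed over; function and parameter names are illustrative):

```python
def pack_adstack_metadata(evaluated_depths, element_sizes):
    # evaluated_depths[i]: slots the SizeExpr evaluation gave alloca i.
    # element_sizes[i]:    bytes per slot for alloca i.
    offsets, max_sizes, cursor = [], [], 0
    for depth, elem_size in zip(evaluated_depths, element_sizes):
        offsets.append(cursor)       # alloca i starts where the previous one ended
        max_sizes.append(depth)      # runtime overflow cap read at every push site
        cursor += depth * elem_size
    per_thread_stride = cursor       # one contiguous slab per thread
    return per_thread_stride, offsets, max_sizes

# The backing buffer is then grown to per_thread_stride * dispatched_threads bytes.
```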

Per-backend coverage matrix

| Backend | Pre-pass | Metadata bridge | Launcher hook | Status |
| --- | --- | --- | --- | --- |
| CPU (LLVM) | ok | direct field access | cpu/kernel_launcher.cpp | auto-sized inside the grammar; default_ad_stack_size fallback + QD_WARN outside |
| CUDA (LLVM) | ok | memcpy_host_to_device | cuda/kernel_launcher.cpp | same |
| AMDGPU (LLVM) | ok | memcpy_host_to_device (host pointer routed through the launcher's device-side RuntimeContext staging copy) | amdgpu/kernel_launcher.cpp | same |
| Metal (SPIR-V) | N/A here | N/A here | N/A here | untouched - still compile-time, handled by the stacked SPIR-V sizer PR |
| Vulkan (SPIR-V) | N/A here | N/A here | N/A here | untouched - still compile-time, handled by the stacked SPIR-V sizer PR |

Tests

tests/python/test_adstack.py

  • test_adstack_field_load_bounded_loop_evaluated_per_launch - pins the core guarantee: a kernel with for _ in range(n[None]) bound by an i32 scalar field. Launches at n_iter in (1, 20, 50) and asserts the heap resized correctly (no overflow at 50, no over-allocation at 1), with the decorator set to default_ad_stack_size=2 to prove the knob is bypassed when the grammar resolves the bound symbolically.
  • test_adstack_near_capacity - rewritten to exercise both sides of the old default_ad_stack_size=32 boundary at n_iter in (30, 100) without any knob override. Arch-restricted to qd.cpu until the SPIR-V sizer PR lands.
  • test_adstack_structural_pre_pass_fuses_sub_of_max_over_range_with_matching_shape_ends / ..._with_mismatched_shape_ends - pin the expr_sub MaxOverRange fusion for both the strict-equality and the cross-ndarray same-axis paths.
  • test_adstack_fuses_sub_of_max_over_range_with_mismatched_lengths_is_safe - pins the min(shape_a, shape_b) end cap so the cross-ndarray fusion stays in-bounds when the two ndarrays have different lengths at launch time (pre-fix the fused body read the shorter ndarray past its buffer end).
  • test_adstack_inner_range_bounded_by_ndarray_read_at_outer_index / test_adstack_inner_range_bounded_by_multidim_ndarray_read / test_adstack_ext_tensor_read_indexed_by_stashed_outer_loop_var - cover the ExternalTensorRead grammar leaves, including multi-axis stride handling and the stash-then-reload pattern.

tests/cpp/transforms/determine_ad_stack_size_test.cpp

  • SizeExprBuildFromRangeFor - const bounds fold to a Const leaf.
  • SizeExprBuildFromFieldLoad - scalar i32 field load builds a FieldLoad leaf with the right SNode + index path.
  • SizeExprBuildFromLoopIndex - inner for j in range(i) under an enclosing bounded loop folds to LoopIndexMax.
  • SizeExprRejectsUnsupportedOp - negative control that an unsupported operator routes through the warn-and-fallback path.

Deliberate coverage gap: no SPIR-V cross-check here - the SPIR-V runtime metadata path lives in the stacked SPIR-V sizer PR and its tests cover the shader-side reading.

Side-effect audit

| Concern | Where checked | Verdict |
| --- | --- | --- |
| Offline cache key | quadrants/analysis/offline_cache_util.cpp serialises SerializedSizeExpr::nodes alongside the existing size_in_bytes per AdStackAllocaStmt | ok - cache-hit reloads the same tree; evaluator runs the same on fresh + cache-hit. Empty-tree caches (pre-PR offline caches) fall back to default_ad_stack_size. |
| Stmt clone / serialization | statements.h: AdStackAllocaStmt::size_expr is a std::shared_ptr<SizeExpr> cloned via the default QD_DEFINE_ACCEPT_AND_CLONE | ok - SizeExpr is immutable post pre-pass; the shared pointer is safe to share across clones |
| IR printer | ir/ir.cpp AdStackAllocaStmt::repr prints stack_id and a short SizeExpr summary | ok - IR dump readable |
| Codegen non-feature path | AdStackAllocaStmt::stack_id == -1 falls through to the old default_ad_stack_size-backed path | ok - kernels compiled without the pre-pass writing a SizeExpr still behave as before |
| Whole-kernel CSE | SizeExpr is stored on AdStackAllocaStmt, not emitted as IR operands, so it does not participate in CSE | ok |
| LaunchContextBuilder | extended to capture scalar + shape for every ndarray arg; no change to the existing argument-pack layout - new entries appended | ok - old kernels still build their context the same way |
| Runtime initialization | adstack_{per_thread_stride,offsets,max_sizes} initialised to zero / null in runtime_initialize; kernels with no adstacks read nothing from them and skip publish_adstack_metadata entirely | ok - zero-adstack kernels pay nothing |
| Multi-threaded CPU launch | metadata is written once per dispatch before the thread pool starts; no concurrent writers | ok |
| CUDA / AMDGPU memcpy_host_to_device | one memcpy per adstack-bearing task per launch; measured overhead ~microseconds. Forward-only kernels and reverse kernels whose pre-pass resolves all bounds at compile time are exempt. | acceptable - would be visible only on workloads that launch thousands of tiny autodiff kernels per second |
| Mismatched-length ndarray fusion | expr_sub's cross-ndarray fusion caps the fused end at min(shape_a, shape_b) | ok - pre-fix UB on mismatched shapes; post-fix both reads stay in-bounds |

Runtime-evaluated SizeExpr for SPIR-V adstack sizing

Stacked on #550. No behaviour change for kernels that do not opt into the experimental adstack path (ad_stack_experimental_enabled=True). Mirrors #550's LLVM runtime-metadata plumbing onto SPIR-V so Metal and Vulkan kernels size their per-dispatch adstack heaps from launch-time-evaluated SizeExpr trees, moving the shader-baked per-thread stride / offset / max_size from compile-time immediates to runtime loads out of a new AdStackMetadata buffer. It also tightens the #550-layer warn-and-fallback into a QD_ERROR on unresolved grammar shapes, so both backends converge on a single contract: inside-grammar shapes are sized exactly, outside-grammar shapes are a compile error.

TL;DR

@qd.kernel
def compute(a: qd.types.ndarray(dtype=qd.i32, ndim=1)):
    for i in x:                                    # outer struct-for
        v = x[i]
        n = a[i]                                    # ndarray-read-at-loop-index bound
        for _ in range(n):                          # dynamic inner range
            v = v * 0.95 + 0.01
        y[None] += v

On LLVM backends (#550) the reverse-mode adstack for v is already sized per-launch from the evaluated SizeExpr tree. This PR extends the same behaviour to the Metal and Vulkan backends: compute.grad(a) now sizes the per-thread heap to exactly n = a[i] slots per outer iteration on SPIR-V too, where before it either baked the compile-time fallback into the shader or hit a max_size > 0 codegen assert on dynamic bounds.

Why

#550 landed SizeExpr-at-launch-time sizing for LLVM and seeded a default_ad_stack_size compile-time fallback into the SPIR-V immediates so the existing SPIR-V codegen kept asserting cleanly. That seed is safe but wasteful: the SPIR-V shader's per-thread heap slice still has to be provisioned at the worst-case default_ad_stack_size for every alloca that was not statically const-foldable, even when the live field values would give a much tighter bound at launch. Kernels like the one above allocate n_outer_iters * default_ad_stack_size * 8 bytes per loop-carried variable on the device; at default_ad_stack_size=256 with a few dozen allocas that is tens of MB per thread on Metal - crossing the driver's maxBufferLength on anything larger than a toy workload.
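
A back-of-envelope with assumed numbers, just to show how the compile-time fallback scales:

```python
n_outer_iters = 16_384        # dispatched threads, one heap slice each (assumed)
default_ad_stack_size = 256   # fallback slots per non-const-foldable alloca
bytes_per_slot = 8            # f32 primal + adjoint
n_allocas = 24                # "a few dozen" loop-carried variables (assumed)

total = n_outer_iters * default_ad_stack_size * bytes_per_slot * n_allocas
print(total / 2**20)          # 768.0 MiB for this single hypothetical dispatch
```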

The LLVM infrastructure is a complete blueprint: AdStackAllocaStmt::size_expr is captured during determine_ad_stack_size, OffloadedTask::ad_stack.allocas[i].size_expr persists through the offline cache, and LlvmRuntimeExecutor::publish_adstack_metadata evaluates each tree right before each dispatch. What was missing on SPIR-V was the final per-launch metadata hand-off to the shader. This PR adds it: one new BufferType::AdStackMetadata SSBO, one new TaskAttributes::ad_stack struct carrying per-alloca metadata plus serialized SizeExpr, one GfxRuntime::launch_kernel prelude that evaluates the trees and writes the buffer, and codegen rewrites every AdStackAllocaStmt / Push / LoadTop / AccAdjoint site to read its stride / offset / max_size from that buffer instead of emitting compile-time immediates.

This PR also tightens the grammar-unresolved path on both backends: where #550 falls back to default_ad_stack_size with a QD_WARN, this PR flips the fallback to a QD_ERROR so any adstack outside the grammar is a compile-time error naming the source location. Shapes inside the grammar are sized exactly per launch; shapes outside the grammar are rejected; the ad_stack_size=N escape hatch in qd.init() is retained for stress tests and for bypassing the sizer when debugging.

Entry point

The adstack pre-pass determine_ad_stack_size already runs for all backends (gated on autodiff_mode == kReverse && ad_use_stack) and populates AdStackAllocaStmt::size_expr; this PR only wires SPIR-V codegen and the GfxRuntime launcher to honour that metadata. The relevant files:

  • quadrants/codegen/spirv/kernel_utils.h - adds TaskAttributes::AdStackSizingAttribs / AdStackAllocaAttribs (replaces the scalar ad_stack_heap_per_thread_stride_{float,int} fields) and BufferType::AdStackMetadata.
  • quadrants/codegen/spirv/spirv_codegen.cpp - serialises each alloca's size_expr into task_attribs_.ad_stack.allocas; emits runtime loads for stride / offset / max_size via ensure_ad_stack_metadata_loaded.
  • quadrants/codegen/spirv/adstack_sizer_shader.cpp / .h - the on-device compute shader that evaluates each task's SizeExpr bytecode tree and writes per-task stride / offset / max_size into the AdStackMetadata buffer.
  • quadrants/runtime/gfx/runtime.cpp::GfxRuntime::launch_kernel - evaluates every task's SizeExpr tree upfront (before the cmdlist opens, to serialise against any reentrant SNodeRwAccessorsBank reader-kernel launches), then uploads per-task metadata to the AdStackMetadata buffer and binds it alongside the existing heap buffers.
  • quadrants/runtime/gfx/runtime.h::GfxRuntime::Params::program_impl - propagates the ProgramImpl back-reference so evaluate_adstack_size_expr(...) can reach ProgramImpl::program (same pattern the LLVM side uses).

Mechanism end-to-end

1. Codegen: serialise per-alloca metadata into TaskAttributes

| Layer | File | What's new |
| --- | --- | --- |
| Per-alloca attribs | codegen/spirv/kernel_utils.h | AdStackAllocaAttribs { heap_kind, offset_in_elems_compile_time, max_size_compile_time, SerializedSizeExpr size_expr } |
| Per-task attribs | codegen/spirv/kernel_utils.h | AdStackSizingAttribs { per_thread_stride_{float,int}_compile_time, allocas[] } (replaces the two scalar ad_stack_heap_per_thread_stride_* fields) |
| Capture at alloca site | codegen/spirv/spirv_codegen.cpp::visit(AdStackAllocaStmt) | attribs.size_expr = stmt->size_expr->serialize() when the pre-pass populated it |

The pre-scan that sums per-type strides stays put; it just seeds the compile-time fallback values that the host uses when size_expr is empty (offline-cache-hit path without a serialised tree, i.e. pre-PR offline caches).

2. Codegen: runtime loads at every adstack site

| Site | Before | After |
| --- | --- | --- |
| invoc_id * stride at get_ad_stack_heap_thread_base_{float,int} | OpIMul(invoc_id, uint_immediate(stride)) | OpIMul(invoc_id, OpLoad(metadata_buffer[0 or 1])) |
| Per-alloca offset at ad_stack_heap_{float,int}_ptr | uint_immediate(heap_primal_offset) | OpLoad(metadata_buffer[2 + 2*stack_id]) (cached per alloca on info.offset_val) |
| max_size at AdStackPushStmt and ad_stack_top_index | uint_immediate(stmt->max_size) / uint_immediate(max_size - 1) | OpLoad(metadata_buffer[2 + 2*stack_id + 1]) (cached on info.max_size_val) + one OpISub for the max - 1 cap |
| Adjoint offset for f32 adstacks | uint_immediate(primal + max_size) | OpIAdd(info.offset_val, info.max_size_val) - one derived add instead of a second buffer load |

Each AdStackSpirv::info caches its SSA ids for offset_val / max_size_val / adjoint_offset_val on first use and reuses them at every push / load / load_top_adj / acc_adjoint site, so each alloca pays one metadata read regardless of how many push/pop sites it has in the shader body.

Note: the invoc_id * stride OpIMul stays eagerly emitted at the first alloca site (not lazily at the first push) so it dominates every sibling inner-loop body - the same SPIR-V dominance rule #493's original heap-backing work already enforced. The only change is the stride now comes from OpLoad(metadata[0]) instead of uint_immediate(...).

3. Sizer shader capability gate + scope-state safety

build_adstack_sizer_spirv hard-errors through the empty-return path unless the device advertises spirv_has_physical_storage_buffer, spirv_has_int64, spirv_has_int8, and spirv_has_int16. The emit_psb_load_i64 per-dtype switch calls ir.i{8,16}_type() unconditionally - without those capabilities the accessors return default-constructed SType(id=0) and spirv-val rejects the binary - so gating on all four surfaces a clear "legacy device missing a required hardware feature" diagnostic at launcher time instead of an invalid-bytecode rejection at pipeline creation. GfxRuntime::launch_kernel mirrors the gate so the launcher's error message and the shader builder's return agree.

The shader body also clears scope[pending_var_id] to zero on every MaxOverRange pop (done_lbl). scope_arr is a single function-storage alloca zero-initialised once at main() entry, but var_id_counter resets per alloca, so different stacks in the same task reuse scope[0], scope[1], and so on. Without the pop-path clear, the NEXT stack's outer linear pre-order walk reads the previous stack's terminal bound value as an ExternalTensorRead index into a potentially-smaller ndarray - out-of-bounds on Metal (hung command buffer) or silently-zero on Vulkan with robustBufferAccess. The clear preserves the "index 0 is always valid for any non-empty ndarray" invariant the outer walk relies on.

4. Runtime: upfront metadata evaluation + buffer upload

GfxRuntime::launch_kernel evaluates every task's SizeExpr trees before ensure_current_cmdlist() opens the per-launch cmdlist. This is the load-bearing ordering change: the evaluator can call SNodeRwAccessorsBank::read_int(...) to resolve a field-load-bounded expression, which itself launches a reader kernel through GfxRuntime::launch_kernel - recursing into this same function with the outer cmdlist half-built would stomp current_cmdlist_. Upfront evaluation keeps the reader kernels serialised against any in-flight cmdlist through the normal submit/synchronize path.

The metadata buffer is allocated per task (not pooled across the launch) so a sibling task's host memcpy cannot race-overwrite this task's values between cmdlist record and cmdlist submit. Each per-task buffer is retired into ctx_buffers_ so it stays alive for the full submit/sync window. Layout:

u32[0]                 = stride_float
u32[1]                 = stride_int
u32[2 + 2*i + 0]       = alloca i's offset in its heap (elements)
u32[2 + 2*i + 1]       = alloca i's max_size

Host write happens through a device_->map_range(...) memcpy; unmap before bind. The shader reads via the same BufferType::AdStackMetadata access-chain plumbing as every other StorageBuffer in the task's bind set.
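
A host-side stand-in for that write (Python for illustration; the real code memcpys into the mapped range directly, and the buffer handles here are hypothetical):

```python
import struct

def pack_metadata_words(stride_float, stride_int, allocas):
    # allocas: [(offset_in_elems, max_size), ...] indexed by stack_id,
    # matching the u32 layout above.
    words = [stride_float, stride_int]
    for offset, max_size in allocas:
        words.extend((offset, max_size))
    return struct.pack(f"<{len(words)}I", *words)  # little-endian u32s

blob = pack_metadata_words(4, 2, [(0, 66), (66, 130)])
# device_->map_range(...), memcpy blob, unmap, then bind as AdStackMetadata
```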

5. Program <-> runtime plumbing

GfxRuntime::Params gains a ProgramImpl *program_impl field, set by MetalProgramImpl::materialize_runtime and VulkanProgramImpl::materialize_runtime to this. The runtime uses it to reach program_impl_->program for evaluate_adstack_size_expr(...). Program::Program already sets program_impl_->program = this post-materialize_runtime, so the back-reference is non-null by the time any launch_kernel runs.

Per-backend coverage matrix

| Backend | Pre-pass (from #550) | Codegen runtime loads | Per-launch evaluation | Heap grow on stride change |
| --- | --- | --- | --- | --- |
| CPU (LLVM) | yes | yes (from #550) | yes (from #550) | yes (from #550) |
| CUDA (LLVM) | yes | yes (from #550) | yes (from #550) | yes (from #550) |
| AMDGPU (LLVM) | yes | yes (from #550) | yes (from #550) | yes (from #550) |
| Metal (SPIR-V) | yes | yes (this PR) | yes (this PR) | reused from #493's heap-grow path |
| Vulkan (SPIR-V) | yes | yes (this PR) | yes (this PR) | reused from #493's heap-grow path |

The SPIR-V backends also continue to honour the max_size_compile_time fallback on the offline-cache-hit path where the symbolic SizeExpr tree was not serialised (empty size_expr.nodes); the runtime falls back to it verbatim, clamping to >= 1 like the LLVM counterpart does.

Tests

New tests

  • test_adstack_bounded_inner_loop_sized_by_structural_prepass (tests/python/test_adstack.py): pins the SPIR-V structural pre-pass correctness - a kernel whose inner range-for trip count is a constant product that the pre-pass must fold while the CFG analyzer would fall back to a pessimistic default. Arch-restricted to Metal / Vulkan because the LLVM backend routes the same kernel through the symbolic-tree path instead.
  • test_adstack_spirv_metadata_per_task_buffer (tests/python/test_adstack.py, cfg_optimization=False): pins the per-task AdStackMetadata buffer fix - with a shared buffer the second offload's host memcpy overwrites the first offload's metadata before submit, causing an Adstack overflow (offending stack_id=0) at qd.sync() even though each stack's sizer-computed bound is correct. cfg_optimization=False is load-bearing; it keeps the two sibling offloads separate so the record-then-execute race surfaces.
  • AdStackSizerShader.GateReturnsEmptyWhenRequiredCapIsMissing (tests/cpp/codegen/adstack_sizer_shader_test.cpp): pins the capability gate - dropping any one of PSB, Int64, Int8, Int16 from a synthetic caps config flips build_adstack_sizer_spirv to the empty-return path.
  • AdStackSizerShader.DumpBinary (tests/cpp/codegen/adstack_sizer_shader_test.cpp): sanity-only smoke test that dumps the compiled binary to /tmp/adstack_sizer.spv for local spirv-val / spirv-dis use.

Widened arch coverage

Existing tests that exercised the runtime-evaluated adstack on LLVM only now run on SPIR-V too (test_adstack_field_load_bounded_loop_evaluated_per_launch, test_adstack_inner_range_bounded_by_ndarray_read_at_outer_index, test_adstack_ext_tensor_read_indexed_by_stashed_outer_loop_var, test_adstack_structural_pre_pass_fuses_sub_of_max_over_range_with_matching_shape_ends, test_adstack_structural_pre_pass_fuses_sub_of_max_over_range_with_mismatched_shape_ends) by dropping the arch=[qd.cpu, ...] restriction that scoped them to the LLVM sizer path.

Known coverage gaps

No dedicated Python regression test for the scope[pending_var_id] pop-path clear. The failure mode is hardware-dependent: on Metal / MoltenVK and Vulkan-with-robustBufferAccess the spurious pre-pass out-of-bounds read is either caught silently or returns zero, both of which pending_max_accum (which starts at 0 at every MOR push) absorbs without surfacing. A test that fails deterministically would need a Vulkan device without robust buffer access or a simulated evaluator. The fix is mechanism-verified through code review and covered indirectly by the full adstack suite running through the sizer.

Full suite: pytest tests/python/test_adstack.py -n 8 -> 712 passed, 7 xfailed on macOS arm64 (CPU + Metal + Vulkan / SwiftShader); wider backend coverage runs through the Linux AMDGPU box.

Side-effect audit

| Concern | Where checked | Verdict |
| --- | --- | --- |
| Offline cache key | TaskAttributes::QD_IO_DEF(..., ad_stack) via the new AdStackSizingAttribs::QD_IO_DEF(per_thread_stride_{float,int}_compile_time, allocas) and AdStackAllocaAttribs::QD_IO_DEF(heap_kind, offset_in_elems_compile_time, max_size_compile_time, size_expr) | Safe - any schema change (field rename / reordering) invalidates the cache at the serializer level, same as every other TaskAttributes change. Pre-PR caches without the new fields are invalidated on first load and recompiled. |
| Buffer binding numbering | TaskCodegen::binding_head_ unchanged; AdStackMetadata is appended lazily via get_buffer_value(...) like every other buffer kind, so existing bindings (Args, Rets, Root, ExtArr, AdStackOverflow, AdStackHeap*) keep their bindings | Safe - the metadata buffer lands at a fresh binding slot per task. |
| Tasks with no adstack | attribs.ad_stack.allocas.empty() short-circuits both the runtime evaluation loop and the metadata buffer allocation; codegen never calls get_ad_stack_metadata_buffer() in that case, so no BufferType::AdStackMetadata entry is added to buffer_binds | Safe - forward-only kernels and reverse kernels with only const-resolved allocas are unaffected. |
| Mid-cmdlist recursion | Metadata evaluation moved to before ensure_current_cmdlist() in launch_kernel | Safe - SNodeRwAccessorsBank::read_int reader-kernel launches now go through the normal submit/sync path; no partially-built cmdlist to stomp. |
| Legacy max_size immediates | The pre-scan still computes ad_stack_heap_per_thread_stride_{float,int}_ as a compile-time sum and carries them into AdStackSizingAttribs::per_thread_stride_{float,int}_compile_time; the runtime uses those when every alloca's size_expr.nodes.empty() (pre-PR offline cache hit) | Safe - the cache-hit path reproduces the pre-PR shader layout. |
| Dominance of invoc_id * stride | get_ad_stack_heap_thread_base_{float,int} still eagerly emits at the first alloca site, just with OpLoad of the metadata stride instead of uint_immediate | Safe - the SSA id is defined in the same block that dominates every downstream push/load-top body. |
| Per-task metadata buffer lifetime | Each task record allocates a fresh AdStackMetadata buffer and retires the previous one into ctx_buffers_ | Safe - the buffer stays alive through the submit/sync window; no host memcpy can overwrite a sibling task's values before its dispatch runs. |
| Cross-stack scope_arr reuse | MOR done_lbl clears scope[pending_var_id] to zero | Safe - the next stack's outer pre-pass reads land on index 0, which is valid for any non-empty ndarray (a grammar invariant). |
| Non-ASCII characters | adstack_sizer_shader.cpp comments use ASCII arrows (->) only | Safe - no smart quotes / em-dashes / Unicode in committed artifacts. |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 86fb79c378


Comment thread quadrants/runtime/gfx/runtime.cpp Outdated
Comment thread quadrants/runtime/gfx/runtime.cpp Outdated
@duburcqa duburcqa changed the title [AutoDiff] Runtime-evaluated SizeExpr for SPIR-V adstack sizing, mirroring the LLVM path [AutoDiff] Autodiff 16: Runtime-evaluated SizeExpr for SPIR-V adstack sizing, mirroring the LLVM path Apr 22, 2026
@duburcqa duburcqa marked this pull request as draft April 22, 2026 05:16
@duburcqa duburcqa force-pushed the duburcqa/adstack_spirv_runtime_size branch from 86fb79c to e38637b Compare April 22, 2026 05:20
Comment thread quadrants/runtime/gfx/runtime.cpp Outdated
Comment thread quadrants/codegen/spirv/spirv_codegen.cpp
Comment thread quadrants/codegen/spirv/spirv_codegen.cpp
@duburcqa duburcqa changed the title [AutoDiff] Autodiff 16: Runtime-evaluated SizeExpr for SPIR-V adstack sizing, mirroring the LLVM path [AutoDiff] Autodiff 17: Runtime-evaluated SizeExpr for SPIR-V adstack sizing, mirroring the LLVM path Apr 22, 2026
@duburcqa duburcqa force-pushed the duburcqa/runtime_adstack_capacity branch from 853b28d to 7e26214 Compare April 22, 2026 22:56
@duburcqa duburcqa changed the base branch from duburcqa/runtime_adstack_capacity to duburcqa/moltenvk_sdk_source April 22, 2026 22:56
@duburcqa duburcqa changed the title [AutoDiff] Autodiff 17: Runtime-evaluated SizeExpr for SPIR-V adstack sizing, mirroring the LLVM path [AutoDiff] Autodiff 17: Runtime-evaluated SizeExpr for SPIR-V Apr 22, 2026
@duburcqa duburcqa changed the title [AutoDiff] Autodiff 17: Runtime-evaluated SizeExpr for SPIR-V [AutoDiff] Autodiff 17: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr for SPIRV Apr 22, 2026
@duburcqa duburcqa changed the title [AutoDiff] Autodiff 17: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr for SPIRV [AutoDiff] Autodiff 17: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr on SPIRV backends (Metal/Vulkan) Apr 22, 2026
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 24f73f5 to 453d0a8 Compare April 23, 2026 05:28
Base automatically changed from duburcqa/moltenvk_sdk_source to duburcqa/runtime_adstack_capacity April 23, 2026 05:28
@duburcqa duburcqa force-pushed the duburcqa/adstack_spirv_runtime_size branch from e38637b to 16acc6b Compare April 23, 2026 05:28
@duburcqa duburcqa force-pushed the duburcqa/runtime_adstack_capacity branch from bc3bf3f to 453d0a8 Compare April 23, 2026 05:31
@duburcqa duburcqa changed the base branch from duburcqa/runtime_adstack_capacity to duburcqa/adstack_llvm_runtime_size April 23, 2026 05:34
@duburcqa duburcqa force-pushed the duburcqa/adstack_spirv_runtime_size branch from 16acc6b to 6d6b0d1 Compare April 23, 2026 05:43
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from bc3bf3f to b757b5a Compare April 23, 2026 05:43
@duburcqa duburcqa force-pushed the duburcqa/adstack_spirv_runtime_size branch from 6d6b0d1 to bd11a8a Compare April 23, 2026 05:54
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from b757b5a to 4b1b402 Compare April 23, 2026 05:54
@duburcqa duburcqa force-pushed the duburcqa/adstack_spirv_runtime_size branch from bd11a8a to 08789d5 Compare April 23, 2026 06:07
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch 2 times, most recently from babc027 to 5182192 Compare April 23, 2026 06:12
@duburcqa duburcqa force-pushed the duburcqa/adstack_spirv_runtime_size branch from 08789d5 to 87a2224 Compare April 23, 2026 06:12
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from 5182192 to f5afee0 Compare April 23, 2026 07:57
@duburcqa duburcqa force-pushed the duburcqa/adstack_spirv_runtime_size branch from 87a2224 to deb7004 Compare April 23, 2026 07:57
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from f5afee0 to b472593 Compare April 23, 2026 07:59
@duburcqa duburcqa force-pushed the duburcqa/adstack_spirv_runtime_size branch 3 times, most recently from 5b3a7ed to 1c28866 Compare April 23, 2026 09:33
duburcqa added 10 commits April 24, 2026 23:36
… the AllocaStmt site so they dominate every sibling push / load-top / acc-adjoint block
…lag, symmetric d2h grad gate, IRBuilder::cast skips OpBitcast(T,T), compute_dense_snode_strides refuses multi-child parents, kMaxNodes host-side guard, restore mismatched-lengths fusion regression, scrub remaining external-project references
…me/gfx/adstack_sizer_launch.cpp so runtime.cpp stays focused on the main-kernel record/submit flow
…reter switch as a no-op so -Wswitch stays silent; the LLVM encoder always host-folds FieldLoad leaves, so this arm is unreachable but keeps the enum coverage exhaustive
…iness and widen Supported loop shapes table from the test suite
…k bytecode offsets up to 256B for VUID-02999-safe storage-buffer descriptor bindings, bump kMaxVars to kAdStackSizeExprDeviceMaxBoundVars (32) and expose kAdStackSizerMaxPendingFrames so the host encoder hard-errors before the shader's fixed-size pending-frame stack can OOB, clamp pending_end_arr to begin+(1<<24) so MaxOverRange pathological ranges silently break matching the LLVM interpreter and host evaluator
…AD | WRITE) so a kernel that only non-atomic-writes .grad does not leave allocator garbage on device for d2h to blit back, and update kExternalTensorRead comment in adstack_size_expr_device.h to match the (idx, stride) pair layout the encoder actually emits
… alongside the SPIR-V-scope tests: parametrize the mismatched-lengths test with a peak-past-shape-b case that exercises the cross-ndarray independent-loops fusion path, and add a fld[i]-arr[i] test that guards against the fusion synthesizing a mixed FieldLoad+ExternalTensorRead body the LLVM-GPU encoder cannot lower
…ack sizer dispatch via track_physical_buffer, mirroring what the main-kernel dispatch already does at runtime.cpp's pre-dispatch block; without the useResource: hint the Apple7 GPU family (M1) returns zero for the sizer shader's PSB load of the ndarray pointer through the kernel arg buffer and every MaxOverRange-over-ExternalTensorRead collapses to zero, tripping an Adstack overflow at the next qd.sync(); non-Metal backends no-op the base track_physical_buffer and pay nothing
Comment thread docs/source/user_guide/autodiff.md Outdated
**Problem.** Reverse-mode AD through a dynamic loop (one whose trip count is not known at compile time) needs to recover the primal value at each iteration when walking the loop backwards. Without that, the chain-rule steps read a stale value and the gradients come out silently wrong. Static-unrolled (`qd.static(range(...))`) loops are not affected because every iteration becomes its own inlined block at compile time.

**How Quadrants does it.** Quadrants provides a dedicated compiler pipeline for this, called the *adstack* (short for "(a)uto(d)iff (stack)"). It allocates a per-variable stack alongside each primal that is updated inside the loop. The forward pass pushes an entry each iteration; the reverse pass pops them back off in reverse order to recover the correct primal for every chain-rule step. adstack is opt-in because it costs extra per-thread memory and compile time, and because most kernels do not need it. Running with adstack enabled when it is not strictly needed is safe. Running without it when it is needed raises a `QuadrantsCompilationError` in most cases: the autodiff pass rejects a non-static range that would otherwise lose its primal. A few edge-case loop shapes still slip past that rejection and produce silently-wrong gradients; these are tracked and fixed in the autodiff pass as they surface, so if you see wrong-but-non-zero gradients through a dynamic loop with adstack disabled, turn it on and rerun as a sanity check.
**How Quadrants does it.** Quadrants provides a dedicated compiler pipeline for this, called the *adstack* (short for "(a)uto(d)iff (stack)"). It allocates a per-variable stack alongside each primal that is updated inside the loop. The forward pass pushes one entry per iteration. The reverse pass walks the stack from top down: at each reverse iteration it reads the current top entry as many times as that iteration's chain-rule contributions need (one `read` per downstream use of the primal), then pops the entry once and steps to the iteration underneath. Peeking-and-reusing rather than popping per use matters because the same primal often feeds several chain-rule terms in one iteration - e.g. a `v` that appears in both a `sin(v)` and a subsequent `v * w` needs to be visible to both adjoint terms before it is discarded. Enabling adstack costs extra per-thread memory and compile time, but some kernels need it.
Collaborator

"from top down: at each reverse " => "from top down. At each reverse "

Collaborator

"(one read per downstream use of the primal)"

=> move to a new sentence after the current one, (so we don't need to pusht he current sentence onto the stack whilst reading, then read the parentheses, then pop the ucrrent setnecne back off the stack in our mind, which makes things harder to read I feel)

Collaborator

"Peeking-and-reusing rather than popping per use matters because the same primal often feeds several chain-rule terms in one iteration - e.g. a v that appears in both a sin(v) and a subsequent v * w needs to be visible to both adjoint terms before it is discarded. " => This seems like an implementation detail, not needed by the end-user? => "under the hood" style section

Collaborator

"The flag is compile-time, so it must be set before the offending kernel compiles. " => this seems superfluous? Remove, or move to 'under the hood' (you are justifying your impelmetnation design I feel, rather than telling the user what they need to know in order to use the feature).

Comment thread docs/source/user_guide/autodiff.md Outdated
**How Quadrants does it.** Quadrants provides a dedicated compiler pipeline for this, called the *adstack* (short for "(a)uto(d)iff (stack)"). It allocates a per-variable stack alongside each primal that is updated inside the loop. The forward pass pushes one entry per iteration. The reverse pass walks the stack from top down: at each reverse iteration it reads the current top entry as many times as that iteration's chain-rule contributions need (one `read` per downstream use of the primal), then pops the entry once and steps to the iteration underneath. Peeking-and-reusing rather than popping per use matters because the same primal often feeds several chain-rule terms in one iteration - e.g. a `v` that appears in both a `sin(v)` and a subsequent `v * w` needs to be visible to both adjoint terms before it is discarded. Enabling adstack costs extra per-thread memory and compile time, but some kernels need it.

**Workflow.** Enable the pipeline at init time and keep using the normal reverse-mode workflow: `qd.init(..., ad_stack_experimental_enabled=True)`. The flag is compile-time, so it must be set before the offending kernel compiles.
**Workflow.** Enable the pipeline at init time and keep using the normal reverse-mode workflow: `qd.init(..., ad_stack_experimental_enabled=True)`. The flag is compile-time, so it must be set before the offending kernel compiles. Any mainstream modern GPU works on the SPIR-V side (Apple Silicon for Metal, any recent discrete GPU for Vulkan); on the odd legacy device that is missing the hardware features the sizer relies on, the launch errors out with a clear message and you can fall back to the LLVM runtime (CPU / CUDA / AMDGPU).
Collaborator

"Any mainstream modern GPU works on the SPIR-V side (Apple Silicon for Metal, any recent discrete GPU for Vulkan); on the odd legacy device that is missing the hardware features the sizer relies on, the launch errors out with a clear message and you can fall back to the LLVM runtime (CPU / CUDA / AMDGPU)." => this seems like details best left for later? (maybe some 'compatibilty' section or simlar?)

Comment thread docs/source/user_guide/autodiff.md Outdated
**Workflow.** Enable the pipeline at init time and keep using the normal reverse-mode workflow: `qd.init(..., ad_stack_experimental_enabled=True)`. The flag is compile-time, so it must be set before the offending kernel compiles.
**Workflow.** Enable the pipeline at init time and keep using the normal reverse-mode workflow: `qd.init(..., ad_stack_experimental_enabled=True)`. The flag is compile-time, so it must be set before the offending kernel compiles. Any mainstream modern GPU works on the SPIR-V side (Apple Silicon for Metal, any recent discrete GPU for Vulkan); on the odd legacy device that is missing the hardware features the sizer relies on, the launch errors out with a clear message and you can fall back to the LLVM runtime (CPU / CUDA / AMDGPU).

**Note.** Running with adstack enabled when it is not strictly needed is safe, but not the other way around. Running without it when it is needed raises a `QuadrantsCompilationError` in most cases: the autodiff pass rejects a non-static range that would otherwise lose its primal. A few edge-case loop shapes still slip past that rejection and produce silently-wrong gradients; these are tracked and fixed in the autodiff pass as they surface, so if you see wrong-but-non-zero gradients through a dynamic loop with adstack disabled, turn it on and rerun as a sanity check.
Collaborator

"A few edge-case loop shapes still slip past that rejection and produce silently-wrong gradients; these are tracked and fixed in the autodiff pass as they surface" => does this mean the gradients will in fact be correct? or will be silently wrong?

Collaborator

"if you see wrong-but-non-zero gradients through a dynamic loop " => how to detect this?

Contributor Author

@duburcqa duburcqa Apr 25, 2026

how to detect this

There is no way to detect this I'm afraid. The options are running some external finite-difference check, or simply turning it on and rerunning as a sanity check. We can just say: "Checking that the value of the gradient is unaffected when turning adstack on is the best way to determine whether adstack is necessary."

Collaborator

"Checking that the value of the gradient is unaffected when turning adstack on is the best way to determine whether adstack is necessary." Yup, if that's what we have, let's run with that. Maybe loosen "the best way" to "is a reasonable way"


**Note.** Running with adstack enabled when it is not strictly needed is safe, but not the other way around. Running without it when it is needed raises a `QuadrantsCompilationError` in most cases: the autodiff pass rejects a non-static range that would otherwise lose its primal. A few edge-case loop shapes still slip past that rejection and produce silently-wrong gradients; these are tracked and fixed in the autodiff pass as they surface, so if you see wrong-but-non-zero gradients through a dynamic loop with adstack disabled, turn it on and rerun as a sanity check.

Reverse-mode AD walks the forward kernel in reverse and applies the chain rule at every op. The chain-rule factor at each op is that op's derivative with respect to its input. For *non-linear* ops (`sin`, `cos`, `exp`, `sqrt`, `tanh`, `pow`, ...) that derivative depends on the input's primal, so the reverse pass needs the primal value that was there on the forward pass. For *linear* ops (addition, subtraction, multiplication by a constant) the derivative is itself a constant and no primal is needed. In a dynamic loop the forward pass writes a different primal at each iteration, so the reverse pass cannot simply re-read the latest value - it needs one per iteration. adstack provides exactly that: a per-iteration stash of the primal.
Collaborator

Nice and clear to me 🙌

Comment thread docs/source/user_guide/autodiff.md Outdated
### Supported loop shapes

*You do not need to read this section to use reverse-mode AD. If a kernel exceeds its adstack capacity, Quadrants raises a Python exception at the next `qd.sync()` whose message recommends bumping `default_ad_stack_size` - that is usually enough. Read on only if you hit that overflow, want to understand why, or want to cap the memory footprint explicitly.*
The compiler recognises a set of bound shapes:
Collaborator

I think we should just delete this section, or move it to some highly advanced section.

Users will just expect stuff to work I feel, and anything else is a bug.

What's an example of something that does NOT work currently?

Contributor Author

@duburcqa duburcqa Apr 25, 2026

What's an example of something that does NOT work currently?

Actually I would prefer to keep it. We only support a small subset of all the craziness a user may want to do. Here are examples that are not supported:

@qd.kernel
def my_kernel(a):
    for i in range(sqrt(a)):  # any non-linear transformation, plus min/max
        ...

@qd.kernel
def my_kernel(a, b):
    for i in range(a):
        for j in range(b):
            for k in range(a[i], b[j]):  # non-matching running variables + indices
                ...

@qd.kernel
def my_kernel(a):
    for i in range(a.shape[0]):
        while a[i] < 10:  # bound that cannot be determined without running the kernel
            a[i] = a[i] + 1

Collaborator

Alright, so move to a reference section. And state something like "Note that whilst we support many common loop constructs, many constructs which are allowed in the absence of adstack are not currently handled. See Appendix A for supported loop constructs."

Comment thread docs/source/user_guide/autodiff.md Outdated
The compiler recognises a set of bound shapes:

**Tuning.** Two `qd.init()` knobs control adstack sizing, both measured in slots per adstack (not bytes):
| Bound shape | Example |
Collaborator

this table seems like something to go in a reference section.

"The compiler recognises a set of bound shapes (see 'appendix 123' for the authoritative list). Any loop shape outside this set is rejected at compile time. The error names the offending source line; typical fixes are to restructure the loop into one of the shapes in the appendix, or to file a bug."

(but see above comment that we should just delete this section altogether, or move the entire section to a highly advanced section)

Contributor Author

@duburcqa duburcqa Apr 25, 2026

Where would you like to put this appendix? Another markdown page?

Collaborator

At the end of the same markdown page.

Comment thread docs/source/user_guide/autodiff.md Outdated
### Under the hood (advanced)

**Sizing rule.** A `K`-iteration dynamic loop consumes `K + 2` slots in each of its adstacks: one slot per forward iteration, plus two setup slots (one for the initial adjoint, one for the primal's starting value). `default_ad_stack_size` is a per-stack slot count, so size it at the worst-case trip count of the deepest unprovable dynamic loop in the program plus 2.
*This section is a glimpse of how the adstack is sized and laid out in memory. You never need to read it to use reverse-mode AD - skip it unless you are curious, debugging an overflow or OOM, or planning to extend the list of supported loop shapes.*
Collaborator

"You never need to read it to use reverse-mode AD " => "never" seems a very strong claim. I'm pretty skeptical of this claim given the constraints on memory, loop structure etc.

I prefer the earlier phrasing. I don't remember what that was.

Oh here we go:

"You do not need to read this section to use reverse-mode AD. Skip past it unless you hit an overflow error on SPIR-V, an out-of-memory error on GPU, or a compile error pointing at an adstack alloca."

so:

  • remove the first sentence, superfluous
  • remove 'never'
  • break second sentence into two
  • basically just use the original paragraph I pasted above :)

Comment thread docs/source/user_guide/autodiff.md Outdated
| `num_buffers` | Number of adstacks the kernel allocates - one per loop-carried variable plus one per dependent branch flag (see [One adstack per variable](#one-adstack-per-variable)). |

On SPIR-V backends (Metal / Vulkan) the slot layout is trimmed: adstacks whose `T` is an integer type (`i32`, `i64`, ...) only store the primal because the reverse pass does not accumulate integer adjoints, and per-thread on-chip memory is more constrained than on LLVM. So `bytes_per_slot = sizeof(T)` for integer `T` and `bytes_per_slot = 2 * sizeof(T)` for floating-point `T`. SPIR-V has no defined layout for `OpTypeBool`, so booleans are widened to i32 at storage time:
Every adstack slot always stores a *primal* value - the forward-pass value the reverse pass pops to recover the chain-rule step. Floating-point adstacks additionally store an *adjoint* slot where the reverse pass accumulates chain-rule contributions. Integer / boolean adstacks do not need an adjoint slot, but LLVM backends still carry one for codegen uniformity. SPIR-V backends trim it. SPIR-V also widens `bool` to `i32` at storage time, because SPIR-V has no defined layout for `OpTypeBool`.
Collaborator

"but LLVM backends still carry one for codegen uniformity. SPIR-V backends trim it. SPIR-V also widens bool to i32 at storage time, because SPIR-V has no defined layout for OpTypeBool." => I would break this platform-specific stuff into a seprate paragraph somehow.

"Platform-specific notes:

  • even though integer/boolean adstacks do not need an adjoint slot, LLVM backends still carry one for codegen uniformity
  • SPIR-V stores bools using 4 bytes (32 bits)"


On SPIR-V backends (Metal / Vulkan) the slot layout is trimmed: adstacks whose `T` is an integer type (`i32`, `i64`, ...) only store the primal because the reverse pass does not accumulate integer adjoints, and per-thread on-chip memory is more constrained than on LLVM. So `bytes_per_slot = sizeof(T)` for integer `T` and `bytes_per_slot = 2 * sizeof(T)` for floating-point `T`. SPIR-V has no defined layout for `OpTypeBool`, so booleans are widened to i32 at storage time:
Every adstack slot always stores a *primal* value - the forward-pass value the reverse pass pops to recover the chain-rule step. Floating-point adstacks additionally store an *adjoint* slot where the reverse pass accumulates chain-rule contributions. Integer / boolean adstacks do not need an adjoint slot, but LLVM backends still carry one for codegen uniformity. SPIR-V backends trim it. SPIR-V also widens `bool` to `i32` at storage time, because SPIR-V has no defined layout for `OpTypeBool`.

Collaborator

@hughperkins hughperkins Apr 25, 2026

"Therefore some examples of the size of adstack slots on each platform are as follows:"

Comment thread docs/source/user_guide/autodiff.md Outdated
- Reverse-mode AD does not propagate gradients through integer casts or non-real operations. No error is raised; the gradient simply stops at the cast and silently reads as zero upstream. Cast to `qd.f32` / `qd.f64` before the differentiable section.
- Backward passes on non-trivial kernels run noticeably slower than the corresponding forward pass, sometimes by an order of magnitude on SPIR-V.
- **Loop bounds read from a writable ndarray are unsafe.** If a reverse-mode kernel has a loop whose iteration count comes from `n[j]`, the sizer snapshots `n[j]` at backward-dispatch entry. Any scenario that makes that snapshot diverge from the value the kernel body ends up executing against leaves the adstack undersized, and the gradient silently comes out zero or wrong. Two patterns to avoid: (a) the same kernel writes `n[j]` before the loop reads it (`for i in range(n[j])` after `n[j] = something`); (b) across kernels, where `kernel_A` writes `n` and `kernel_B.grad()` reads it as a loop bound - the per-launch numpy-ndarray upload can overwrite the device copy with host-side state that differs from what `kernel_A` left behind, and for `qd.ndarray` a missing host-side barrier between the two launches has the same effect. Quadrants does not detect either pattern. Workaround: treat ndarrays the reverse-mode kernel reads as loop bounds as read-only within and across the relevant kernel chain - compute iteration counts once, store them in a dedicated scalar field or ndarray that is never written during the backward path, and iterate over that.
- **When a dynamic-loop reverse kernel fails.** In normal use, enabling the adstack pipeline and running a reverse-mode kernel through a dynamic loop should just work. Two rare situations can make it fail, and it is worth knowing which one you are in:
Collaborator

"In normal use, enabling the adstack pipeline and running a reverse-mode kernel through a dynamic loop should just work. Two rare situations can make it fail, and it is worth knowing which one you are in:" => this si superfluous I feel.

Collaborator

Are these really limitations? It seems like saying "a limitation of numpy tensors is that they cannot be larger than available physical memory."? 🤔

Collaborator

These should probably go in some "Gotchas" or "What can go wrong" or similar kind of section. I'm not sure they are limitations?

Contributor Author

Well, the thing is that it can eat memory pretty fast, and what is being allocated and why is not immediately obvious. When you allocate a gigantic numpy array you will notice it, here it is hidden.

Collaborator

I would argue that a 'limitation' is some issue with our implementation, rather than a fundamental characteristic of the underlying maths etc.

As suggested, let's have a section on things that can go wrong, that users can consult.

Analogy:

doc 1:
"Limitations:

  • this toaster does not work when it is not plugged into the power"

doc 2:
"Why doesn't my toaster work?

  • check: is it plugged into the power?"

Comment thread docs/source/user_guide/autodiff.md Outdated
- split the inner section into its own `@qd.kernel`;
- pass `ad_stack_size=N` to `qd.init()` with `N` large enough to cover the real push count (bypasses the sizer; see the sketch after this list).
- *The kernel is legitimately too deep for the hardware* - surfaces as an out-of-memory error from the allocator, before the kernel even starts running. A reverse pass through many loop-carried variables at a large ndrange can ask the runtime for more adstack memory than the device can physically back, even when the sizer's number is correct. Remedies are the ones listed under *Avoiding OOM on GPU* above: fewer loop-carried variables, a smaller ndrange, manual checkpointing, or more device-memory headroom.
- **Compile time scales with loop nesting.** The adstack pipeline trades compile time for generality. Kernels with many loop-carried variables, nested dynamic loops, or large inner-loop bodies produce visibly slow compile times - seconds stretching into minutes, and on SPIR-V backends sometimes into the territory where the driver's shader compiler gives up. Budget compile time accordingly when migrating existing reverse-mode AD workloads.
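The `ad_stack_size=N` bypass in the list above is a single flag at init time; `4096` here is an arbitrary example value, not a recommendation:

import quadrants as qd

# Fixed-depth bypass: every adstack gets N slots and the sizer is skipped.
# The backing memory is per-thread, so on GPU it scales with the dispatched ndrange.
qd.init(arch=qd.cuda, ad_stack_size=4096)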
Collaborator

is this a limitation? Seems more like some kind of performance characteristic?

Contributor Author

Where do you want to mention this kind of information if not in limitations?

Collaborator

"Performance characteristics"?

"Performance"?

Comment thread docs/source/user_guide/autodiff.md Outdated
- **Backward passes are slower than forward passes.** In particular, SPIR-V backward passes can be an order of magnitude slower than the forward pass.
Collaborator

this isn't really a limitation, I feel. In general, in the convnet world at least, backward passes are mathematically twice as long as forward passes. That's not really a limitation; it just comes out of the maths.

Maybe just limit to "SPIR-V backward passes can be an order of magnitude slower than the forward pass" (maybe say why?)

Comment thread docs/source/user_guide/autodiff.md Outdated
- **Integer casts stop gradients.** Reverse-mode AD does not propagate gradients through integer casts or non-real operations: integers have no meaningful derivative, so the gradient reads as zero upstream of the cast - no error, just silently-zero gradients. Rules of thumb:
Collaborator

is this really a limitation? This is just a characteristic of differentiation, I think? You cannot pass a gradient through a discretization step (at least, not without using approximations, such as Gumbel etc.).
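A minimal sketch of the silently-zero gradient under discussion (assuming a `qd.cast` spelling analogous to the usual cast API; the field names are made up):

import quadrants as qd

qd.init(arch=qd.cpu)

x = qd.field(qd.f32, shape=(), needs_grad=True)
y = qd.field(qd.f32, shape=(), needs_grad=True)

@qd.kernel
def round_trip():
    # Discretizing through i32 severs the chain rule: no error is raised,
    # the gradient upstream of the cast just reads as zero.
    y[None] = qd.cast(qd.cast(x[None] * 3.0, qd.i32), qd.f32)

x[None] = 1.25
round_trip()
y.grad[None] = 1.0
round_trip.grad()
print(x.grad[None])  # 0.0 - the int cast stopped the gradient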

Comment thread docs/source/user_guide/autodiff.md Outdated
- keep differentiable variables in `qd.f32` / `qd.f64` through the full forward chain;
- casting *to* a float is safe - the downstream float section remains differentiable, and the cast itself contributes a unit factor to the chain rule;
- casting *back to* an integer stops the gradient at that point, so only do it at integer-indexing sites, after any arithmetic whose gradient you need.
- **Loop bounds that read an ndarray written to by the same kernel are unsafe.** If a reverse-mode kernel has a loop whose iteration count comes from `n[j]` and the same kernel also writes to `n[j]` before that loop runs (`for i in range(n[j])` after `n[j] = something`), the computed gradient may silently come out zero or wrong. Quadrants does not currently detect this. Workaround: put iteration counts in a separate ndarray (or scalar field) the kernel only reads, or split the write and the differentiable loop into two `@qd.kernel` calls. The same caveat applies across kernels - a kernel that writes an ndarray read as a loop bound by a later `.grad()` call needs the host-side contents of that ndarray to already hold the value the backward kernel will execute against.
Collaborator

it's not clear to me from skim-reading whether this is a limitation of our implementation, or just a fundamental characteristic that falls out of the maths. I'm guessing the former. Can you clarify in reply to this comment please.

Contributor Author

I don't know exactly. What I know for sure is that I found it very unexpected when I discovered this limitation and that it is not something uncommon. We need this in Genesis typically. I would be surprised if such a limitation exists in Torch, otherwise more people would compute completely wrong gradients x) The biggest issue is that it fails silently.
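To pin down pattern (a) and the documented workaround, a hedged sketch (the `qd.types.ndarray(...)` kernel-argument spelling is assumed to mirror the usual API; all names are made up):

import quadrants as qd

qd.init(arch=qd.cpu, ad_stack_experimental_enabled=True)

x = qd.field(qd.f32, shape=4, needs_grad=True)
y = qd.field(qd.f32, shape=(), needs_grad=True)

# UNSAFE (pattern a): the kernel writes n[j], then reads it as a loop bound.
# The sizer snapshots n[j] at backward-dispatch entry, which can disagree
# with what the forward body actually iterated: wrong gradients, no error.
@qd.kernel
def unsafe(n: qd.types.ndarray(dtype=qd.i32, ndim=1)):
    for j in x:
        n[j] = j + 1                  # write the bound...
        v = x[j]
        for _ in range(n[j]):         # ...then read it in the same kernel
            v = v * 0.5 + 0.1
        y[None] += v

# Workaround: trip counts live in storage the backward path never writes.
trip = qd.field(qd.i32, shape=4)

@qd.kernel
def set_counts():
    for j in trip:
        trip[j] = j + 1

@qd.kernel
def safe():
    for j in x:
        v = x[j]
        for _ in range(trip[j]):      # read-only across the kernel chain
            v = v * 0.5 + 0.1
        y[None] += v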

…section: split "Known limitations" into "What can go wrong" + "Performance characteristics", move the supported-loop-shapes table to Appendix A with a scoped title, relocate "Integer casts stop gradients" to the main reverse-mode section as a property of differentiation, rewrite the mutated-bound bullet against the verified sizer-reads-twice mechanism with a Torch contrast, strip OpTypeBool/alloca/compile-time-synthesized jargon, anchor SPIR-V GPU compatibility to late-2018 hardware; add test_adstack_sizer_trip_count_qd_ndarray_mutated_by_separate_kernel as a strict xfail pinning the cross-kernel pattern verified failing on cpu/metal/vulkan
Comment on lines +330 to +352
int32_t compute_max_mor_nesting(const SerializedSizeExpr &expr) {
std::vector<int32_t> depth(expr.nodes.size(), -1);
std::function<int32_t(int32_t)> visit = [&](int32_t i) -> int32_t {
if (i < 0 || static_cast<std::size_t>(i) >= expr.nodes.size())
return 0;
if (depth[i] >= 0)
return depth[i];
const auto &n = expr.nodes[i];
int32_t child_max = 0;
for (int32_t c : {n.operand_a, n.operand_b, n.body_node_idx}) {
if (c >= 0)
child_max = std::max(child_max, visit(c));
}
int32_t self = static_cast<SizeExpr::Kind>(n.kind) == SizeExpr::Kind::MaxOverRange ? 1 : 0;
depth[i] = self + child_max;
return depth[i];
};
int32_t max_depth = 0;
for (std::size_t i = 0; i < expr.nodes.size(); ++i) {
max_depth = std::max(max_depth, visit(static_cast<int32_t>(i)));
}
return max_depth;
}

🟡 compute_max_mor_nesting at adstack_size_expr_eval.cpp:330-352 over-counts MOR depth by walking operand_a and operand_b as if they accumulate alongside body_node_idx, but in case_mor of the SPIR-V sizer shader (adstack_sizer_shader.cpp:706-767) begin and end subtrees evaluate fully BEFORE the outer MOR pushes its body frame, so they only contribute transient depth. The correct recurrence is max(visit(operand_a), visit(operand_b), self + visit(body_node_idx)); the current self + max(...) is sound but causes spurious QD_ERROR rejections at the kAdStackSizerMaxPendingFrames=16 cap for kernels whose true peak fits but whose begin/end MOR chains push the apparent depth over. Nit severity: over-conservative bound, not a memory-safety issue.

Extended reasoning...

What the bug is

compute_max_mor_nesting at quadrants/program/adstack_size_expr_eval.cpp:330-352 walks the post-order tree and, for any MaxOverRange node, returns:

int32_t self = static_cast<SizeExpr::Kind>(n.kind) == SizeExpr::Kind::MaxOverRange ? 1 : 0;
depth[i] = self + child_max;  // child_max = max(visit(op_a), visit(op_b), visit(body))

i.e. self + max(visit(operand_a), visit(operand_b), visit(body_node_idx)). The encoder hard-errors at encode_bytecode_common when this depth exceeds spirv::kAdStackSizerMaxPendingFrames = 16 (adstack_size_expr_eval.cpp:711-716).

Why the recurrence does not match the runtime

The SPIR-V sizer shader processes nodes iteratively in post-order. In case_mor (quadrants/codegen/spirv/adstack_sizer_shader.cpp:706-767):

  • Lines 709-710 read begin_i64 = load_values_at(op_a_i32) and end_i64 = load_values_at(op_b_i32) -- already-cached values from values_arr[]. Because the bytecode is post-order, the op_a and op_b subtrees were fully processed BEFORE control reached this MOR node: any nested MaxOverRange inside them executed its own case_mor (push), walked its body, and hit done_lbl (pop), restoring sp to whatever it was before that subtree began.
  • Lines 743-756 push exactly ONE pending frame for this MOR (the parent), then redirect current to body_start. The body chain runs under that pushed frame.

So during evaluation of operand_a / operand_b, the outer MOR is NOT yet on the pending stack. Only the body chain runs with the outer frame pushed. The actual peak pending-frame depth is:

peak(MOR) = max(peak(operand_a), peak(operand_b), 1 + peak(body))

NOT 1 + max(peak(operand_a), peak(operand_b), peak(body)) as the helper computes.

Step-by-step counterexample

Take MOR_outer(operand_a = MOR(MOR(MOR(plain))), operand_b = plain, body = plain):

  1. Linear walk processes the 3-deep nested MOR chain on the operand_a side first. Each nested MOR runs case_mor: push frame (sp += 1), walk body, hit done_lbl (pop, sp -= 1). The chain reaches a transient peak of sp = 3 while inside the innermost body, then unwinds back to sp = 0.
  2. After op_a finishes, op_b (a plain leaf) writes to values[] without touching the pending stack. sp stays at 0.
  3. Control reaches MOR_outer's case_mor. It pushes one frame (sp 0->1) and walks its body (a plain leaf), then pops (sp 1->0).

True peak across the whole evaluation = 3 (attained inside the begin chain, with MOR_outer not yet on the stack).

compute_max_mor_nesting(MOR_outer) returns: 1 + max(visit(begin), visit(end), visit(body)) = 1 + max(3, 0, 0) = 4.

The over-count is exactly +1 per outer MOR whose max(begin_depth, end_depth) > 1 + body_depth. With kAdStackSizerMaxPendingFrames = 16, a kernel whose true peak depth is 16 but whose begin/end MOR chains accumulate to 16+ via this recurrence is rejected at host encoding time even though it would run correctly.

Why existing code does not catch it

The cap check is the only consumer of compute_max_mor_nesting. It treats the helper's output as a worst-case at-runtime depth and rejects trees above 16. There is no runtime cross-check against the actual peak sp in the shader, so the discrepancy surfaces only as a false-positive compile error when the helper's result crosses 16 but the real peak does not. Since the inequality is always helper >= true_peak, the cap of 16 still safely prevents OOB on the shader-side pending_*_arr[16].

Impact

Severity is nit because the over-count is sound -- it only ever rejects kernels that would actually run correctly. Realistic reverse-mode kernels have shallow MOR nesting (verifiers note <=4 distinct bound vars, ~10 tree depth observed); the trigger requires deep MOR chains (>1 nesting) inside begin/end of an outer MOR, which is uncommon today but can compound through nested wrapping in larger kernels. No memory-safety hazard, no UB, no under-counting. The only user-visible effect is a spurious MaxOverRange nesting depth ... exceeds the sizer shader's ... compile error on a kernel that would otherwise fit.

Fix

Change at lines 339-344:

const auto &n = expr.nodes[i];
int32_t a_depth = (n.operand_a >= 0) ? visit(n.operand_a) : 0;
int32_t b_depth = (n.operand_b >= 0) ? visit(n.operand_b) : 0;
int32_t body_depth = (n.body_node_idx >= 0) ? visit(n.body_node_idx) : 0;
int32_t self = static_cast<SizeExpr::Kind>(n.kind) == SizeExpr::Kind::MaxOverRange ? 1 : 0;
depth[i] = std::max({a_depth, b_depth, self + body_depth});

This matches the runtime peak exactly: begin and end subtrees contribute their own peaks (taken individually since they are sequential, not stacked), and the body contributes the outer MOR's pushed frame plus the body chain's own depth. The encoder's hard-error guard then fires only when the actual runtime peak would exceed 16.
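The two recurrences are easy to compare outside the compiler. A Python re-statement of the reasoning above (`Node` is a hypothetical stand-in for `SerializedSizeExpr` nodes, not real code):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    is_mor: bool                   # is this a MaxOverRange node?
    a: Optional["Node"] = None     # begin subtree (operand_a)
    b: Optional["Node"] = None     # end subtree (operand_b)
    body: Optional["Node"] = None  # body subtree (body_node_idx)

def depth_current(n):
    # The helper's recurrence: self + max over ALL children.
    if n is None:
        return 0
    child = max(depth_current(n.a), depth_current(n.b), depth_current(n.body))
    return (1 if n.is_mor else 0) + child

def depth_runtime(n):
    # The shader's actual peak: begin/end unwind before the outer push,
    # so only the body runs under this node's pending frame.
    if n is None:
        return 0
    self_ = 1 if n.is_mor else 0
    return max(depth_runtime(n.a), depth_runtime(n.b), self_ + depth_runtime(n.body))

# Counterexample from above: MOR_outer(a=MOR(MOR(MOR(leaf))), b=leaf, body=leaf)
leaf = Node(False)
chain = Node(True, body=Node(True, body=Node(True, body=leaf)))
outer = Node(True, a=chain, b=leaf, body=leaf)
print(depth_current(outer), depth_runtime(outer))  # 4 3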

Comment on lines +160 to +163
if (autodiff_mode == AutodiffMode::kReverse && ad_use_stack) {
irpass::determine_ad_stack_size(ir, config);
print("Autodiff stack size determined");
}

🔴 scalarize drops AdStackAllocaStmt::size_expr (and stack_id) when lowering tensor-typed adstacks to scalar ones. Because this PR moves determine_ad_stack_size from offload_to_executable into compile_to_offloads (lines 160-163), the pre-pass now runs BEFORE scalarize and populates size_expr/max_size=1 on the tensor adstack; scalarize.cpp:719 then creates fresh scalar AdStackAllocaStmts via make_unique(element_type, stmt->max_size) which only copies max_size, leaving size_expr=nullptr on every scalar child. Any reverse-mode kernel with a Vector/Matrix loop-carried variable inside a field- or ndarray-bounded dynamic loop overflows on the first push and surfaces as 'Adstack overflow' at qd.sync(). Fix is one line in scalarize.cpp:719+: copy stmt->size_expr (a shared_ptr; safe to share since SizeExpr is documented immutable post-pre-pass) into each scalar child.

Extended reasoning...

What the bug is

The PR moves determine_ad_stack_size from offload_to_executable (after scalarize) to the end of compile_to_offloads at lines 160-163 (before scalarize). auto_diff.cpp:752 creates tensor-typed AdStackAllocaStmts for Vector/Matrix loop-carried variables (dtype = alloc->ret_type.ptr_removed()). The new pre-pass position means determine_ad_stack_size now sees these tensor-typed adstacks and, for non-const symbolic bounds, populates size_expr and seeds max_size = 1 as a placeholder (determine_ad_stack_size.cpp:1342-1344).

Then scalarize.cpp:712-728 runs (gated on config.real_matrix_scalarize, default true) and creates fresh scalar adstacks:

auto scalar_ad_stack = std::make_unique<AdStackAllocaStmt>(element_type, stmt->max_size);

The constructor at statements.h:1611 only initializes dt and max_size; size_expr (line 1606) and stack_id (line 1609) keep their default values of nullptr and -1. Each scalar child carries max_size=1 and size_expr=nullptr, then the original tensor adstack is erased (line 727).

Step-by-step proof

Concrete trigger:

@qd.kernel
def k():
    for i in x:
        v = qd.Vector([0.0, 0.0, 0.0])
        for _ in range(n[None]):  # field-bounded dynamic loop
            v = qd.sin(v) + 0.1
        y[i] = v
  1. auto_diff.cpp:752 makes AdStackAllocaStmt(TensorType<3 x f32>, 0) for v.
  2. Bellman-Ford leaves max_size=0 (positive loop).
  3. determine_ad_stack_size (now at compile_to_offloads.cpp:161) walks pushes, builds size_expr = MaxOverRange(... FieldLoad(n)), sets max_size=1 placeholder per the comment at determine_ad_stack_size.cpp:1338-1340.
4. scalarize runs in offload_to_executable at line 302. visit(AdStackAllocaStmt) creates 3 scalar adstacks with make_unique<AdStackAllocaStmt>(element_type, 1), leaving size_expr=nullptr on each.
  5. Codegen pre-scan in init_offloaded_task_function (codegen_llvm.cpp:1769-1782) assigns stack_id and pushes alloca->size_expr ? alloca->size_expr->serialize() : SerializedSizeExpr{} — empty for every scalar child. AdStackAllocaInfo::max_size_compile_time = 1.
  6. The codegen assertion stmt->max_size > 0 || stmt->size_expr passes (1 > 0). Stack id assigned, heap stride sized to 1 slot.
  7. Runtime publish_adstack_metadata (llvm_runtime_executor.cpp:671) sees size_exprs[i].nodes.empty(), falls back to max_size_compile_time=1. Per-thread heap holds 1 slot per scalar adstack.
  8. First push to v's scalar adstack overflows the per-thread slice. Surfaces as Adstack overflow at the next qd.sync().

Why existing code does not prevent it

Pre-PR, determine_ad_stack_size ran AFTER scalarize (in offload_to_executable, called at line 287 of the pre-PR file), so scalar adstacks created by scalarize had their max_size resolved by the analyzer directly. The new ordering inverts this: the symbolic bound is computed on the soon-to-be-deleted tensor adstack, then dropped on the floor.

Existing tensor-adstack tests in tests/python/test_adstack.py do not catch this: every Vector/Matrix reverse-mode test either uses ad_stack_size=N (the global override that bakes max_size=N into the alloca and bypasses the pre-pass entirely — the gather predicate at determine_ad_stack_size.cpp filters max_size==0) or uses a static iteration count (where Bellman-Ford resolves max_size directly to a constant > 1, which scalarize correctly copies).

Impact

Silent under-sizing of the per-thread adstack heap on a realistic kernel shape (Vector/Matrix loop-carried variables with field/ndarray-bounded dynamic loops). Surfaces as a runtime overflow at qd.sync() on every backend, with the misleading message that points at restructuring the loop or extending the grammar even though the structural pre-pass DID resolve it correctly. The grammar gap message is wrong for this case — the bound is fine, the propagation is broken.

Fix

One-line change in scalarize.cpp:719+:

auto scalar_ad_stack = std::make_unique<AdStackAllocaStmt>(element_type, stmt->max_size);
if (stmt->size_expr) {
  scalar_ad_stack->size_expr = stmt->size_expr;
}
scalar_ad_stack->ret_type = element_type;
scalar_ad_stack->ret_type.set_is_pointer(true);

SizeExpr is documented as immutable post-pre-pass (PR description's Side-effect audit row on Stmt clone / serialization), so the shared_ptr copy is cheap and safe to share across scalarized children. The original tensor adstack is being erased anyway, so the parent's shared_ptr ownership transfers naturally.

An alternative would be to re-run determine_ad_stack_size after scalarize (the deleted call site at offload_to_executable.cpp:288), but the shared_ptr copy above is the smaller change and matches the PR's stated immutability invariant.
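A regression test pinning the propagation might look like the hypothetical sketch below (the test name, tolerance, and `v.sum()` spelling are illustrative, not part of the PR; the kernel shape matches the trigger above):

import quadrants as qd

def test_vector_adstack_keeps_size_expr():
    qd.init(arch=qd.cpu, ad_stack_experimental_enabled=True)
    n = qd.field(qd.i32, shape=())
    x = qd.field(qd.f32, shape=4, needs_grad=True)
    y = qd.field(qd.f32, shape=(), needs_grad=True)

    @qd.kernel
    def k():
        for i in x:
            v = qd.Vector([x[i], x[i], x[i]])  # tensor adstack -> scalarized
            for _ in range(n[None]):           # symbolic bound -> size_expr path
                v = v * 0.95
            y[None] += v.sum()

    n[None] = 64
    k()
    y.grad[None] = 1.0
    k.grad()                 # overflows at sync if size_expr was dropped
    expected = 3.0 * 0.95 ** 64
    assert abs(x.grad[0] - expected) < 1e-5 * expected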

@hughperkins
Collaborator

Nice doc! Thanks!

Checklist:

  • doc has been updated, and I've reviewed
  • many of the new lines of code appear to be added to new files, dedicated to the new adstack sizing functionality (rather than mixed into existing files)

=> ok to merge
