[AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends#599
Conversation
force-pushed 877298a to 225d087
@claude review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 877298a8f9
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/_tensor_wrapper.py | 0% | 208-209 |
| 🔴 python/quadrants/lang/_ndarray.py | 33% | 90,106 |
| 🔴 python/quadrants/lang/field.py | 22% | 93-97,513,530 |
| 🔴 python/quadrants/lang/matrix.py | 50% | 1293 |
| 🟢 tests/python/test_adstack.py | 90% | 3493-3495,3497,3499-3500,3502-3509,3511-3514,3516-3518,3520-3524,3526-3527 |

Diff coverage: 86% · Overall: 74% · 287 lines, 40 missing
…ter slot + post-launch readback
…eline + params buffer members
…and size float heap from count
…Conditional and reflow comments
…ay-gated kernel across active fractions
…on reducer / main divergence
…ed gating fields via root buffer
…ess across active fractions
… bases, per-kind strides
…the LCA block (dormant
force-pushed b973d26 to 42d01fc
Quadrants implements autodiff at compile time: when `.grad()` is requested, the compiler emits a companion kernel that runs on the same backend as the forward one and writes gradients into the primal fields' `.grad` companions. There is no Python-side tape, no per-op dispatch overhead, and no dependency on an external AD framework. Forward mode and reverse mode are available on every backend Quadrants targets: x64 / arm64 CPU, CUDA, AMDGPU, Metal, and Vulkan.

**Recommendation.** Reverse-mode AD through dynamic loops (described further down) is currently behind an opt-in `ad_stack_experimental_enabled=True` flag at `qd.init`. We strongly recommend systematically enabling this flag as it is required for any reverse-mode kernel with a runtime-bounded loop carrying a non-linear primal, and free for every other kernel. See [the cost breakdown](./init_options.md#ad_stack_experimental_enabled) for details.
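For orientation, a minimal sketch of what the documented flag looks like from user code. Only the `ad_stack_experimental_enabled` keyword at `qd.init` comes from the doc excerpt above; the field/kernel/loop syntax is assumed Taichi-style and is not taken from this diff:

```python
import quadrants as qd

# Documented opt-in; arch / field / kernel syntax below is illustrative only.
qd.init(arch=qd.gpu, ad_stack_experimental_enabled=True)

x = qd.field(dtype=qd.f32, shape=64, needs_grad=True)
loss = qd.field(dtype=qd.f32, shape=(), needs_grad=True)

@qd.kernel
def compute(n_iter: qd.i32):
    for i in x:
        v = x[i]
        # Runtime-bounded loop carrying a non-linear primal: exactly the case
        # that needs the reverse-mode adstack this flag enables.
        for _ in range(n_iter):
            v = v * v
        loss[None] += v

compute(8)
compute.grad(8)
```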
"If you are using autodiff at all, we recommend"
…elling autodiff users to enable adstack
force-pushed 47fba3a to 4ae4b0d
…strongly recommend' to 'if you are using autodiff at all, we recommend'
force-pushed 4ae4b0d to 24ed143
….md label drift, in-LCA-block stack_init defense
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🟢 tests/python/test_adstack.py | 94% | 3750-3752,3754,3756-3757,3759-3766,3768-3771,3773-3775,3777-3781,3783-3784 |

Diff coverage: 94% · Overall: 74% · 509 lines, 28 missing
force-pushed 528f9f4 to c5ca25d
…IR-V dispatch at 65536, pin compound-index tests
force-pushed c5ca25d to 1c0011d
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🟢 tests/python/test_adstack.py | 99% | 3344,3809-3814 |

Diff coverage: 99% · Overall: 74% · 546 lines, 7 missing
(Totally orthogonal to your own PR: I feel like my kernel coverage is somehow not doing coverage on non-kernels 🤔 That's a bug I should fix. It is supposed to.)
…asing memcpy, f64-cap assert)
…ion as future work in autodiff.md
force-pushed a183fab to e951423
…tion bullet with ⚠️ emoji marker
checklist:
=> ok to merge
Coverage Report

| File | Coverage | Missing |
|---|---|---|
| 🟢 tests/python/test_adstack.py | 98% | 3361-3366,3812-3817 |

Diff coverage: 98% · Overall: 74% · 545 lines, 12 missing

Sparse adstack heap on every backend (Metal / Vulkan / CUDA / AMDGPU / CPU): introduce a per-task float / int split, an LCA-block lazy float-row claim, and a per-task reducer that sizes the float slab to the gate-passing iteration count
TL;DR
Genesis MPM `test_differentiable_push[gpu]` grad allocations, before vs after, on the same workload (post-PR numbers at HEAD=889cd8754, `test_differentiable_push[gpu]`; on CPU the pre-PR heap is `num_cpu_threads * (stride_f + stride_i)` per task). Aggregate: ~10-11x peak adstack-heap reduction on the tested workload (1.16 GB measured on AMDGPU at HEAD 889cd8754 vs ~12.8 GB pre-PR; same shape expected on CUDA, ~6.5x on Metal where pre-PR was already smaller).

The post-PR numbers were captured at HEAD 889cd8754 on AMDGPU via the persistent `QD_DEBUG_ADSTACK=1` diagnostic added to the LLVM and SPIR-V heap-bind paths (`runtime/llvm/llvm_runtime_executor.cpp::publish_adstack_metadata` + `ensure_per_task_float_heap_post_reducer`, `runtime/gfx/runtime.cpp::launch_kernel`). Run `QD_OFFLINE_CACHE=0 QD_DEBUG_ADSTACK=1 GS_ENABLE_NDARRAY=0 pytest --dev -n 0 -s "$HOME/workspace/src/genesis/tests/test_grad.py::test_differentiable_push[gpu]"` to reproduce; one `[adstack_heap] task='...' kind=F src=...` line per heap-bind event records the source (reducer_count / last_observed_x1.5 / worst_case fallback) and the resulting allocation, so any memory regression can be debugged without re-instrumenting.

The savings come from three layers (LCA-block lazy row claim, per-task reducer-driven float-heap sizing, dispatch-thread cap on LLVM GPU backends to match SPIR-V's `advisory_total_num_threads`); each layer alone reduces peak by a smaller multiple, but they compose.

Why
The pre-PR adstack heap layout has three independent over-allocation sources:
- An `if cell_active[i] > 0:` gate only needs a heap row for each thread that passes the gate, but the host launcher cannot see the gate and conservatively allocates `dispatched_threads * stride_float * sizeof(float)`. On Genesis MPM `mpm_grid_op` grad, with ~604K dispatched and ~47K matched, that's a 13x over-allocation on the float slab alone (see the arithmetic sketch below).
- The combined heap sizes every per-thread row at `stride_float + stride_int` even when the float side dominates.
- SPIR-V's `generate_struct_for_kernel` advisory caps total threads at 65536; the LLVM CUDA / AMDGPU launcher dispatches `saturating_grid_dim * block_dim` (~1.15M threads on a 144-SM Blackwell). Both backends grid-stride internally, so the wider LLVM dispatch is correctness-equivalent to the SPIR-V cap but pays ~17x heap memory at the same workload.
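The first bullet's ratio in numbers; plain back-of-the-envelope arithmetic (the thread counts come from the PR text, `stride_float` is illustrative since only the ratio matters):

```python
dispatched_threads = 604_000        # rows the pre-PR launcher sizes the float slab for
gate_passing_threads = 47_000       # rows actually needed behind the `cell_active[i] > 0` gate
stride_float = 96                   # illustrative per-thread float stride, in bytes

pre_pr = dispatched_threads * stride_float
post_pr = gate_passing_threads * stride_float
print(pre_pr / post_pr)             # ~12.9, i.e. the "13x over-allocation on the float slab alone"
```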
Without this PR, a Genesis MPM `test_differentiable_push` reverse-mode launch crosses Metal's `maxBufferLength` cap and `[MTLDevice newBufferWithLength:]` returns nil. PR #493 already hardened that path so the nil surfaces as `RhiResult::out_of_memory` and the launcher raises a clean `RuntimeError` rather than binding nil and silently reading zero from the float adstack heap (which is how the issue #2537 NaN reproducer manifested before #493). What remains on the current tree is the OOM itself: a workload that fits comfortably on Apple silicon's unified-memory budget cannot run because the per-launch heap is over-allocated by ~7x. This PR removes the over-allocation, so the kernel runs on Metal at ~1.22 GB instead of needing ~7.93 GB of heap, and the LLVM CUDA / AMDGPU equivalent drops from ~12.8 GB to ~1.16 GB (measured on AMDGPU); the larger pre-PR figure on LLVM is from the dispatched-thread count being ~17x SPIR-V's, addressed in section 6 below.

Mechanism end-to-end
1. Shared static analysis (`quadrants/transforms/static_adstack_analysis.{h,cpp}`)

`analyze_adstack_static_bounds(OffloadedStmt*, SNodeDescriptorResolver)` walks the task body once, classifies each `AdStackPushStmt` as bootstrap or normal, computes the LCA of all float push / load-top / load-top-adj parent blocks, and captures any `bound_expr` that gates that LCA from above (ndarray-backed `field[i] cmp literal` or SNode-backed equivalent). Returns:

- `lca`: the LCA block under which all non-bootstrap float adstack ops live, or null if there are no float adstack ops.
- `bootstrap_pushes`: the autodiff-emitted constant-init pushes whose row index is irrelevant to the runtime gate (the codegen suppresses the slot store at those sites and relies on the count-only init path).
- `bound_expr`: a serialised description of the gating predicate when it captures, including the SNode root id, byte offset, and cell stride for SNode sources, or `ndarray_arg_id` for ndarray sources.
- `per_thread_stride_float` / `per_thread_stride_int`: entry-count compile-time worst cases used by the codegen for SSA bookkeeping.

Both backends call this function. The SPIR-V codegen builds its `SNodeDescriptorResolver` from `compiled_structs_`; the LLVM codegen builds it via `spirv::compile_snode_structs(*prog->get_snode_root(matched_tree_id))` so SNode-backed gates carry the same root-buffer addressing the device-side reducer needs.
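For orientation, a sketch of the kernel shape whose gate this analysis is meant to capture: an ndarray-backed `field[i] cmp literal` predicate that dominates all the float adstack traffic. The `@qd.kernel` / `qd.types.ndarray()` syntax is assumed Taichi-style and is not part of this PR:

```python
import quadrants as qd

@qd.kernel
def gated(cell_active: qd.types.ndarray(), x: qd.types.ndarray(),
          out: qd.types.ndarray(), n_iter: qd.i32):
    for i in range(cell_active.shape[0]):
        if cell_active[i] > 0:          # bound_expr candidate: ndarray-backed `field[i] cmp literal`
            v = x[i]
            for _ in range(n_iter):     # runtime-bounded recurrence: the float adstack pushes live
                v = v * x[i]            # under this `if`, so its body is the LCA the analysis reports
            out[i] = v
```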
2. Per-kernel lazy-claim runtime arrays

Two new fields on the runtime struct (`LLVMRuntime` for LLVM; gfx-runtime equivalents for SPIR-V): `adstack_row_counters[task_id]` and `adstack_bound_row_capacities[task_id]`. The launcher allocates / clears both before the first task of every launch (`publish_adstack_lazy_claim_buffers(num_tasks)` on the LLVM side; the SPIR-V side initialises matching SSBOs in `runtime/gfx/runtime.cpp`). The codegen emits an `atomicrmw add` (`OpAtomicIIncrement` on SPIR-V) against `adstack_row_counters[task_codegen_id]` at the float-LCA block, stores the per-thread claimed row id into a function-scope `row_id_var` alloca, and clamps the result against `adstack_bound_row_capacities[task_codegen_id]`, so threads that never reach the LCA never claim a row. The clamp explicitly guards `capacity == 0` so the upper bound stays at row 0 instead of underflowing to UINT32_MAX.
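A plain-Python model of that claim-and-clamp sequence (host-side illustration only, not the emitted LLVM/SPIR-V; names mirror the runtime fields above):

```python
import itertools

def claim_row(row_counter, capacity):
    """Per-thread row claim emitted at the float-LCA block."""
    claimed = next(row_counter)                    # atomicrmw add / OpAtomicIIncrement
    upper = 0 if capacity == 0 else capacity - 1   # guard: capacity == 0 must not underflow to UINT32_MAX
    return min(claimed, upper)                     # clamp into the reducer-sized capacity

row_counter = itertools.count(0)                   # adstack_row_counters[task_id]
capacity = 4                                       # adstack_bound_row_capacities[task_id]
print([claim_row(row_counter, capacity) for _ in range(6)])  # [0, 1, 2, 3, 3, 3]
print(claim_row(itertools.count(0), 0))                      # 0, not 0xFFFFFFFF
```

Threads that never reach the LCA simply never execute the claim, which is what keeps unclaimed rows free.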
3. Codegen split-heap routing

Both backends route allocas as follows:

- `f32` allocas in tasks with a captured `bound_expr` go on the lazy float-heap path: every push / load-top / load-top-adj / pop site recomputes the address as `heap_float + row_id_var * stride_float + float_offset_within_float_slice`. The row claim fires at the LCA, not at the offload root.
- `f32` allocas in tasks without a captured `bound_expr` use the eager path with the float heap: `heap_float + linear_thread_idx * stride_float + float_offset`.
- `i32` / `u1` allocas always use the eager path with the int heap: `heap_int + linear_thread_idx * stride_int + int_offset`. Autodiff emits int-adstack pushes at the offload body root unconditionally for control-flow replay, so folding them into the float LCA computation would pull the LCA up to the offload root and eliminate the float-heap savings.
ensure_ad_stack_heap_base_split_llvm()andensure_ad_stack_metadata_split_llvm()cache the split-heap base / stride SSA values atentry_blockonce per task; SPIR-V'sget_ad_stack_heap_thread_base_{float,int}()does the same in the SPIR-V codegen.4. Per-launch heap sizing
Both backend host paths build the per-task
host_offsets[]table with a single split-layout pass:Same scheme regardless of
bound_expr.host_offsets[i]is now a within-slice byte offset; the codegen multiplies the right (linear_tidorrow_id_var) row index by the matching per-kind stride and adds the offset. On LLVM, the device-sideruntime_eval_adstack_size_expr(the GPU sizer kernel that resolvesExternalTensorRead-leaf size_exprs) also writes per-kind offsets - earlier drafts wrote the combined prefix sum, which would alias float and int slots on any kernel mixing both kinds with at least one ndarray-leaf size_expr.The LLVM combined heap (
runtime->adstack_heap_buffer) is no longer dereferenced by the codegen and is no longer allocated by the launcher; the field stays inLLVMRuntimefor now so existing offline-cache-loaded kernels that load the combined-stride field can still link, but the published value mirrorsstride_int_bytesso any such kernel observes the smaller int-only stride.5. Per-arch device-side reducer + post-reducer float-heap sizing
Each launcher goes through this sequence per task:
- `publish_adstack_metadata(task.ad_stack, n, ctx, ...)` publishes the split offsets / strides as above.
- `publish_per_task_bound_count_*(task_index, task.ad_stack, length, ctx, ...)` on CPU walks the gating ndarray / SNode in host code; on CUDA / AMDGPU it encodes the gate parameters into a `LlvmAdStackBoundReducerDeviceParams` struct and dispatches a single-thread runtime kernel (`runtime_eval_static_bound_count`) that walks the same source on device and writes the count into `adstack_bound_row_capacities[task_index]`. The reducer kernel handles both ndarray (`ctx->arg_buffer + arg_word_offset`) and SNode (`runtime->roots[snode_root_id] + byte_base_offset + i * cell_stride`) sources. SPIR-V uses an equivalent compute-shader reducer dispatched from `runtime/gfx/adstack_bound_reducer_launch.cpp`.
- `ensure_per_task_float_heap_post_reducer(task_index, task.ad_stack, n)` reads the count back (host load on CPU; small DtoH on CUDA / AMDGPU; SSBO mapping on SPIR-V) and sizes the float heap to `max(count, 1) * stride_float_bytes`. Grow-on-demand is amortised-doubling, so a sequence of monotonically-growing counts costs O(log peak) reallocations.
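A host-side model of the last two steps (illustration only; the real count kernel and grow policy live in the runtime files named above):

```python
def bound_count(gate_values, threshold=0.0):
    """Model of the reducer: walk the FULL flat gating array, not the
    dispatched-thread count, so the capacity matches total LCA claims."""
    return sum(1 for v in gate_values if v > threshold)

class FloatHeapModel:
    """Post-reducer sizing with amortised doubling, floored at one row."""
    def __init__(self):
        self.rows = 0
    def ensure(self, count, stride_float_bytes):
        needed = max(count, 1)
        if needed > self.rows:
            new_rows = max(self.rows, 1)
            while new_rows < needed:
                new_rows *= 2              # amortised doubling: O(log peak) reallocations
            self.rows = new_rows
        return self.rows * stride_float_bytes

heap = FloatHeapModel()
gate = [0.0] * 557_000 + [1.0] * 47_000    # ~604K cells, ~47K pass the gate
print(heap.ensure(bound_count(gate), stride_float_bytes=96))  # sized from the count, not from 604K threads
```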
lengthcomes from the gating ndarray's full flat element count (array_runtime_sizes[arg_id] / sizeof(elem)on LLVM; equivalentresolve_lengthoverrange_for_attribs->end_shape_producton SPIR-V) rather than the dispatched / worker-pool thread count: the lazy row-claim atomic-rmw fires once per LCA execution, and grid-strided GPU kernel bodies (gpu_parallel_struct_forwithi = block_idx(); i += grid_dim(),gpu_parallel_range_forwithidx += block_dim() * grid_dim()) plus CPU per-iteration invocations (cpu_parallel_range_for_taskrunning each iteration on its own stack frame) can hit the LCA more times than there are concurrent dispatched threads. Walking the reducer over the full gating ndarray keepsbound_row_capacities[task_index]consistent with the total claim count.6. CUDA / AMDGPU adstack-bearing-task dispatch cap
`runtime/cuda/kernel_launcher.cpp` and `runtime/amdgpu/kernel_launcher.cpp` define `kAdStackMaxConcurrentThreads = 65536` (matching SPIR-V's `generate_struct_for_kernel` advisory) and apply two caps for tasks whose `task.ad_stack.allocas` is non-empty:

- `resolve_num_threads(...)` clamps the heap-sizing thread count to `kAdStackMaxConcurrentThreads`, so `ensure_adstack_heap_{int,float}` allocates rows for at most that many threads.
- The launch grid is clamped to `kAdStackMaxConcurrentThreads / task.block_dim` blocks (floor division; see `test_adstack_gpu_dispatch_cap_uses_floor_division`) before `cuda_module->launch(...)` / `amdgpu_module->launch(...)`, so the kernel actually dispatches at most that many concurrent threads. The runtime-side grid-strided loops cover the full element list / range with fewer dispatched threads at the cost of more iterations per thread.
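A quick numeric check of why the grid clamp has to round down; the numbers mirror the `block_dim=192` parametrisation guarded by the floor-division test listed in the Tests section below:

```python
kAdStackMaxConcurrentThreads = 65536
block_dim = 192

floor_blocks = kAdStackMaxConcurrentThreads // block_dim            # 341
ceil_blocks = -(-kAdStackMaxConcurrentThreads // block_dim)         # 342

print(floor_blocks * block_dim)   # 65472 <= 65536 heap rows: in bounds, grid-stride covers the rest
print(ceil_blocks * block_dim)    # 65664 >  65536 heap rows: the out-of-bounds threads that fault
```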
task.grid_dim = saturating_grid_dimfor max throughput.Per-backend coverage matrix
- CPU (LLVM): row index is `cpu_thread_id` (eager) or the claimed row (lazy under `bound_expr`).
- CUDA / AMDGPU (LLVM): row index is `linear_thread_idx` (eager) or the claimed row (lazy under `bound_expr`).
- Metal / Vulkan (SPIR-V): split heaps bound as `BufferType::AdStackHeapFloat` + `AdStackHeapInt`; row index is `gl_GlobalInvocationID` (eager) or the claimed row (lazy under `bound_expr`); dispatch already capped by `advisory_total_num_threads = 65536`.

Tests
- `test_adstack_static_bound_expr_ndarray_gate_grad_correct`: parametrised on `gated_fraction in {0.0, 0.05, 0.5, 1.0}`. The 0.0 axis exercises the capacity-zero clamp guard.
- `test_adstack_static_bound_expr_snode_gate_grad_correct`: SNode-backed gate (`qd.field` under `qd.root.dense`); the analyser captures the SNode descriptor triple and the device-side reducer / SPIR-V shader walks the root buffer directly.
- `test_adstack_static_bound_expr_snode_gate_cpu_grad_correct`: covers `publish_per_task_bound_count_cpu`. Reverting the SNode arm SIGSEGVs at `compute.grad` on macOS arm64. Backend: `qd.cpu`.
- `test_adstack_static_bound_expr_ndarray_gate_debug_build_grad_correct`: covers the `stack_init` skip in the lazy float branch + the bootstrap-PUSH skip; parametrised on alloca-inside / alloca-outside the gate. Runs with `debug=True`.
- `test_adstack_static_bound_expr_memory_savings_runs_clean`: runs every `SizeExpr` shape (int const / scalar field / ndarray shape / ndarray read / two-arg range) end-to-end through the bound-expr capture path. Catches a regression that drops a specific bound shape from the analyser.
- `test_adstack_static_bound_expr_primal_dependent_inner_recurrence_grad_correct`: primal-dependent inner recurrence (`v = x[i]^2` then an n_iter recurrence), so any heap-aliasing regression appears as wrong per-i gradients.
- `test_adstack_static_bound_expr_non_loop_var_index_falls_back_to_worst_case`: covers the `match_field_source` rejection of non-LoopIndex gate indices (e.g. `selector[i % K]`); the rejected capture falls back to the worst-case sizing path.
- `test_adstack_static_bound_expr_device_sizer_per_kind_offsets_grad_correct`: covers the `runtime_eval_adstack_size_expr` per-kind `out_offsets[i]` write. Reverting to the combined prefix sum aliases float / int slots and produces wrong-but-not-NaN gradients. Backends: `qd.cuda`, `qd.amdgpu`.
- `test_adstack_gpu_dispatch_cap_uses_floor_division`: with `block_dim=192`, `n=65700`, `ad_stack_size=2048`, the reverted cap over-dispatches by up to `block_dim - 1` threads past the heap row count and faults as `hipErrorIllegalAddress` / `cudaErrorIllegalAddress` at `compute.grad`. Backends: `qd.cuda`, `qd.amdgpu`.
- `test_adstack_static_bound_expr_f64_gate_grad_correct`: the f64 gate threshold travels as `(threshold_bits, threshold_bits_high)` and the shader walks f64 cells with two-u32 PSB loads reassembled into a u64. Reverting the arm decodes the threshold as 0.0 and over-counts gate-passing cells.
- `test_adstack_static_bound_expr_resolve_length_walks_full_ndarray`: covers `resolve_length` walking the full ndarray flat product instead of capping at `kMaxNumThreadsGridStrideLoop = 131072`. Pre-fix, the reducer counts 0 gate-passing cells past the cap and the runtime sync raises the divergence-overflow signal. Backends: `qd.metal`, `qd.vulkan`.
- `test_adstack_overflow_raises` / `..._reset_after_catch`: cover `qd.sync()` raising `RuntimeError("[Aa]dstack overflow")` and clearing the flag for the next launch.
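The overflow pair at the end of the list reduces to a simple assertion shape; a hedged sketch of that pattern (the overflowing and small kernels are elided and passed in as callables; only `qd.sync()` and the quoted error regex come from the test descriptions above):

```python
import pytest

def check_overflow_raises_then_resets(qd, launch_overflowing_kernel, launch_small_kernel):
    """Sketch of the assertion shape only; the two launch callables stand in for real qd kernels."""
    launch_overflowing_kernel()
    with pytest.raises(RuntimeError, match=r"[Aa]dstack overflow"):
        qd.sync()                      # the overflow surfaces at sync, not at launch
    launch_small_kernel()
    qd.sync()                          # flag cleared: the next launch syncs cleanly
```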
Side-effect audit

- `analysis/offline_cache_util.cpp`, `analysis/gen_offline_cache_key.cpp`: `QD_STMT_DEF_FIELDS(...)` on `AdStackAllocaStmt`.
- IR comparison (`same_statements`) / WholeKernelCSE: `field_manager.equal()` on `AdStackAllocaStmt`.
- `runtime->adstack_heap_buffer`, `_size`, `_per_thread_stride`: see section 4; the combined heap is no longer allocated and the published stride mirrors `stride_int_bytes`.
- `runtime/cuda/kernel_launcher.cpp`, `runtime/amdgpu/kernel_launcher.cpp`: the caps apply only when `!task.ad_stack.allocas.empty()`; tasks without an adstack keep `saturating_grid_dim` unchanged.
- `codegen_llvm.cpp::visit(AdStackAllocaStmt)`, `visit(Block*)`: `stack_init` for lazy float allocas is emitted at the LCA block (after the row claim), not at the offload root where `row_id_var` is still UINT32_MAX. Release build uses the per-stack count alloca and is unaffected.
- `codegen_llvm.cpp::emit_ad_stack_row_claim_llvm`: `select(capacity == 0, 0, capacity - 1)` so the clamp upper bound stays in-bounds when the reducer reports zero matches; the launcher floors the heap allocation at one row precisely so the single-slot fallback is always backed by real storage.
- `runtime/llvm/runtime_module/runtime.cpp::runtime_eval_adstack_size_expr`: `out_offsets[i]` is a per-kind byte offset within the float-only or int-only slice (mirrors the host-eval branch and the SPIR-V sizer's `OpSelect`). Earlier drafts wrote the combined prefix sum, which would alias float and int slots on any kernel mixing both kinds with at least one ndarray-leaf size_expr.
- `quadrants/codegen/llvm/CMakeLists.txt`: `llvm_codegen` now explicitly links against `spirv_codegen` for the `compile_snode_structs` call the SNode-backed-gate descriptor resolver makes. Linux / Mac satisfied this transitively via the final shared-module link order; MSVC's linker requires the explicit dep.
- `codegen/spirv/adstack_bound_reducer_shader.{h,cpp}`, `runtime/gfx/adstack_bound_reducer_launch.cpp`: `AdStackBoundReducerParams` carries `field_dtype_is_double` + `threshold_bits_high`; the launcher splits `*reinterpret_cast<const uint64_t *>(&literal_f64)` across the lo / hi u32 pair and the shader walks f64 cells via `psb_load_u64_pair` (two adjacent 4-byte u32 loads + register reassembly) into an f64 `OpFOrd*` compare arm. Devices without `spirv_has_float64` keep the f64 inner arm code-stripped at shader build time and the launcher's matched-task filter drops f64 captures back to dispatched-threads worst-case sizing.
- `bound_count_length` shape walk in `runtime/cuda/kernel_launcher.cpp`, `runtime/amdgpu/kernel_launcher.cpp`: uses `ctx.get_struct_arg_host<int32_t>(indices)`, NOT `get_struct_arg`. `launch_llvm_kernel` swaps `ctx_->arg_buffer` to a device pointer (cuda:269-274 / amdgpu:230-235) before `launch_offloaded_tasks` runs, so a plain `get_struct_arg` would dereference device memory from the host (SIGSEGV / `CUDA_ERROR_ILLEGAL_ADDRESS` on drivers without HMM, garbage `flat_len` on HMM-capable setups). The host backing buffer `arg_buffer_` stays host-resident across the swap.
- `runtime/gfx/adstack_bound_reducer_launch.cpp::dispatch_adstack_bound_reducers`: devices missing the reducer capabilities (`shaderInt64` / `bufferDeviceAddress`) still receive inert defaults the codegen clamp leaves alone. Without the hoist, the bind path routes `kDeviceNullAllocation` to the descriptor slot, robustBufferAccess returns 0, the divergence-overflow `OpAtomicUMax` fires unconditionally, and every adstack-bearing kernel hard-errors at sync.
- `last_observed_rows_per_task_` heap-bind tertiary fallback in the `runtime/gfx/runtime.cpp` heap-bind path: tasks without a usable `bound_expr` (compound gate predicate, capability-missing device) size from `ceil(last_observed * 1.5)` instead of the `dispatched_threads` worst case when a prior `synchronize()` snapshot recorded the LCA claim count for the same task name. The 1.5x cushion absorbs run-to-run variance without forcing amortised-doubling reallocation on every modest workload uplift. The int heap stays at the dispatched-threads worst case because int allocas use the eager `linear_tid * stride_int` mapping.
- `snode_resolver` tree-id scan bound in `codegen/llvm/codegen_llvm.cpp::init_offloaded_task_function`: the scan is bounded by `prog->get_snode_tree_size()` and continues past nullptr slots (recycled tree-id holes from `free_snode_tree_ids_`). `Program::get_snode_root` is a raw `snode_trees_[id]->root()` with no bounds check, so an unbounded loop is `std::vector::operator[]` UB on stale-IR / cross-program / offline-cache-restore paths.
- `resolve_length` walks the full ndarray in `runtime/gfx/adstack_bound_reducer_launch.cpp::resolve_length_ndarray`: computes the flat element count via `host_ctx.get_struct_arg<int32_t>(indices)` shape reads instead of capping at `advisory_total_num_threads`. Pre-fix, kernels with N > 131072 (the range_for cap) under-counted gate-passing cells past the cap; the float adstack heap was sized to the truncated count and the codegen-emitted clamp aliased every later gated iteration into the smaller row range.