
[AutoDiff] Cut reverse-mode adstack memory usage 10x on all backends#599

Merged
duburcqa merged 88 commits into main from duburcqa/sparse_adstack_heap
May 1, 2026
Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented Apr 30, 2026

Sparse adstack heap on every backend (Metal / Vulkan / CUDA / AMDGPU / CPU): introduce a per-task float / int split, an LCA-block lazy float-row claim, and a per-task reducer that sizes the float slab to the gate-passing iteration count

Pre-PR, the per-task adstack heap was a single combined slab sized at dispatched_threads * (stride_float + stride_int) * sizeof(elem), with every dispatched thread allocated a full row regardless of whether it ever entered the gate-controlled body. On the pre-PR tree, Genesis MPM mpm_grid_op_c65_0 reverse on Metal allocates a ~7.93 GB adstack heap; the same kernel on the Nvidia / AMDGPU LLVM backends allocates an even larger combined slab on the intermediate tree where split-stride publication is in place but the lazy-row work and the dispatch cap have not yet landed. This PR drops the combined-slab worst-case sizing across every backend by capturing a per-task bound_expr gate at IR analysis time, claiming heap rows at the gate's Lowest Common Ancestor (LCA) block instead of at the offload root, and sizing the float slab off the gate-passing count published by a per-task reducer.

TL;DR

Genesis MPM test_differentiable_push[gpu] grad allocations, before vs after, on the same workload:

| Backend | Before | After |
| --- | --- | --- |
| Metal (Apple M3, MoltenVK 1D) | ~7.93 GB combined slab | ~1.22 GB |
| CUDA (RTX 6000 Blackwell) | 6.14 GB combined + 6.14 GB int + 0.52 -> 1.04 GB float | 666 MB int + 0.52 -> 1.04 GB float |
| AMDGPU LLVM (Genesis MPM) | ~12.8 GB combined slab | 528 MB float + 636 MB int = 1.16 GB heap, 3.43 GB process VRAM peak (HEAD=889cd8754, test_differentiable_push[gpu]) |
| CPU (8 worker pool) | combined num_cpu_threads * (stride_f + stride_i) per task | int-only worst case + reducer-driven float |

Aggregate: ~10-11x peak adstack-heap reduction on the tested workload (1.16 GB measured on AMDGPU at HEAD 889cd8754 vs ~12.8 GB pre-PR; same shape expected on CUDA, ~6.5x on Metal where pre-PR was already smaller).

The post-PR numbers were captured at HEAD 889cd8754 on AMDGPU via the persistent `QD_DEBUG_ADSTACK=1` diagnostic added to the LLVM and SPIR-V heap-bind paths (`runtime/llvm/llvm_runtime_executor.cpp::publish_adstack_metadata` + `ensure_per_task_float_heap_post_reducer`, `runtime/gfx/runtime.cpp::launch_kernel`). Run `QD_OFFLINE_CACHE=0 QD_DEBUG_ADSTACK=1 GS_ENABLE_NDARRAY=0 pytest --dev -n 0 -s "$HOME/workspace/src/genesis/tests/test_grad.py::test_differentiable_push[gpu]"` to reproduce; one `[adstack_heap] task='...' kind=F src=...` line per heap-bind event records the source (reducer_count / last_observed_x1.5 / worst_case fallback) and the resulting allocation, so any memory regression can be debugged without re-instrumenting.

The savings come from three layers (LCA-block lazy row claim, per-task reducer-driven float-heap sizing, and a dispatch-thread cap on the LLVM GPU backends matching SPIR-V's advisory_total_num_threads); each layer alone cuts peak memory by a smaller multiple, but the three compose.

Why

The pre-PR adstack heap layout has three independent over-allocation sources:

  • Float slab sized at dispatched-thread worst case. A reverse-mode kernel whose float adstack ops live below an if cell_active[i] > 0: gate only needs a heap row for each thread that *passes* the gate, but the host launcher cannot see the gate and conservatively allocates dispatched_threads * stride_float * sizeof(float). On Genesis MPM mpm_grid_op grad with ~604K dispatched and ~47K matched threads, that is a ~13x over-allocation on the float slab alone.
  • Combined slab packs float + int strides. Every alloca's per-thread offset is a cumulative running sum within one combined slice; an int alloca whose adstack stride is small (loop counter / branch flag, typically tens of i32 entries) inflates the combined stride_float + stride_int even when the float side dominates.
  • LLVM GPU dispatch over-provisions threads. SPIR-V's generate_struct_for_kernel advisory caps total threads at 65536; the LLVM CUDA / AMDGPU launcher dispatches saturating_grid_dim * block_dim (~1.15M threads on a 144-SM Blackwell). Both backends grid-stride internally, so the wider LLVM dispatch is correctness-equivalent to the SPIR-V cap but pays ~17x the heap memory on the same workload.

Without this PR, a Genesis MPM test_differentiable_push reverse-mode launch crosses Metal's maxBufferLength cap and [MTLDevice newBufferWithLength:] returns nil. PR #493 already hardened that path so the nil surfaces as RhiResult::out_of_memory and the launcher raises a clean RuntimeError rather than binding nil and silently reading zero from the float adstack heap (which is how the issue #2537 NaN reproducer manifested before #493). What remains on the current tree is the OOM itself: a workload that fits comfortably in Apple silicon's unified-memory budget cannot run because the per-launch heap is over-allocated by ~7x. This PR removes the over-allocation, so the kernel runs on Metal at ~1.22 GB instead of needing ~7.93 GB of heap, and the LLVM CUDA / AMDGPU equivalent drops from ~12.8 GB to ~1.16 GB, measured on AMDGPU (the larger pre-PR figure on LLVM comes from the dispatched-thread count being ~17x SPIR-V's, addressed in section 6 below).

Mechanism end-to-end

1. Shared static analysis (quadrants/transforms/static_adstack_analysis.{h,cpp})

analyze_adstack_static_bounds(OffloadedStmt*, SNodeDescriptorResolver) walks the task body once, classifies each AdStackPushStmt as bootstrap or normal, computes the LCA of all float push / load-top / load-top-adj parent blocks, and captures any bound_expr that gates that LCA from above (ndarray-backed field[i] cmp literal or SNode-backed equivalent). Returns:

  • lca: the LCA block under which all non-bootstrap float adstack ops live, or null if there are no float adstack ops.
  • bootstrap_pushes: the autodiff-emitted constant-init pushes whose row index is irrelevant to the runtime gate (the codegen suppresses the slot store at those sites and relies on the count-only init path).
  • bound_expr: a serialised description of the gating predicate when it captures, including the SNode root id, byte-offset, and cell-stride for SNode sources, or ndarray_arg_id for ndarray sources.
  • per_thread_stride_float / per_thread_stride_int: entry-count compile-time worst cases used by the codegen for SSA bookkeeping.

Both backends call this function. The SPIR-V codegen builds its SNodeDescriptorResolver from compiled_structs_; the LLVM codegen builds it via spirv::compile_snode_structs(*prog->get_snode_root(matched_tree_id)) so SNode-backed gates carry the same root-buffer addressing the device-side reducer needs. A hypothetical sketch of the returned shape follows.
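For orientation, here is a hypothetical C++ sketch of the returned shape, assembled only from the fields listed above; the struct and field spellings are illustrative, not the actual static_adstack_analysis.h declarations:

```cpp
#include <cstddef>
#include <vector>

struct Block;            // IR block type (project type, assumed)
struct AdStackPushStmt;  // IR statement type (project type, assumed)

// Serialised gate predicate; fields per the bound_expr description above.
struct AdStackBoundExpr {
  int snode_root_id = -1;            // SNode-backed gate source, or...
  int ndarray_arg_id = -1;           // ...ndarray-backed gate source
  std::size_t byte_base_offset = 0;  // SNode cell addressing
  std::size_t cell_stride = 0;
  bool captured = false;             // false: fall back to worst-case sizing
};

struct StaticAdStackBounds {
  Block *lca = nullptr;  // null if the task has no float adstack ops
  std::vector<AdStackPushStmt *> bootstrap_pushes;  // count-only init sites
  AdStackBoundExpr bound_expr;  // gating predicate, when one captures
  std::size_t per_thread_stride_float = 0;  // compile-time worst case
  std::size_t per_thread_stride_int = 0;    // (bytes per row)
};
```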

2. Per-kernel lazy-claim runtime arrays

Two new fields on the runtime struct (LLVMRuntime for LLVM; gfx-runtime equivalents for SPIR-V): adstack_row_counters[task_id] and adstack_bound_row_capacities[task_id]. The launcher allocates / clears both before the first task of every launch (publish_adstack_lazy_claim_buffers(num_tasks) on the LLVM side; the SPIR-V side initialises matching SSBOs in runtime/gfx/runtime.cpp). The codegen emits an atomicrmw add (OpAtomicIIncrement on SPIR-V) against adstack_row_counters[task_codegen_id] at the float-LCA block, stores the per-thread claimed row id into a function-scope row_id_var alloca, and clamps the result against adstack_bound_row_capacities[task_codegen_id] so threads that never reach the LCA never claim a row. The clamp explicitly guards capacity == 0 so the upper bound stays at row 0 instead of underflowing to UINT32_MAX.
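Written out as host-style C++ instead of emitted LLVM IR / SPIR-V, the claim-and-clamp at the float-LCA block amounts to the following sketch (names illustrative; the real codegen emits an atomicrmw add / OpAtomicIIncrement against the runtime arrays):

```cpp
#include <atomic>
#include <cstdint>

// One claim per LCA execution. Threads that never reach the LCA never call
// this -- the whole point of claiming at the LCA instead of the offload root.
uint32_t claim_float_heap_row(std::atomic<uint32_t> &row_counter,  // adstack_row_counters[task]
                              uint32_t capacity) {                 // adstack_bound_row_capacities[task]
  uint32_t row = row_counter.fetch_add(1, std::memory_order_relaxed);
  // Explicit capacity == 0 guard: a bare capacity - 1 would underflow to
  // UINT32_MAX and the clamp would stop clamping.
  uint32_t max_row = (capacity == 0) ? 0u : capacity - 1u;
  return (row < max_row) ? row : max_row;  // stored into row_id_var
}
```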

3. Codegen split-heap routing

Both backends route allocas unconditionally; the resulting address arithmetic is sketched after this list:

  • f32 allocas in tasks with a captured bound_expr go on the lazy float-heap path: every push / load-top / load-top-adj / pop site recomputes the address as heap_float + row_id_var * stride_float + float_offset_within_float_slice. The row claim fires at the LCA, not at the offload root.
  • f32 allocas in tasks without a captured bound_expr use the eager path with the float heap: heap_float + linear_thread_idx * stride_float + float_offset.
  • i32 / u1 allocas always use the eager path with the int heap: heap_int + linear_thread_idx * stride_int + int_offset. Autodiff emits int-adstack pushes at the offload body root unconditionally for control-flow replay, so folding them into the float LCA computation would pull the LCA up to the offload root and eliminate the float-heap savings.
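A minimal sketch of that arithmetic in C++ (illustrative parameter names; the backends emit the equivalent SSA / SPIR-V and pull the cached bases / strides from the helpers described next):

```cpp
#include <cstdint>

// The three routing cases from the list above, flattened into one function.
uint8_t *adstack_slot_address(bool is_float_alloca, bool task_has_bound_expr,
                              uint8_t *heap_float, uint8_t *heap_int,
                              uint64_t linear_thread_idx, uint64_t row_id_var,
                              uint64_t stride_float, uint64_t stride_int,
                              uint64_t offset_within_slice) {
  if (!is_float_alloca)  // i32 / u1: always eager, always the int heap
    return heap_int + linear_thread_idx * stride_int + offset_within_slice;
  // f32: lazily-claimed row under a captured gate, eager row otherwise.
  uint64_t row = task_has_bound_expr ? row_id_var : linear_thread_idx;
  return heap_float + row * stride_float + offset_within_slice;
}
```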

LLVM's ensure_ad_stack_heap_base_split_llvm() and ensure_ad_stack_metadata_split_llvm() cache the split-heap base / stride SSA values at entry_block once per task; SPIR-V's get_ad_stack_heap_thread_base_{float,int}() does the same in the SPIR-V codegen.

4. Per-launch heap sizing

Both backend host paths build the per-task host_offsets[] table with a single split-layout pass:

```cpp
for (size_t i = 0; i < allocas.size(); ++i) {
  // 8-byte count header + worst-case payload, rounded up to 8 bytes.
  const size_t step = align_up_8(sizeof(int64_t) + entry_size_bytes[i] * host_max_sizes[i]);
  size_t &stride = (allocas[i].heap_kind == HeapKind::Float) ? stride_float_bytes
                                                             : stride_int_bytes;
  host_offsets[i] = stride;  // within-slice byte offset, not a combined prefix sum
  stride += step;
}
```

Same scheme regardless of bound_expr. host_offsets[i] is now a within-slice byte offset; the codegen multiplies the right (linear_tid or row_id_var) row index by the matching per-kind stride and adds the offset. On LLVM, the device-side runtime_eval_adstack_size_expr (the GPU sizer kernel that resolves ExternalTensorRead-leaf size_exprs) also writes per-kind offsets - earlier drafts wrote the combined prefix sum, which would alias float and int slots on any kernel mixing both kinds with at least one ndarray-leaf size_expr.

The LLVM combined heap (runtime->adstack_heap_buffer) is no longer dereferenced by the codegen and is no longer allocated by the launcher; the field stays in LLVMRuntime for now so existing offline-cache-loaded kernels that load the combined-stride field can still link, but the published value mirrors stride_int_bytes so any such kernel observes the smaller int-only stride.

5. Per-arch device-side reducer + post-reducer float-heap sizing

Each launcher goes through this sequence per task:

  1. publish_adstack_metadata(task.ad_stack, n, ctx, ...) - publishes the split offsets / strides as above.
  2. publish_per_task_bound_count_*(task_index, task.ad_stack, length, ctx, ...) - on CPU walks the gating ndarray / SNode in host code; on CUDA / AMDGPU encodes the gate parameters into a LlvmAdStackBoundReducerDeviceParams struct and dispatches a single-thread runtime kernel (runtime_eval_static_bound_count) that walks the same source on device and writes the count into adstack_bound_row_capacities[task_index]. The reducer kernel handles both ndarray (ctx->arg_buffer + arg_word_offset) and SNode (runtime->roots[snode_root_id] + byte_base_offset + i * cell_stride) sources. SPIR-V uses an equivalent compute-shader reducer dispatched from runtime/gfx/adstack_bound_reducer_launch.cpp.
  3. ensure_per_task_float_heap_post_reducer(task_index, task.ad_stack, n) - reads the count back (host load on CPU; small DtoH on CUDA / AMDGPU; SSBO mapping on SPIR-V) and sizes the float heap to max(count, 1) * stride_float_bytes. Grow-on-demand is amortised doubling, so a sequence of monotonically growing counts costs O(log peak) reallocations. Steps 2 and 3 are condensed into the sketch after this list.
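Steps 2 and 3 in one host-side sketch, modeled on the CPU arm (the CUDA / AMDGPU and SPIR-V arms run the same walk on device and read the count back); gate(i) stands in for the captured bound_expr predicate, and every other name is illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

template <typename Gate>
void publish_count_and_size_float_heap(Gate gate, uint32_t length,
                                       uint32_t task_index,
                                       uint64_t stride_float_bytes,
                                       std::vector<uint32_t> &bound_row_capacities,
                                       std::vector<uint8_t> &float_heap) {
  // Step 2: walk the full gating source and count gate-passing cells.
  uint32_t count = 0;
  for (uint32_t i = 0; i < length; ++i)
    count += gate(i) ? 1u : 0u;
  bound_row_capacities[task_index] = count;
  // Step 3: size the float slab to the count, floored at one row so the
  // capacity == 0 clamp fallback is always backed by real storage.
  uint64_t needed = uint64_t(std::max(count, 1u)) * stride_float_bytes;
  if (float_heap.size() < needed) {
    // Amortised doubling: monotone growth costs O(log peak) reallocations.
    float_heap.resize(std::max<uint64_t>(needed, float_heap.size() * 2));
  }
}
```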

Reducer length comes from the gating ndarray's full flat element count (array_runtime_sizes[arg_id] / sizeof(elem) on LLVM; the equivalent resolve_length over range_for_attribs->end_shape_product on SPIR-V) rather than from the dispatched / worker-pool thread count. The lazy row-claim atomic-rmw fires once per LCA execution, and grid-strided GPU kernel bodies (gpu_parallel_struct_for with i = block_idx(); i += grid_dim(), gpu_parallel_range_for with idx += block_dim() * grid_dim()) plus CPU per-iteration invocations (cpu_parallel_range_for_task running each iteration on its own stack frame) can hit the LCA more times than there are concurrently dispatched threads. Walking the reducer over the full gating ndarray keeps bound_row_capacities[task_index] consistent with the total claim count.
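A minimal C++ rendering of that claim-count argument, with the GPU grid-stride indexing flattened into parameters (illustrative only):

```cpp
// Each grid-strided thread re-enters the gated body once per iteration, so
// across all threads the LCA row claim fires once per gate-passing *element*.
void gated_task_body(const int *cell_active, int n /* full flat count */,
                     int thread_id, int total_dispatched_threads) {
  for (int i = thread_id; i < n; i += total_dispatched_threads) {
    if (cell_active[i] > 0) {
      // float-LCA block: one row claim per passing iteration. Summed over
      // threads, this equals the gate-passing element count, which can
      // exceed total_dispatched_threads -- hence the reducer walks all n.
    }
  }
}
```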

6. CUDA / AMDGPU adstack-bearing-task dispatch cap

runtime/cuda/kernel_launcher.cpp and runtime/amdgpu/kernel_launcher.cpp define kAdStackMaxConcurrentThreads = 65536 (matching SPIR-V's generate_struct_for_kernel advisory) and apply two caps for tasks whose task.ad_stack.allocas is non-empty:

  • resolve_num_threads(...) clamps the heap-sizing thread count to kAdStackMaxConcurrentThreads so ensure_adstack_heap_{int,float} allocates rows for at most that many threads.
  • The per-task launch grid is capped to kAdStackMaxConcurrentThreads / task.block_dim blocks (floor division; ceiling division would dispatch up to block_dim - 1 threads past the heap row count, the fault pinned by test_adstack_gpu_dispatch_cap_uses_floor_division) before cuda_module->launch(...) / amdgpu_module->launch(...), so the kernel actually dispatches at most that many concurrent threads; the cap logic is sketched below. The runtime-side grid-strided loops cover the full element list / range with fewer dispatched threads at the cost of more iterations per thread.

Tasks without an adstack keep the codegen-emitted task.grid_dim = saturating_grid_dim for max throughput.
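A sketch of the cap, assuming names from the description above (the real logic lives in the two kernel_launcher.cpp files). Floor division is load-bearing here: with block_dim = 192, ceil(65536 / 192) = 342 blocks would dispatch 65664 threads, 128 past the 65536 heap rows:

```cpp
#include <algorithm>
#include <cstdint>

constexpr uint32_t kAdStackMaxConcurrentThreads = 65536;  // SPIR-V advisory parity

uint32_t adstack_capped_grid_dim(bool task_has_adstack,
                                 uint32_t saturating_grid_dim,
                                 uint32_t block_dim) {
  if (!task_has_adstack)
    return saturating_grid_dim;  // non-adstack tasks keep full throughput
  // Floor division: never dispatch a partial block past the heap row count.
  uint32_t max_blocks = std::max(kAdStackMaxConcurrentThreads / block_dim, 1u);
  return std::min(saturating_grid_dim, max_blocks);
}
```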

Per-backend coverage matrix

| Backend | Heap layout | Float-heap row index | Float-heap sizing | Dispatch cap |
| --- | --- | --- | --- | --- |
| CPU | split float / int | cpu_thread_id (eager) or claimed-row (lazy under bound_expr) | host-eval reducer count post-reducer (bound_expr) or worst case | n/a (worker pool already tight) |
| CUDA | split float / int | linear_thread_idx (eager) or claimed-row (lazy under bound_expr) | device-side reducer count post-reducer (bound_expr) or worst case | 65536 concurrent threads |
| AMDGPU | split float / int | same as CUDA | device-side reducer count post-reducer (bound_expr) or worst case | 65536 concurrent threads |
| Metal / Vulkan (SPIR-V) | split BufferType::AdStackHeapFloat + AdStackHeapInt | gl_GlobalInvocationID (eager) or claimed-row (lazy under bound_expr) | compute-shader reducer count post-reducer (bound_expr) or worst case | advisory_total_num_threads = 65536 |

Tests

| Test | Pins | Backends |
| --- | --- | --- |
| `test_adstack_static_bound_expr_ndarray_gate_grad_correct` | end-to-end ndarray-gated reverse mode at gated_fraction in {0.0, 0.05, 0.5, 1.0}; the 0.0 axis exercises the capacity-zero clamp guard | every adstack-supporting arch |
| `test_adstack_static_bound_expr_snode_gate_grad_correct` | SNode-backed gate (qd.field under qd.root.dense); the analyser captures the SNode descriptor triple and the device-side reducer / SPIR-V shader walks the root buffer directly | every adstack-supporting arch |
| `test_adstack_static_bound_expr_snode_gate_cpu_grad_correct` | LLVM CPU host-reducer SNode arm of publish_per_task_bound_count_cpu; reverting the SNode arm SIGSEGVs at compute.grad on macOS arm64 | qd.cpu |
| `test_adstack_static_bound_expr_ndarray_gate_debug_build_grad_correct` | debug-build alloca-site stack_init skip in the lazy float branch + the bootstrap-PUSH skip; parametrised on alloca-inside / alloca-outside the gate | every adstack-supporting arch, debug=True |
| `test_adstack_static_bound_expr_memory_savings_runs_clean` | every supported SizeExpr shape (int const / scalar field / ndarray shape / ndarray read / two-arg range) end-to-end through the bound-expr capture path; catches a regression that drops a specific bound shape from the analyser | every adstack-supporting arch |
| `test_adstack_static_bound_expr_primal_dependent_inner_recurrence_grad_correct` | primal-dependent reverse chain (v = x[i]^2 then n_iter recurrence) so any heap-aliasing regression appears as wrong per-i gradients | every adstack-supporting arch |
| `test_adstack_static_bound_expr_non_loop_var_index_falls_back_to_worst_case` | match_field_source rejection of non-LoopIndex gate indices (e.g. selector[i % K]); the rejected capture falls back to the worst-case sizing path | f64-capable archs |
| `test_adstack_static_bound_expr_device_sizer_per_kind_offsets_grad_correct` | LLVM CUDA / AMDGPU runtime_eval_adstack_size_expr per-kind out_offsets[i] write; reverting to the combined prefix sum aliases float / int slots and produces wrong-but-not-NaN gradients | qd.cuda, qd.amdgpu |
| `test_adstack_gpu_dispatch_cap_uses_floor_division` | LLVM CUDA / AMDGPU adstack-bearing-task dispatch-cap floor division; ceiling division (block_dim=192, n=65700, ad_stack_size=2048) over-dispatches by up to block_dim - 1 threads past the heap row count and faults as hipErrorIllegalAddress / cudaErrorIllegalAddress at compute.grad | qd.cuda, qd.amdgpu |
| `test_adstack_static_bound_expr_f64_gate_grad_correct` | SPIR-V bound-reducer f64 gating-field arm: launcher splits the f64 literal across (threshold_bits, threshold_bits_high) and the shader walks f64 cells with two-u32 PSB loads reassembled into a u64; reverting the arm decodes the threshold as 0.0 and over-counts gate-passing cells | f64-capable adstack archs |
| `test_adstack_static_bound_expr_resolve_length_walks_full_ndarray` | SPIR-V launcher's resolve_length walking the full ndarray flat product instead of capping at kMaxNumThreadsGridStrideLoop = 131072; pre-fix the reducer counts 0 gate-passing cells past the cap and the runtime sync raises the divergence-overflow signal | qd.metal, qd.vulkan |
| `test_adstack_overflow_raises` / `..._reset_after_catch` | end-to-end overflow signal handling on qd.sync() raising RuntimeError("[Aa]dstack overflow") and clearing the flag for the next launch | every adstack-supporting arch |

Side-effect audit

| Concern | Where checked | Verdict |
| --- | --- | --- |
| Offline cache key | analysis/offline_cache_util.cpp, analysis/gen_offline_cache_key.cpp | This PR adds no new IR fields that participate in correctness; the heap layout is a host-side launcher decision driven by per-launch SizeExpr eval. Cache key unchanged. |
| Stmt clone / serialization | QD_STMT_DEF_FIELDS(...) on AdStackAllocaStmt | Auto-covered. |
| IR equality (same_statements) / WholeKernelCSE | field_manager.equal() on AdStackAllocaStmt | Auto-covered. |
| Combined-heap field still on LLVMRuntime | runtime->adstack_heap_buffer, _size, _per_thread_stride | Field retained as a transitional fallback; not allocated by the launcher and not dereferenced by freshly-compiled kernels. Removing the field would invalidate offline-cache-loaded kernels that predate the split, so it stays for at least one cycle. |
| Dispatch cap on non-adstack tasks | runtime/cuda/kernel_launcher.cpp, runtime/amdgpu/kernel_launcher.cpp | Cap is gated on !task.ad_stack.allocas.empty(); tasks without an adstack keep saturating_grid_dim unchanged. |
| Debug-build alloca-site init for lazy float allocas | codegen_llvm.cpp::visit(AdStackAllocaStmt), visit(Block*) | stack_init for lazy float allocas is emitted at the LCA block (after the row claim), not at the offload root where row_id_var is still UINT32_MAX. Release build uses the per-stack count alloca and is unaffected. |
| Capacity = 0 underflow on the LCA-block clamp | codegen_llvm.cpp::emit_ad_stack_row_claim_llvm | Explicit select(capacity == 0, 0, capacity - 1) so the clamp upper bound stays in-bounds when the reducer reports zero matches; the launcher floors the heap allocation at one row precisely so the single-slot fallback is always backed by real storage. |
| Device sizer offset writes | runtime/llvm/runtime_module/runtime.cpp::runtime_eval_adstack_size_expr | out_offsets[i] is a per-kind byte offset within the float-only or int-only slice (mirrors the host-eval branch and the SPIR-V sizer's OpSelect). Earlier drafts wrote the combined prefix sum, which would alias float and int slots on any kernel mixing both kinds with at least one ndarray-leaf size_expr. |
| MSVC linker dependency | quadrants/codegen/llvm/CMakeLists.txt | llvm_codegen now explicitly links against spirv_codegen for the compile_snode_structs call the SNode-backed-gate descriptor resolver makes. Linux / Mac satisfied this transitively via the final shared-module link order; MSVC's linker requires the explicit dep. |
| SPIR-V f64 gating-field arm | codegen/spirv/adstack_bound_reducer_shader.{h,cpp}, runtime/gfx/adstack_bound_reducer_launch.cpp | AdStackBoundReducerParams carries field_dtype_is_double + threshold_bits_high; the launcher splits `*reinterpret_cast<const uint64_t *>(&literal_f64)` across the lo / hi u32 pair and the shader walks f64 cells via psb_load_u64_pair (two adjacent 4-byte u32 loads + register reassembly) into an f64 OpFOrd* compare arm; the split / reassembly is sketched after this table. Devices without spirv_has_float64 keep the f64 inner arm code-stripped at shader build time and the launcher's matched-task filter drops f64 captures back to dispatched-threads worst-case sizing. |
| LLVM CUDA / AMDGPU bound_count_length shape walk | runtime/cuda/kernel_launcher.cpp, runtime/amdgpu/kernel_launcher.cpp | The shape walk uses ctx.get_struct_arg_host<int32_t>(indices), *not* get_struct_arg. launch_llvm_kernel swaps ctx_->arg_buffer to a device pointer (cuda:269-274 / amdgpu:230-235) before launch_offloaded_tasks runs, so a plain get_struct_arg would dereference device memory from the host (SIGSEGV / CUDA_ERROR_ILLEGAL_ADDRESS on drivers without HMM, garbage flat_len on HMM-capable setups). The host backing buffer arg_buffer_ stays host-resident across the swap. |
| Cap-missing devices: AdStackBoundRowCapacity buffer | runtime/gfx/adstack_bound_reducer_launch.cpp::dispatch_adstack_bound_reducers | Capacity-buffer alloc + UINT32_MAX fill is hoisted *above* the PSB / Int64 capability gates so cap-missing devices (pre-Apple7 Metal, Vulkan-1.1 mobile drivers without shaderInt64 / bufferDeviceAddress) still receive inert defaults the codegen clamp leaves alone. Without the hoist the bind path routes kDeviceNullAllocation to the descriptor slot, robustBufferAccess returns 0, the divergence-overflow OpAtomicUMax fires unconditionally and every adstack-bearing kernel hard-errors at sync. |
| last_observed_rows_per_task_ heap-bind tertiary fallback | runtime/gfx/runtime.cpp heap-bind path | Tasks the reducer did not pre-count (no captured bound_expr, compound gate predicate, capability-missing device) size from ceil(last_observed * 1.5) instead of the dispatched_threads worst case when a prior synchronize() snapshot recorded the LCA claim count for the same task name. The 1.5x cushion absorbs run-to-run variance without forcing an amortised-doubling reallocation on every modest workload uplift. The int heap stays at the dispatched-threads worst case because int allocas use the eager linear_tid * stride_int mapping. |
| snode_resolver tree-id scan bound | codegen/llvm/codegen_llvm.cpp::init_offloaded_task_function | The scan is bounded by prog->get_snode_tree_size() and continues past nullptr slots (recycled tree-id holes from free_snode_tree_ids_). Program::get_snode_root is a raw snode_trees_[id]->root() with no bounds check, so an unbounded loop is std::vector::operator[] UB on stale-IR / cross-program / offline-cache-restore paths. |
| SPIR-V resolve_length walks full ndarray | runtime/gfx/adstack_bound_reducer_launch.cpp::resolve_length_ndarray | Walks the gating ndarray's full flat element product through host_ctx.get_struct_arg<int32_t>(indices) instead of capping at advisory_total_num_threads. Pre-fix, kernels with N > 131072 (the range_for cap) under-counted gate-passing cells past the cap; the float adstack heap was sized to the truncated count and the codegen-emitted clamp aliased every later gated iteration into the smaller row range. |
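For the f64 gating-field arm audited above, the split / reassembly is standard bit-twiddling; a hedged C++ sketch (the launcher reportedly reinterprets the f64 literal's bits and the shader's psb_load_u64_pair performs the two-u32 load; the function names here are illustrative):

```cpp
#include <cstdint>
#include <cstring>

// Host / launcher side: split the f64 gate literal into the
// (threshold_bits, threshold_bits_high) pair for the reducer params struct.
void split_f64_threshold(double literal, uint32_t &lo, uint32_t &hi) {
  uint64_t bits;
  std::memcpy(&bits, &literal, sizeof bits);  // defined-behaviour bit copy
  lo = static_cast<uint32_t>(bits);
  hi = static_cast<uint32_t>(bits >> 32);
}

// Shader side, in C++ form: two adjacent u32 loads reassembled into a u64,
// then reinterpreted as f64 for the OpFOrd* compare.
double reassemble_f64_threshold(uint32_t lo, uint32_t hi) {
  uint64_t bits = (static_cast<uint64_t>(hi) << 32) | lo;
  double threshold;
  std::memcpy(&threshold, &bits, sizeof threshold);
  return threshold;
}
```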

@hughperkins
Collaborator

7x 🔥

@duburcqa duburcqa changed the title [AutoDiff] Cut reverse-mode adstack VRAM ~7x on CPU / CUDA / AMDGPU [AutoDiff] Cut reverse-mode adstack memory usage on all backends Apr 30, 2026
@duburcqa duburcqa changed the title [AutoDiff] Cut reverse-mode adstack memory usage on all backends [AutoDiff] Cut reverse-mode adstack memory usage ~7x on all backends Apr 30, 2026
@duburcqa duburcqa force-pushed the duburcqa/sparse_adstack_heap branch from 877298a to 225d087 Compare April 30, 2026 18:06
@duburcqa
Contributor Author

@claude review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 877298a8f9


Comment thread quadrants/runtime/cuda/kernel_launcher.cpp Outdated
Comment thread quadrants/codegen/llvm/codegen_llvm.cpp Outdated
Comment thread quadrants/codegen/llvm/codegen_llvm.cpp
Comment thread quadrants/runtime/cpu/kernel_launcher.cpp Outdated
Comment thread quadrants/runtime/llvm/runtime_module/runtime.cpp
Comment thread quadrants/codegen/llvm/codegen_llvm.cpp
Comment thread quadrants/codegen/spirv/spirv_codegen.cpp
Comment thread quadrants/runtime/cpu/kernel_launcher.cpp Outdated
Comment thread quadrants/runtime/cpu/kernel_launcher.cpp Outdated
Comment thread quadrants/transforms/static_adstack_analysis.cpp
Comment thread quadrants/runtime/gfx/adstack_bound_reducer_launch.cpp Outdated
Comment thread quadrants/runtime/cuda/kernel_launcher.cpp Outdated
Comment thread quadrants/runtime/gfx/adstack_bound_reducer_launch.cpp
Comment thread quadrants/codegen/llvm/codegen_llvm.cpp Outdated
Comment thread quadrants/runtime/gfx/runtime.cpp Outdated
Comment thread quadrants/runtime/llvm/llvm_runtime_executor.cpp
Comment thread quadrants/runtime/gfx/adstack_bound_reducer_launch.cpp
Comment thread quadrants/runtime/gfx/runtime.cpp Outdated
Comment thread quadrants/codegen/llvm/codegen_llvm.cpp Outdated
Comment thread quadrants/codegen/spirv/adstack_bound_reducer_shader.cpp Outdated
Comment thread quadrants/runtime/cuda/kernel_launcher.cpp Outdated
@github-actions

Coverage Report (9c1504d5d)

| File | Coverage | Missing |
| --- | --- | --- |
| 🔴 python/quadrants/_tensor_wrapper.py | 0% | 208-209 |
| 🔴 python/quadrants/lang/_ndarray.py | 33% | 90,106 |
| 🔴 python/quadrants/lang/field.py | 22% | 93-97,513,530 |
| 🔴 python/quadrants/lang/matrix.py | 50% | 1293 |
| 🟢 tests/python/test_adstack.py | 90% | 3493-3495,3497,3499-3500,3502-3509,3511-3514,3516-3518,3520-3524,3526-3527 |

Diff coverage: 86% · Overall: 74% · 287 lines, 40 missing

Full annotated report

duburcqa added 16 commits May 1, 2026 07:26
Comment thread quadrants/codegen/llvm/codegen_llvm.cpp
Comment thread docs/source/user_guide/debug.md Outdated
Comment thread quadrants/analysis/offline_cache_util.cpp
@duburcqa duburcqa force-pushed the duburcqa/sparse_adstack_heap branch from b973d26 to 42d01fc Compare May 1, 2026 15:52
Comment thread docs/source/user_guide/autodiff.md Outdated
```diff
-Quadrants implements autodiff at compile time: when `.grad()` is requested, the compiler emits a companion kernel that runs on the same backend as the forward one and writes gradients into the primal fields' `.grad` companions. There is no Python-side tape, no per-op dispatch overhead, and no dependency on an external AD framework. Forward mode and reverse mode are available on every backend Quadrants targets: x64 / arm64 CPU, CUDA, AMDGPU, Metal, and Vulkan. Reverse-mode AD through dynamic loops (described further down) is currently behind an opt-in `ad_stack_experimental_enabled=True` flag.
+Quadrants implements autodiff at compile time: when `.grad()` is requested, the compiler emits a companion kernel that runs on the same backend as the forward one and writes gradients into the primal fields' `.grad` companions. There is no Python-side tape, no per-op dispatch overhead, and no dependency on an external AD framework. Forward mode and reverse mode are available on every backend Quadrants targets: x64 / arm64 CPU, CUDA, AMDGPU, Metal, and Vulkan.
+
+**Recommendation.** Reverse-mode AD through dynamic loops (described further down) is currently behind an opt-in `ad_stack_experimental_enabled=True` flag at `qd.init`. We strongly recommend systematically enabling this flag as it is required for any reverse-mode kernel with a runtime-bounded loop carrying a non-linear primal, and free for every other kernel. See [the cost breakdown](./init_options.md#ad_stack_experimental_enabled) for details.
```
Collaborator


"If you are using autodiff at all, we recommend"

@duburcqa duburcqa force-pushed the duburcqa/sparse_adstack_heap branch 2 times, most recently from 47fba3a to 4ae4b0d Compare May 1, 2026 15:54
…strongly recommend' to 'if you are using autodiff at all, we recommend'
@duburcqa duburcqa force-pushed the duburcqa/sparse_adstack_heap branch from 4ae4b0d to 24ed143 Compare May 1, 2026 15:56
….md label drift, in-LCA-block stack_init defense
Comment thread quadrants/runtime/gfx/runtime.cpp
Comment thread quadrants/codegen/llvm/codegen_llvm.h Outdated
Comment thread quadrants/runtime/cuda/kernel_launcher.cpp
@github-actions

github-actions Bot commented May 1, 2026

Coverage Report (2c8d275af)

| File | Coverage | Missing |
| --- | --- | --- |
| 🟢 tests/python/test_adstack.py | 94% | 3750-3752,3754,3756-3757,3759-3766,3768-3771,3773-3775,3777-3781,3783-3784 |

Diff coverage: 94% · Overall: 74% · 509 lines, 28 missing

Full annotated report

@duburcqa duburcqa force-pushed the duburcqa/sparse_adstack_heap branch from 528f9f4 to c5ca25d Compare May 1, 2026 17:28
…IR-V dispatch at 65536, pin compound-index tests
@duburcqa duburcqa force-pushed the duburcqa/sparse_adstack_heap branch from c5ca25d to 1c0011d Compare May 1, 2026 17:36
Comment thread quadrants/transforms/static_adstack_analysis.h Outdated
Comment thread quadrants/codegen/spirv/spirv_codegen.cpp Outdated
Comment thread quadrants/runtime/gfx/adstack_bound_reducer_launch.cpp
Comment thread quadrants/runtime/llvm/llvm_adstack_lazy_claim.cpp Outdated
@github-actions

github-actions Bot commented May 1, 2026

Coverage Report (1c0011d70)

| File | Coverage | Missing |
| --- | --- | --- |
| 🟢 tests/python/test_adstack.py | 99% | 3344,3809-3814 |

Diff coverage: 99% · Overall: 74% · 546 lines, 7 missing

Full annotated report

@hughperkins
Collaborator

(totally orthogonal to your own PR, I feel like my kernel coverage is somehow not doing coverage on non-kernels 🤔 That's a bug I should fix. It is supposed to.)

@hughperkins
Collaborator

The line-wrap CI flags seem valid, I think? E.g.:

Screenshot 2026-05-01 at 15 21 59

@duburcqa duburcqa force-pushed the duburcqa/sparse_adstack_heap branch from a183fab to e951423 Compare May 1, 2026 20:26
@hughperkins
Collaborator

checklist:

  • doc updated
  • Genesis benchmarks neutral
  • Genesis unit tests passing

=> ok to merge

@github-actions

github-actions Bot commented May 1, 2026

Coverage Report (845bd82c6)

| File | Coverage | Missing |
| --- | --- | --- |
| 🟢 tests/python/test_adstack.py | 98% | 3361-3366,3812-3817 |

Diff coverage: 98% · Overall: 74% · 545 lines, 12 missing

Full annotated report

@duburcqa duburcqa merged commit cda2944 into main May 1, 2026
54 checks passed
@duburcqa duburcqa deleted the duburcqa/sparse_adstack_heap branch May 1, 2026 22:09
