
[AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU)#537

Merged
duburcqa merged 6 commits into main from duburcqa/split_adstack_llvm_heap on Apr 24, 2026
Conversation

@duburcqa (Contributor)

Heap-backed adstack on LLVM backends (CPU / CUDA / AMDGPU)

Replaces the per-thread worker-stack allocation of AdStackAllocaStmt with a host-grown shared heap, lifting the 256 KB CPU thread-stack budget that the prior create_entry_block_alloca path imposed on deep reverse-mode kernels. The SPIR-V side is handled in the following PR.

TL;DR

Prior behaviour on CPU / CUDA / AMDGPU: every AdStackAllocaStmt lowered to a function-scope alloca at the task's entry block, so every adstack lived on the LLVM stack frame (= worker-thread stack on CPU, per-thread local memory on GPU). A kernel with many loop-carried values at default_ad_stack_size=256 crossed the worker-thread limit and silently corrupted adjacent stack memory; the previous PR in the stack added a 256 KB codegen-time guard that hard-aborted those kernels.

This PR moves the storage off the stack:

  • Codegen: a pre-scan of each offloaded task body computes per-task {ad_stack_offsets_, ad_stack_per_thread_stride_}, and visit(AdStackAllocaStmt) emits base = runtime->adstack_heap_buffer + linear_thread_idx * stride + offset instead of an alloca. Base is loaded once in entry_block and reused.
  • Runtime: LlvmRuntimeExecutor::ensure_adstack_heap(needed_bytes) grows the per-runtime slab via amortised doubling, publishes the new pointer/size into runtime->{adstack_heap_buffer, adstack_heap_size} by caching the two device field-pointer addresses on first grow and writing through them on every subsequent grow.
  • Launchers: CPU / CUDA / AMDGPU kernel launchers call ensure_adstack_heap(per_thread_stride * num_threads) before each task launch. Dynamic-bound range-for tasks resolve num_threads by reading begin / end from runtime->temporaries via a host-side DtoH memcpy.
  • CUDA graphs: rejected at launch when any task has per_thread_stride > 0, because graph baking precludes the host-side ensure_adstack_heap step between dispatches.
  • default_ad_stack_size exposed via qd.init(); raised from 32 → 256 now that the per-thread on-chip / worker-stack budget no longer caps it.
  • The codegen budget guard from the previous PR is removed; the stack frame no longer carries adstack storage, so the 256 KB ceiling is obsolete.

Nothing changes for kernels that don't enable the adstack extension.

Why

Prior to this PR, a kernel like a reverse-mode articulated-body dynamics step in Genesis hit the 256 KB CPU-stack budget at modest capacities (4 loop-carried f64 variables × 4096 entries × 16 bytes each already crosses it). The two alternatives — ship with default_ad_stack_size capped at a value small enough to fit on every worker stack, or ask users to lower ad_stack_size per-kernel — either regress correctness on large kernels or force tuning noise on the user. Moving the storage off-stack removes the constraint entirely: per-thread slice size is bounded only by num_threads * per_thread_stride and the driver's allocator.
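
Spelled out, with 16 bytes per f64 slot (an 8-byte primal plus an 8-byte adjoint on the LLVM backends):

$$4 \times 4096 \times 16\,\mathrm{B} = 262{,}144\,\mathrm{B} = 256\,\mathrm{KiB},$$

which is the entire worker-thread budget before counting any other stack usage.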

Changes

Codegen (quadrants/codegen/llvm/codegen_llvm.{h,cpp}, llvm_compiled_data.h)

TaskCodeGenLLVM grows three new per-task fields:

  • ad_stack_per_thread_stride_ — sum of AdStackAllocaStmt::size_in_bytes() (aligned up to 8) for every adstack in the task.
  • ad_stack_offsets_ — map from each alloca stmt to its offset within the per-thread slice.
  • ad_stack_heap_base_llvm_ — cached SSA value of the heap base pointer, emitted once in entry_block.

init_offloaded_task_function pre-scans the task body before any codegen runs and populates the first two, so that later sibling allocas never shift an earlier alloca's offset out from under a cached SSA pointer.
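
A minimal sketch of that pre-scan, assuming a hypothetical `for_each_adstack_alloca` visitor helper (the real pass walks the task body with the existing IR visitor machinery):

```cpp
// Sketch only: fix the per-task adstack layout before any codegen runs.
void TaskCodeGenLLVM::prescan_ad_stacks(OffloadedStmt *task) {
  ad_stack_per_thread_stride_ = 0;
  ad_stack_offsets_.clear();
  for_each_adstack_alloca(task, [&](AdStackAllocaStmt *stack) {
    // Record this stack's offset inside the per-thread slice now, so a
    // later sibling can never shift it after an SSA pointer was cached.
    ad_stack_offsets_[stack] = ad_stack_per_thread_stride_;
    std::size_t bytes = stack->size_in_bytes();
    ad_stack_per_thread_stride_ += (bytes + 7) & ~std::size_t{7};  // align to 8
  });
}
```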

visit(AdStackAllocaStmt) now emits:

```
base   = LLVMRuntime_get_adstack_heap_buffer(runtime)  // cached in entry_block
tid_64 = zext(linear_thread_idx(context))              // i32 → i64
slice  = tid_64 * stride                               // widened mul to avoid i32 overflow
ptr    = base + slice + offset
```

linear_thread_idx is the arch-appropriate invocation id (RuntimeContext::cpu_thread_id on CPU; block_idx * block_dim + thread_idx on CUDA / AMDGPU), matching how rand_states is indexed.
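
In llvm::IRBuilder terms the emitted sequence would look roughly like this (variable names are illustrative, not the actual code):

```cpp
// Sketch of the per-AdStackAllocaStmt address computation; `base` is the
// heap pointer loaded once in entry_block.
llvm::Value *tid32 = emit_linear_thread_idx();  // arch-specific invocation id
llvm::Value *tid64 = builder.CreateZExt(tid32, builder.getInt64Ty());
llvm::Value *slice = builder.CreateMul(tid64, builder.getInt64(stride));
llvm::Value *idx   = builder.CreateAdd(slice, builder.getInt64(offset));
// An i8 GEP is a raw byte offset from the heap base pointer.
llvm::Value *ptr   = builder.CreateGEP(builder.getInt8Ty(), base, idx);
```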

The old 256 KB function-scope budget guard (introduced in the previous PR) is deleted; its ad_stack_fn_scope_bytes_ accumulator is gone too. Heap-backed storage makes the ceiling irrelevant.

OffloadedTask gains an AdStackSizingInfo ad_stack sub-struct that propagates sizing to the host launcher: per_thread_stride, static_num_threads, dynamic_gpu_range_for, plus const values and gtmps byte offsets for range-for begin / end.
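
The sub-struct presumably looks something like the following (field names are taken from the description above; the exact types are a guess):

```cpp
// Hypothetical shape of OffloadedTask::ad_stack.
struct AdStackSizingInfo {
  std::size_t per_thread_stride = 0;   // bytes per thread; 0 = no adstacks
  std::size_t static_num_threads = 0;  // resolved at compile time when known
  bool dynamic_gpu_range_for = false;  // bounds only known at launch time
  // For dynamic range-for tasks: either a baked constant or a byte offset
  // into runtime->temporaries, for each of begin and end.
  int begin_const_value = 0;
  int end_const_value = 0;
  std::size_t begin_offset_bytes = 0;
  std::size_t end_offset_bytes = 0;
};
```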

Per-arch codegen tweaks

  • codegen_cpu.cpp — fills current_task->ad_stack with the pre-scanned stride and sets static_num_threads = cpu_thread_id_range (CPU thread count is known at compile time).
  • codegen_cuda.cpp — fills current_task->ad_stack.static_num_threads = grid_dim * block_dim for const-bound tasks, and marks dynamic_gpu_range_for = true + records begin_offset_bytes / end_offset_bytes / begin_const_value / end_const_value for dynamic range-for tasks so the launcher can resolve the actual iteration count at launch time.
  • codegen_amdgpu.cpp — same as CUDA.

Runtime (llvm_runtime_executor.{h,cpp}, runtime.cpp)

LLVMRuntime gains two new fields: Ptr adstack_heap_buffer = nullptr; u64 adstack_heap_size = 0;. These are read by every adstack-backed task on the device side; the host writes to them through the cached field-pointer addresses.

LlvmRuntimeExecutor::ensure_adstack_heap(needed_bytes) proceeds in six steps (condensed into code after the list):

  1. No-op if needed_bytes == 0 || needed_bytes <= adstack_heap_size_.
  2. Otherwise new_size = max(needed_bytes, 2 * adstack_heap_size_) (amortised doubling).
  3. Allocates through the per-arch driver (llvm_device()->allocate_memory), wraps in a DeviceAllocationGuard.
  4. On first grow, calls a one-shot runtime query to fetch the device addresses of runtime->adstack_heap_buffer and runtime->adstack_heap_size, caches them.
  5. Publishes via memcpy_host_to_device (CUDA / AMDGPU) or plain pointer stores (CPU) against the cached addresses — no per-grow kernel launch.
  6. Releases the previous DeviceAllocationGuard via move-assignment. Safety of the release (see the detailed block comment in the .cpp and the matching field comment in the .h): CPU uses std::free (trivially safe); CUDA cuMemFree_v2 synchronises before returning; AMDGPU dealloc_memory pools through CachingAllocator::release without sync, and cross-launch safety on AMDGPU is provided by the synchronous hipFree(context_pointer) at the tail of amdgpu::KernelLauncher::launch_llvm_kernel (the latent-fix in [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) #536).
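
In code, the grow path might read as follows; `allocate_guarded`, `cache_adstack_field_addresses`, and `publish_adstack_fields` are stand-ins for the real calls, with arch dispatch and error handling elided:

```cpp
// Sketch of LlvmRuntimeExecutor::ensure_adstack_heap.
void LlvmRuntimeExecutor::ensure_adstack_heap(std::size_t needed_bytes) {
  if (needed_bytes == 0 || needed_bytes <= adstack_heap_size_)
    return;                                                             // step 1
  std::size_t new_size = std::max(needed_bytes, 2 * adstack_heap_size_);  // step 2

  DeviceAllocationGuard new_slab = allocate_guarded(new_size);          // step 3

  if (!adstack_field_addrs_cached_) {                                   // step 4
    cache_adstack_field_addresses();  // one-shot device-address query
    adstack_field_addrs_cached_ = true;
  }
  // Step 5: plain stores on CPU, memcpy_host_to_device on CUDA / AMDGPU;
  // no per-grow kernel launch.
  publish_adstack_fields(new_slab.ptr(), new_size);

  // Step 6: move-assign; the old guard's destructor frees the old slab.
  adstack_heap_guard_ = std::move(new_slab);
  adstack_heap_size_ = new_size;
}
```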

get_runtime_temporaries_device_ptr() — cached lookup of runtime->temporaries, used by the GPU launchers to read back dynamic range-for bounds.

Per-arch launchers

  • runtime/cpu/kernel_launcher.{h,cpp} — Context gains a parallel ad_stack_needed_bytes vector, precomputed at register time (CPU sizing is static). launch_offloaded_tasks calls ensure_adstack_heap per task.
  • runtime/cuda/kernel_launcher.cpp — adds resolve_num_threads(task), which DtoH-memcpys begin / end from runtime->temporaries for dynamic range-for tasks; calls ensure_adstack_heap per task (see the sketch after this list).
  • runtime/amdgpu/kernel_launcher.cpp — same as CUDA.
  • runtime/cuda/graph_manager.cpp — hard-errors graph=True on kernels where any task has per_thread_stride > 0. Graph baking precludes host-side intervention between dispatches.
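
A sketch of the launch-side sizing on CUDA / AMDGPU; the helper names and the "offset zero means constant" convention are assumptions, not the actual code:

```cpp
// Illustrative launch-time thread-count resolution for one offloaded task.
std::size_t resolve_num_threads(const OffloadedTask &task,
                                LlvmRuntimeExecutor *exec) {
  const AdStackSizingInfo &ad = task.ad_stack;
  if (!ad.dynamic_gpu_range_for)
    return ad.static_num_threads;  // const-bound: grid_dim * block_dim

  // Dynamic range-for: read begin / end back from runtime->temporaries
  // unless the bound was a compile-time constant.
  int begin = ad.begin_const_value, end = ad.end_const_value;
  char *tmps = exec->get_runtime_temporaries_device_ptr();
  if (ad.begin_offset_bytes != 0)
    memcpy_device_to_host(&begin, tmps + ad.begin_offset_bytes, sizeof(begin));
  if (ad.end_offset_bytes != 0)
    memcpy_device_to_host(&end, tmps + ad.end_offset_bytes, sizeof(end));
  return end > begin ? std::size_t(end - begin) : 0;
}

// Before each task launch:
//   exec->ensure_adstack_heap(task.ad_stack.per_thread_stride *
//                             resolve_num_threads(task, exec));
```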

Capacity knob (compile_config.h, python/export_lang.cpp)

  • default_ad_stack_size raised from 32 to 256.
  • Exposed as a qd.init() kwarg. The comment block is rewritten to reflect the new heap-backing reality (see the sketch after this list).
  • ad_stack_size (per-stack explicit capacity) is unchanged.
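
On the C++ side the knob change presumably amounts to flipping the default (a sketch; the real comment block is longer):

```cpp
// compile_config.h (sketch)
int default_ad_stack_size = 256;  // was 32; fallback slot count for loops
                                  // whose trip count cannot be proven
int ad_stack_size = 0;            // 0 = adaptive; nonzero forces every
                                  // adstack to exactly this many slots
```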

Docs (docs/source/user_guide/autodiff.md)

Drops the "SPIR-V on-chip cap" limitation from the known-limitations list (that was about the prior Function-scope SPIR-V path; with the SPIR-V heap landing in the next PR, it's gone too). Adds a "Tuning the capacity" section explaining default_ad_stack_size vs ad_stack_size and the K+2-pushes-per-iteration rule for picking N.

Tests (tests/python/test_adstack.py)

Heap-specific additions:

  • test_adstack_heap_grow_on_demand — two launches at increasing capacity verify that the amortised-doubling grow path fires and that the second launch reuses the bigger slab.
  • test_adstack_heap_backed_exceeds_old_threadstack_budget — a kernel whose per-thread adstack bytes exceed the pre-PR 256 KB ceiling now compiles and runs correctly.
  • test_adstack_cuda_graph_rejected_with_adstack — graph=True on an adstack kernel raises.

Side-effect audit

| Concern | Where checked | Verdict |
| --- | --- | --- |
| Offline cache key | AdStackSizingInfo fields serialised via the existing OffloadedTask key hash | Auto-covered; kernels without adstack get zero stride and hash identically |
| Stmt clone / serialization | No new Stmt fields (all sizing lives in codegen / OffloadedTask) | N/A |
| CUDA graphs | Hard rejection with a clear error message when any task has per_thread_stride > 0 | Fails loudly, not silently |
| AMDGPU release lifecycle | Relies on #536's synchronous hipFree(context_pointer) tail | Cross-launch invariant spelled out in both .cpp and .h comments |
| Non-adstack kernels | per_thread_stride == 0 path short-circuits the launcher-side ensure_adstack_heap | Zero-cost for kernels that don't enable the extension |

Stack

Split 2/3 of the former "heap-backed adstack" PR. Based on #536 (latent fixes). Followed by #493 (SPIR-V heap).

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a619832bc4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread quadrants/program/compile_config.h Outdated
Comment thread quadrants/runtime/cuda/kernel_launcher.cpp Outdated

@claude (Bot) left a comment

Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 quadrants/program/compile_config.h:54-68 — The comment added to compile_config.h falsely claims SPIR-V now uses heap-backed StorageBuffers, but SPIR-V heap-backing is explicitly deferred to PR #493; SPIR-V still allocates adstacks as Function-scope per-thread on-chip memory bounded by the driver shader-compiler limit. The 32 to 256 default increase therefore multiplies per-thread private memory 8x on Metal/Vulkan, which can cause pipeline-creation failures for kernels that previously compiled correctly. The companion docs change compounds this by removing the only warning about the SPIR-V on-chip cap and replacing it with advice to bump default_ad_stack_size, which on SPIR-V causes shader-compiler rejection rather than heap growth.

    Extended reasoning...

    What is wrong

    compile_config.h lines 57-64 add a comment justifying the 32 to 256 default increase: "Both backends now heap-back the primal/adjoint slots: SPIR-V uses per-dispatch StorageBuffers (BufferType::AdStackHeapFloat + AdStackHeapInt, sliced by invocation)". This claim is factually incorrect for the current PR. The PR description itself explicitly states "SPIR-V side is the following PR", meaning PR #493 has not landed yet. After this PR merges, SPIR-V (Metal and Vulkan) still allocates every AdStackAllocaStmt using ir_->alloca_variable(arr_type) with spv::StorageClassFunction - Function-scope per-thread on-chip private memory - exactly as before.

    Concrete code path

    spirv_codegen.cpp:2221-2223 (unchanged in this diff) visits AdStackAllocaStmt and calls alloca_variable() for count_var, primal_arr, and adjoint_arr. There is no BufferType::AdStackHeapFloat or BufferType::AdStackHeapInt anywhere in the SPIR-V codegen, and no SPIR-V files appear in the list of changed files. The 8x raise is therefore applied to the SPIR-V path unconditionally by ControlFlowGraph::determine_ad_stack_size() in transforms/determine_ad_stack_size.cpp, which is arch-agnostic and falls back to default_ad_stack_size for any stack whose worst-case trip count cannot be statically proven.

    Why existing code does not prevent it

    The SPIR-V codegen has no guard that caps AdStackAllocaStmt::max_size against a per-thread on-chip budget - that responsibility fell on the deliberately-conservative 32-slot default. Removing that conservatism by raising the default 8x while the SPIR-V heap path is absent eliminates the only protection.

    Impact

    For a kernel with 4 f32 loop-carried variables and a dynamic loop whose trip count the compiler cannot prove:

    • Old default (32): per-thread Function-scope demand = 4 * (8 + 32 * 2 * 4) bytes ~= 1 KB
    • New default (256): per-thread Function-scope demand = 4 * (8 + 256 * 2 * 4) bytes ~= 8 KB

    Apple Metal's MSL shader compiler rejects pipelines whose per-thread private-variable footprint exceeds its budget (typically 4-16 KB on consumer hardware), and the existing test test_adstack_shader_compile_failure_raises with ad_stack_size=65536 confirms this failure mode exists. A kernel that compiled and ran correctly at the 32-slot default may now fail at kernel-launch time with a pipeline-creation error whose message does not mention adstack size.

    Step-by-step proof

    1. User calls qd.init(arch=qd.metal, ad_stack_experimental_enabled=True) - default_ad_stack_size is now 256 after this PR.
    2. A kernel has 4 f32 loop-carried variables under a range(n[None]) whose bound comes from a field.
    3. determine_ad_stack_size cannot prove the worst-case trip count, sets max_size = 256 for each adstack.
    4. SPIR-V codegen emits 4 x OpVariable StorageClassFunction arrays of 8x256 bytes each = 8192 bytes of per-thread on-chip private memory.
    5. Metal MSL compiler rejects the pipeline; compute.grad() raises RuntimeError: Failed to create pipeline with no hint that adstack sizing is the cause.
    6. The user consults the updated docs, which now say to bump default_ad_stack_size when they see a SPIR-V overflow. Doing so makes the pipeline failure worse, not better.

    Docs regression

    The removed Known Limitations bullet was the only explicit warning that on SPIR-V backends the adstack is allocated as per-thread on-chip memory capped by the driver. The new Tuning the capacity section and the Memory cost statement that "The buffer grows on demand to match the largest size any launch has needed so far" are presented without qualification and are false for SPIR-V - on SPIR-V the capacity is baked into the compiled shader at compile time, there is no grow-on-demand behavior, and bumping the value risks compile-time pipeline rejection rather than graceful heap growth.

    Fix

    Keep default_ad_stack_size at 32 until PR #493 lands. Update the comment in compile_config.h to remove the false SPIR-V StorageBuffer claim. Restore the Known Limitations bullet about the SPIR-V on-chip cap so Metal/Vulkan users receive the correct guidance.

Comment thread docs/source/user_guide/autodiff.md Outdated

**Tuning the capacity.** Two `qd.init()` knobs control adstack sizing:

- `default_ad_stack_size=N` (default `256`): the fallback capacity for loops whose trip count the compiler cannot prove statically. Every adstack whose max_size was not deducible shares this value. Prefer tuning this knob, since it only affects the branch where the compiler needed to guess.
Collaborator

let's have the units please
default_ad_stack_size_mb

Collaborator

if this is in units, or int32s, or something, then maybe something like `default_ad_stack_size_count` or `default_ad_stack_size_units` or `default_ad_stack_size_i32s`?

Collaborator

what happens if there is an i64 in the loop?

Comment thread docs/source/user_guide/autodiff.md Outdated
**Tuning the capacity.** Two `qd.init()` knobs control adstack sizing:

- `default_ad_stack_size=N` (default `256`): the fallback capacity for loops whose trip count the compiler cannot prove statically. Every adstack whose max_size was not deducible shares this value. Prefer tuning this knob, since it only affects the branch where the compiler needed to guess.
- `ad_stack_size=N` (default `0 = adaptive`): a hard override that forces every adstack in the program to exactly `N` slots, regardless of what the compiler proved. Prefer this knob only when a targeted experiment needs uniform sizing (e.g. stress-testing the runtime heap path).
Collaborator

units ad_stack_size_mb

Comment thread docs/source/user_guide/autodiff.md Outdated
- `default_ad_stack_size=N` (default `256`): the fallback capacity for loops whose trip count the compiler cannot prove statically. Every adstack whose max_size was not deducible shares this value. Prefer tuning this knob, since it only affects the branch where the compiler needed to guess.
- `ad_stack_size=N` (default `0 = adaptive`): a hard override that forces every adstack in the program to exactly `N` slots, regardless of what the compiler proved. Prefer this knob only when a targeted experiment needs uniform sizing (e.g. stress-testing the runtime heap path).

**How to pick `default_ad_stack_size`.** The reverse pass of a `K`-iteration dynamic loop emits `K + 2` pushes per adstack (the trip count plus two setup pushes: one for the initial adjoint slot and one for the primal's starting value). Size the default at the flat trip count of the deepest unprovable dynamic loop in the program, plus that headroom. Common shapes:
Collaborator

default_ad_stack_size_mb

Comment thread docs/source/user_guide/autodiff.md Outdated

**How to pick `default_ad_stack_size`.** The reverse pass of a `K`-iteration dynamic loop emits `K + 2` pushes per adstack (the trip count plus two setup pushes: one for the initial adjoint slot and one for the primal's starting value). Size the default at the flat trip count of the deepest unprovable dynamic loop in the program, plus that headroom. Common shapes:

- A single `qd.ndrange(n, m)` whose bounds come from a field: worst case is `n_max * m_max` iterations. Pick `N >= n_max * m_max + 2`. At `max_n_dofs_per_entity = 16`, 16 x 16 = 256 hits the default exactly.
Collaborator

don't we need to multiply by 4?

Collaborator

by the way, why + 2?

Comment thread docs/source/user_guide/autodiff.md Outdated
**How to pick `default_ad_stack_size`.** The reverse pass of a `K`-iteration dynamic loop emits `K + 2` pushes per adstack (the trip count plus two setup pushes: one for the initial adjoint slot and one for the primal's starting value). Size the default at the flat trip count of the deepest unprovable dynamic loop in the program, plus that headroom. Common shapes:

- A single `qd.ndrange(n, m)` whose bounds come from a field: worst case is `n_max * m_max` iterations. Pick `N >= n_max * m_max + 2`. At `max_n_dofs_per_entity = 16`, 16 x 16 = 256 hits the default exactly.
- Nested `for i in range(a[None]): for j in range(b[None]):`: worst case is `a_max * b_max`, same rule.
Collaborator

x 4?

Comment thread docs/source/user_guide/autodiff.md Outdated

- A single `qd.ndrange(n, m)` whose bounds come from a field: worst case is `n_max * m_max` iterations. Pick `N >= n_max * m_max + 2`. At `max_n_dofs_per_entity = 16`, 16 x 16 = 256 hits the default exactly.
- Nested `for i in range(a[None]): for j in range(b[None]):`: worst case is `a_max * b_max`, same rule.
- A single dynamic `for i in range(a[None])`: `N >= a_max + 2`.
Collaborator

x 4?

Comment thread docs/source/user_guide/autodiff.md Outdated
- Nested `for i in range(a[None]): for j in range(b[None]):`: worst case is `a_max * b_max`, same rule.
- A single dynamic `for i in range(a[None])`: `N >= a_max + 2`.

**Memory cost.** The adstack pipeline allocates one small scratch buffer per loop-carried variable that the reverse pass has to remember. For example, a kernel whose dynamic loop reads and updates one float accumulator needs 1 adstack; a kernel whose loop updates four different floats needs 4. Integer counters and boolean branch flags used by the reverse pass also count (typically one each per dynamic `if` or nested loop). The total memory Quadrants allocates across all those buffers is roughly
Collaborator

why do we need to qualify with "that the reverse pass has to remember."? Are there loop-carried variables that the reverse pass does not have to remember?

Comment thread docs/source/user_guide/autodiff.md Outdated
**Memory cost.** The adstack pipeline allocates one small scratch buffer per loop-carried variable that the reverse pass has to remember. For example, a kernel whose dynamic loop reads and updates one float accumulator needs 1 adstack; a kernel whose loop updates four different floats needs 4. Integer counters and boolean branch flags used by the reverse pass also count (typically one each per dynamic `if` or nested loop). The total memory Quadrants allocates across all those buffers is roughly

```
num_threads * stack_size * bytes_per_element * num_loop_carried_variables
```

Collaborator

can you clarify where num_threads suddenly springs from? I'm guessing it's from the top-level for loop, but you don't introduce this, I think. Or at least, I don't remember your introducing this.

Comment thread docs/source/user_guide/autodiff.md Outdated
```
num_threads * stack_size * bytes_per_element * num_loop_carried_variables
```

where `bytes_per_element` depends on the element type and the backend. On the LLVM backends (CPU / CUDA / AMDGPU) each adstack slot stores both a primal and an adjoint value, so f32 costs 8, i32 costs 8, and bool costs 2 bytes per slot. On the SPIR-V backends (Metal / Vulkan) integer adstacks only store the primal (the reverse pass does not accumulate integer adjoints), and bool is widened to i32 at storage time because SPIR-V has no defined layout for `OpTypeBool`, so f32 costs 8, i32 costs 4, and bool costs 4 bytes per slot. The buffer lives on the device on GPU and in host RAM on CPU. `num_threads` is the number of threads the kernel actually dispatches, not a worst-case grid: on CPU this is the thread pool size (tens of threads), so the memory footprint stays small; on GPU it is the dispatched ndrange. The buffer grows on demand to match the largest size any launch has needed so far and is then reused across subsequent launches, so you do not need to reserve memory up front.
Collaborator

let's ditch the "where", otherwise there's no room to breathe. Seems like a bunch of new concepts here, so let's give the reader time to breathe. It's a new paragraph.

Collaborator

this all seems like way too much detail. Do we really need to know this to use autodiff? Move it to an 'advanced' or 'under the hood' section.

Collaborator

keep 'The buffer grows on demand to match the largest size any launch has needed so far and is then reused across subsequent launches, so you do not need to reserve memory up front.'

Comment thread docs/source/user_guide/autodiff.md Outdated
**Problem.** Reverse-mode AD through a dynamic loop (one whose trip count is not known at compile time) needs to recover the primal value at each iteration when walking the loop backwards. Without that, the chain-rule steps read a stale value and the gradients come out silently wrong. Static-unrolled (`qd.static(range(...))`) loops are not affected because every iteration becomes its own inlined block at compile time.

**How Quadrants does it.** An opt-in compiler pipeline called the *autodiff stack* (*adstack*) allocates a per-variable stack alongside each loop-carried primal. The forward pass pushes an entry each iteration; the reverse pass pops them back off in reverse order to recover the correct primal for every chain-rule step. It is opt-in because it costs extra per-thread memory and compile time, and because most kernels do not need it. Running with adstack enabled when it is not strictly needed is safe. Running without it when it is needed raises a `QuadrantsCompilationError` in most cases (the autodiff pass rejects a non-static range that would otherwise lose its primal); in the narrow cases where the kernel compiles anyway, the reverse pass reads a stale value for every iteration and the gradients come out wrong but non-zero.
**How Quadrants does it.** An opt-in compiler pipeline called the *autodiff stack* (*adstack*) allocates a per-variable stack alongside each primal that is updated inside the loop and therefore changes from one iteration to the next. The forward pass pushes an entry each iteration; the reverse pass pops them back off in reverse order to recover the correct primal for every chain-rule step. It is opt-in because it costs extra per-thread memory and compile time, and because most kernels do not need it. Running with adstack enabled when it is not strictly needed is safe. Running without it when it is needed raises a `QuadrantsCompilationError` in most cases (the autodiff pass rejects a non-static range that would otherwise lose its primal); in the narrow cases where the kernel compiles anyway, the reverse pass reads a stale value for every iteration and the gradients come out wrong but non-zero.
Collaborator

break the first sentence into two, so each sentence just states a single concept.

Collaborator

" the autodiff stack (adstack) " => "adstack". We only ever refer to it as adstack, so let's say adstack is its name. We can put the long form in brackets if you want: "called the adstack (short for "(a)uto(d)iff (stack)")"

Collaborator

I think remove "and therefore changes from one iteration to the next"

Collaborator

" It is opt-in" => "adstack is opt-in"

Collaborator

"Running without it when it is needed raises a QuadrantsCompilationError in most cases (the autodiff pass rejects a non-static range that would otherwise lose its primal);" => nice 🙌

Collaborator

"in the narrow cases where the kernel compiles anyway, the reverse pass reads a stale value for every iteration and the gradients come out wrong but non-zero." => would be nice to get rid of such exceptional cases. Do we know what they are? Can we document them?

Comment thread docs/source/user_guide/autodiff.md Outdated
Collaborator

I know it's not part of your changes in this PR, but it's part of your set of PRs overall, and anyway, I think it would be good to address:

I couldn't understand this sentence. I mean, I probably could if I thought about it a long time, but I think it would be good to explain step by step, so people can just read fluidly and understand it, without grinding to a halt and having to work through stuff in their head.

Comment thread docs/source/user_guide/autodiff.md Outdated
- A loop-carried dependency (a variable read, written, and read again across iterations, e.g. `v = v * 0.95 + 0.01`).
- A loop-carried variable - one whose value is carried forward from each iteration into the next, e.g. `v = v * 0.95 + 0.01`.
- A local variable used as an index into a global field.
- Non-linear ops (`sin`, `cos`, `exp`, `sqrt`, `tanh`, `pow`, ...) whose derivative depends on the primal value at that iteration.
Collaborator

can we have a counter-example of a non-linear op that doesn't need adstack, in the 'do not need it' section below please.

Comment thread docs/source/user_guide/autodiff.md Outdated
| nested `for i in range(a[None]): for j in range(b[None])` | `a_max * b_max + 2` |
| `qd.ndrange(n, m)` with field-derived `n`, `m` | `n_max * m_max + 2` |

At `max_n_dofs_per_entity = 16`, a 16 x 16 ndrange hits the default exactly (`256`).
Collaborator

mixture of fonts is visually distracting. Either put backticks around 16 x 16 too, or remove them from = 16.

Comment thread docs/source/user_guide/autodiff.md Outdated

At `max_n_dofs_per_entity = 16`, a 16 x 16 ndrange hits the default exactly (`256`).

**Memory footprint.** The pipeline allocates one scratch buffer per piece of reverse-pass state. That count includes every loop-carried variable the reverse pass has to replay, plus any integer counter and any boolean branch flag it has to read back. Total memory across all buffers is approximately
Collaborator

what's a 'piece'? per stack-carried variable?

Collaborator

oh you define it next 🤔

Collaborator

Maybe we just avoid the issue by avoiding having to use this noun at all? For example:

"The pipeline allocates a scratch buffer for each loop-carried variable, and also for any loop counter, and any boolean branch flags." ?

Still, I'm unclear about this 'integer counter' and 'boolean branch flag'. You haven't defined them before. Could you define these in a previous paragraph please.

Comment thread docs/source/user_guide/autodiff.md Outdated

`num_threads` is the number of threads the kernel actually dispatches. On CPU that is the thread-pool size, typically tens. On GPU it is the full ndrange. `bytes_per_slot` scales with the element's storage size and the backend; see the two tables below.

On LLVM backends (CPU / CUDA / AMDGPU), each adstack slot stores both a primal and an adjoint value, so `bytes_per_slot = 2 * sizeof(T)` for every element type `T`. Common cases:
Collaborator

Nice 🙌

Comment thread docs/source/user_guide/autodiff.md Outdated
At `max_n_dofs_per_entity = 16`, a `16 x 16` ndrange hits the default exactly (`256`).

**Memory footprint.** The pipeline allocates one scratch buffer per piece of reverse-pass state. That count includes every loop-carried variable the reverse pass has to replay, plus any integer counter and any boolean branch flag it has to read back. Total memory across all buffers is approximately
**Memory footprint.** With one scratch buffer per adstack (see above), the total memory cost depends on two further quantities. The first is the number of threads the kernel actually dispatches, which we call `num_threads`. On CPU that is the thread-pool size, typically tens. On GPU it is the full ndrange. The second is `bytes_per_slot`, which scales with the element's storage size and the backend; the two tables below work through its concrete values. Total memory across all buffers is then approximately:
Collaborator

have we used the term 'element' thus far? Have we defined it?

Comment thread docs/source/user_guide/autodiff.md Outdated
@hughperkins (Collaborator)

Checklist:

  • Doc updated, and looks good to me (readable, makes sense; a few variable names seem questionable, but they are pre-existing and should be addressed separately from this PR, to avoid inflating the PR)

=> ok to merge

@duburcqa merged commit ef4b9ff into main Apr 24, 2026
55 of 57 checks passed
@duburcqa deleted the duburcqa/split_adstack_llvm_heap branch April 24, 2026 07:19