[AutoDiff] Autodiff 12: Heap-backed adstack on LLVM backends (CPU/CUDA/AMDGPU)#537
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a619832bc4
Additional findings (outside current diff — PR may have been updated during review):
🔴 quadrants/program/compile_config.h:54-68 — The comment added to compile_config.h falsely claims SPIR-V now uses heap-backed StorageBuffers, but SPIR-V heap-backing is explicitly deferred to PR #493; SPIR-V still allocates adstacks as Function-scope per-thread on-chip memory bounded by the driver shader-compiler limit. The 32 → 256 default increase therefore multiplies per-thread private memory 8x on Metal/Vulkan, which can cause pipeline-creation failures for kernels that previously compiled correctly. The companion docs change compounds this by removing the only warning about the SPIR-V on-chip cap and replacing it with advice to bump default_ad_stack_size, which on SPIR-V causes shader-compiler rejection rather than heap growth.
What is wrong
compile_config.h lines 57-64 add a comment justifying the 32 to 256 default increase: "Both backends now heap-back the primal/adjoint slots: SPIR-V uses per-dispatch StorageBuffers (BufferType::AdStackHeapFloat + AdStackHeapInt, sliced by invocation)". This claim is factually incorrect for the current PR. The PR description itself explicitly states "SPIR-V side is the following PR", meaning PR #493 has not landed yet. After this PR merges, SPIR-V (Metal and Vulkan) still allocates every AdStackAllocaStmt using ir_->alloca_variable(arr_type) with spv::StorageClassFunction - Function-scope per-thread on-chip private memory - exactly as before.
Concrete code path
spirv_codegen.cpp:2221-2223 (unchanged in this diff) visits AdStackAllocaStmt and calls alloca_variable() for count_var, primal_arr, and adjoint_arr. There is no BufferType::AdStackHeapFloat or BufferType::AdStackHeapInt anywhere in the SPIR-V codegen, and no SPIR-V files appear in the list of changed files. The 8x raise is therefore applied to the SPIR-V path unconditionally by ControlFlowGraph::determine_ad_stack_size() in transforms/determine_ad_stack_size.cpp, which is arch-agnostic and falls back to default_ad_stack_size for any stack whose worst-case trip count cannot be statically proven.
Why existing code does not prevent it
The SPIR-V codegen has no guard that caps AdStackAllocaStmt::max_size against a per-thread on-chip budget - that responsibility fell on the deliberately-conservative 32-slot default. Removing that conservatism by raising the default 8x while the SPIR-V heap path is absent eliminates the only protection.
Impact
For a kernel with 4 f32 loop-carried variables and a dynamic loop whose trip count the compiler cannot prove:
- Old default (32): per-thread Function-scope demand = `4 * (8 + 32 * 2 * 4)` bytes ≈ 1 KB
- New default (256): per-thread Function-scope demand = `4 * (8 + 256 * 2 * 4)` bytes ≈ 8 KB
Apple Metal's MSL shader compiler rejects pipelines whose per-thread private-variable footprint exceeds its budget (typically 4-16 KB on consumer hardware), and the existing test test_adstack_shader_compile_failure_raises with ad_stack_size=65536 confirms this failure mode exists. A kernel that compiled and ran correctly at the 32-slot default may now fail at kernel-launch time with a pipeline-creation error whose message does not mention adstack size.
Step-by-step proof
- User calls qd.init(arch=qd.metal, ad_stack_experimental_enabled=True) - default_ad_stack_size is now 256 after this PR.
- A kernel has 4 f32 loop-carried variables under a range(n[None]) whose bound comes from a field.
- determine_ad_stack_size cannot prove the worst-case trip count, sets max_size = 256 for each adstack.
- SPIR-V codegen emits 4 x OpVariable StorageClassFunction arrays of 8x256 bytes each = 8192 bytes of per-thread on-chip private memory.
- Metal MSL compiler rejects the pipeline; compute.grad() raises RuntimeError: Failed to create pipeline with no hint that adstack sizing is the cause.
- The user consults the updated docs, which now say to bump default_ad_stack_size when they see a SPIR-V overflow. Doing so makes the pipeline failure worse, not better.
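For reference, the byte arithmetic behind these steps can be reproduced with a small sketch (the helper is hypothetical; it assumes the two-values-per-slot f32 layout and the 8-byte count variable used in the figures above):

```python
# Hypothetical helper: worst-case SPIR-V Function-scope bytes per thread.
# Assumes each f32 adstack slot stores primal + adjoint (2 * 4 bytes),
# plus an 8-byte count variable per stack, matching the figures above.

def per_thread_private_bytes(num_vars, slots, elem_bytes=4,
                             values_per_slot=2, header_bytes=8):
    return num_vars * (header_bytes + slots * values_per_slot * elem_bytes)

print(per_thread_private_bytes(4, 32))   # old default (32 slots): ~1 KB
print(per_thread_private_bytes(4, 256))  # new default (256 slots): ~8 KB
```

At the 256-slot default the result already sits inside the 4-16 KB budget range cited above, which is why the regression surfaces only on some hardware.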
Docs regression
The removed Known Limitations bullet was the only explicit warning that on SPIR-V backends the adstack is allocated as per-thread on-chip memory capped by the driver. The new Tuning the capacity section and the Memory cost statement that "The buffer grows on demand to match the largest size any launch has needed so far" are presented without qualification and are false for SPIR-V - on SPIR-V the capacity is baked into the compiled shader at compile time, there is no grow-on-demand behavior, and bumping the value risks compile-time pipeline rejection rather than graceful heap growth.
Fix
Keep default_ad_stack_size at 32 until PR #493 lands. Update the comment in compile_config.h to remove the false SPIR-V StorageBuffer claim. Restore the Known Limitations bullet about the SPIR-V on-chip cap so Metal/Vulkan users receive the correct guidance.
0bae8ab to
eed8931
Compare
04d17cd to
4f43e77
Compare
eed8931 to
3d25bd7
Compare
4f43e77 to
10e5547
Compare
3d25bd7 to
c233cfb
Compare
10e5547 to
35b25a4
Compare
c233cfb to
e6ed50d
Compare
35b25a4 to
3a3e58c
Compare
e6ed50d to
d9c0752
Compare
3a3e58c to
98f2246
Compare
d9c0752 to
1240009
Compare
98f2246 to
c625fc5
Compare
| **Tuning the capacity.** Two `qd.init()` knobs control adstack sizing:
|
| - `default_ad_stack_size=N` (default `256`): the fallback capacity for loops whose trip count the compiler cannot prove statically. Every adstack whose max_size was not deducible shares this value. Prefer tuning this knob, since it only affects the branch where the compiler needed to guess.
Let's have the units, please: `default_ad_stack_size_mb`.
If this is in units, or int32s, or something, then maybe something like `default_ad_stack_size_count`, `default_ad_stack_size_units`, or `default_ad_stack_size_i32s`?
What happens if there is an i64 in the loop?
| - `ad_stack_size=N` (default `0 = adaptive`): a hard override that forces every adstack in the program to exactly `N` slots, regardless of what the compiler proved. Prefer this knob only when a targeted experiment needs uniform sizing (e.g. stress-testing the runtime heap path).
Units, please: `ad_stack_size_mb`.
| **How to pick `default_ad_stack_size`.** The reverse pass of a `K`-iteration dynamic loop emits `K + 2` pushes per adstack (the trip count plus two setup pushes: one for the initial adjoint slot and one for the primal's starting value). Size the default at the flat trip count of the deepest unprovable dynamic loop in the program, plus that headroom. Common shapes:
`default_ad_stack_size_mb`
| - A single `qd.ndrange(n, m)` whose bounds come from a field: worst case is `n_max * m_max` iterations. Pick `N >= n_max * m_max + 2`. At `max_n_dofs_per_entity = 16`, 16 x 16 = 256 hits the default exactly.
Don't we need to multiply by 4?
| - Nested `for i in range(a[None]): for j in range(b[None]):`: worst case is `a_max * b_max`, same rule.
| - A single dynamic `for i in range(a[None])`: `N >= a_max + 2`.
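The `K + 2` rule above can be sketched in a few lines (illustrative helper, not Quadrants API; it applies the stated rule literally):

```python
# Minimal sketch of the K + 2 sizing rule for default_ad_stack_size:
# capacity = flat worst-case trip count of the nested dynamic loops,
# plus two setup pushes, as described in the quoted docs.

def min_default_ad_stack_size(*loop_bounds):
    trip_count = 1
    for bound in loop_bounds:       # nested dynamic loops multiply
        trip_count *= bound
    return trip_count + 2           # + two setup pushes

print(min_default_ad_stack_size(16, 16))  # qd.ndrange with n_max = m_max = 16
print(min_default_ad_stack_size(100))     # single dynamic loop, a_max = 100
```

Note that applying the rule literally to the 16 x 16 case gives 258, slightly above the 256 default the docs say it "hits exactly"; the two setup pushes are the difference.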
| **Memory cost.** The adstack pipeline allocates one small scratch buffer per loop-carried variable that the reverse pass has to remember. For example, a kernel whose dynamic loop reads and updates one float accumulator needs 1 adstack; a kernel whose loop updates four different floats needs 4. Integer counters and boolean branch flags used by the reverse pass also count (typically one each per dynamic `if` or nested loop). The total memory Quadrants allocates across all those buffers is roughly
Why do we need to qualify with "that the reverse pass has to remember"? Are there loop-carried variables that the reverse pass does not have to remember?
| ```
| num_threads * stack_size * bytes_per_element * num_loop_carried_variables
| ```

Can you clarify where `num_threads` suddenly springs from? I'm guessing it's from the top-level for loop, but I don't think you introduce this. Or at least, I don't remember your introducing it.
| where `bytes_per_element` depends on the element type and the backend. On the LLVM backends (CPU / CUDA / AMDGPU) each adstack slot stores both a primal and an adjoint value, so f32 costs 8, i32 costs 8, and bool costs 2 bytes per slot. On the SPIR-V backends (Metal / Vulkan) integer adstacks only store the primal (the reverse pass does not accumulate integer adjoints), and bool is widened to i32 at storage time because SPIR-V has no defined layout for `OpTypeBool`, so f32 costs 8, i32 costs 4, and bool costs 4 bytes per slot. The buffer lives on the device on GPU and in host RAM on CPU. `num_threads` is the number of threads the kernel actually dispatches, not a worst-case grid: on CPU this is the thread pool size (tens of threads), so the memory footprint stays small; on GPU it is the dispatched ndrange. The buffer grows on demand to match the largest size any launch has needed so far and is then reused across subsequent launches, so you do not need to reserve memory up front. |
Let's ditch the "where", otherwise no room to breathe. A bunch of new concepts arrive here, so let's give the reader time to breathe: make it a new paragraph.
This all seems like way too much detail. Do we really need to know this to use autodiff? Move it to an 'advanced' or 'under the hood' section.
Keep 'The buffer grows on demand to match the largest size any launch has needed so far and is then reused across subsequent launches, so you do not need to reserve memory up front.'
| **Problem.** Reverse-mode AD through a dynamic loop (one whose trip count is not known at compile time) needs to recover the primal value at each iteration when walking the loop backwards. Without that, the chain-rule steps read a stale value and the gradients come out silently wrong. Static-unrolled (`qd.static(range(...))`) loops are not affected because every iteration becomes its own inlined block at compile time.
|
| **How Quadrants does it.** An opt-in compiler pipeline called the *autodiff stack* (*adstack*) allocates a per-variable stack alongside each primal that is updated inside the loop and therefore changes from one iteration to the next. The forward pass pushes an entry each iteration; the reverse pass pops them back off in reverse order to recover the correct primal for every chain-rule step. It is opt-in because it costs extra per-thread memory and compile time, and because most kernels do not need it. Running with adstack enabled when it is not strictly needed is safe. Running without it when it is needed raises a `QuadrantsCompilationError` in most cases (the autodiff pass rejects a non-static range that would otherwise lose its primal); in the narrow cases where the kernel compiles anyway, the reverse pass reads a stale value for every iteration and the gradients come out wrong but non-zero.
Break the first sentence into two, so each sentence states a single concept.
"the *autodiff stack* (*adstack*)" => "adstack". We only ever refer to it as adstack, so let's say adstack is its name. We can put the long form in brackets if you want: "called the adstack (short for '(a)uto(d)iff (stack)')".
I think remove "and therefore changes from one iteration to the next".
"It is opt-in" => "adstack is opt-in"
"Running without it when it is needed raises a `QuadrantsCompilationError` in most cases (the autodiff pass rejects a non-static range that would otherwise lose its primal);" => nice 🙌
"in the narrow cases where the kernel compiles anyway, the reverse pass reads a stale value for every iteration and the gradients come out wrong but non-zero." => it would be nice to get rid of such exceptional cases. Do we know what they are? Can we document them?
I know it's not part of your changes in this PR, but it's part of your set of PRs overall, and anyway, I think it would be good to address: I couldn't understand this sentence. I mean, I probably could if I thought about it a long time, but I think it would be good to explain step by step, so people can just read fluidly and understand it without grinding to a halt and having to work through stuff in their heads.
| - A loop-carried variable - one whose value is carried forward from each iteration into the next, e.g. `v = v * 0.95 + 0.01`.
| - A local variable used as an index into a global field.
| - Non-linear ops (`sin`, `cos`, `exp`, `sqrt`, `tanh`, `pow`, ...) whose derivative depends on the primal value at that iteration.
Can we have a counter-example of a non-linear op that doesn't need adstack, in the 'do not need it' section below, please.
| | nested `for i in range(a[None]): for j in range(b[None])` | `a_max * b_max + 2` |
| | `qd.ndrange(n, m)` with field-derived `n`, `m` | `n_max * m_max + 2` |
|
| At `max_n_dofs_per_entity = 16`, a 16 x 16 ndrange hits the default exactly (`256`).
The mixture of fonts is visually distracting: either put backticks around 16 x 16 too, or remove them from = 16.
| **Memory footprint.** The pipeline allocates one scratch buffer per piece of reverse-pass state. That count includes every loop-carried variable the reverse pass has to replay, plus any integer counter and any boolean branch flag it has to read back. Total memory across all buffers is approximately
What's a 'piece'? Per stack-carried variable?

Oh, you define it next 🤔

Maybe we just avoid the issue by avoiding having to use this noun at all? For example: "The pipeline allocates a scratch buffer for each loop-carried variable, and also for any loop counter, and any boolean branch flags."

Still, I'm unclear about this 'integer counter' and 'boolean branch flag'. You haven't defined them before. Could you define these in a previous paragraph, please.
| `num_threads` is the number of threads the kernel actually dispatches. On CPU that is the thread-pool size, typically tens. On GPU it is the full ndrange. `bytes_per_slot` scales with the element's storage size and the backend; see the two tables below.
|
| On LLVM backends (CPU / CUDA / AMDGPU), each adstack slot stores both a primal and an adjoint value, so `bytes_per_slot = 2 * sizeof(T)` for every element type `T`. Common cases:
| At `max_n_dofs_per_entity = 16`, a `16 x 16` ndrange hits the default exactly (`256`).
|
| **Memory footprint.** With one scratch buffer per adstack (see above), the total memory cost depends on two further quantities. The first is the number of threads the kernel actually dispatches, which we call `num_threads`. On CPU that is the thread-pool size, typically tens. On GPU it is the full ndrange. The second is `bytes_per_slot`, which scales with the element's storage size and the backend; the two tables below work through its concrete values. Total memory across all buffers is then approximately:
Have we used the term 'element' thus far? Have we defined it?
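As a sanity check on the quoted formula, here is a small sketch assuming the LLVM-backend `bytes_per_slot = 2 * sizeof(T)` rule quoted above (helper name illustrative):

```python
# Sketch of the footprint formula: num_threads * stack_size
# * bytes_per_slot * num_adstacks, with the LLVM-backend slot layout
# (each slot stores primal + adjoint, i.e. 2 * sizeof(T)).

def adstack_total_bytes(num_threads, stack_size, elem_size, num_adstacks):
    bytes_per_slot = 2 * elem_size  # LLVM backends: primal + adjoint
    return num_threads * stack_size * bytes_per_slot * num_adstacks

# CPU: thread-pool sized, typically tens of threads.
print(adstack_total_bytes(32, 256, 4, 4))         # 262144 bytes = 256 KB
# GPU: the full dispatched ndrange.
print(adstack_total_bytes(1_000_000, 256, 4, 4))  # ~8.2 GB
```

The contrast between the two calls is the point of the CPU/GPU distinction in the text: the same per-thread cost is negligible on a thread pool and dominant on a million-thread dispatch.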
Checklist: => ok to merge
Heap-backed adstack on LLVM backends (CPU / CUDA / AMDGPU)
TL;DR
Prior behaviour on CPU / CUDA / AMDGPU: every `AdStackAllocaStmt` lowered to a function-scope `alloca` at the task's entry block, so every adstack lived on the LLVM stack frame (= worker-thread stack on CPU, per-thread local memory on GPU). A kernel with many loop-carried values at `default_ad_stack_size=256` crossed the worker-thread limit and silently corrupted adjacent stack memory; the previous PR in the stack added a 256 KB codegen-time guard that hard-aborted those kernels.

This PR moves the storage off the stack:
- Codegen precomputes `{ad_stack_offsets_, ad_stack_per_thread_stride_}` per task, and `visit(AdStackAllocaStmt)` emits `base = runtime->adstack_heap_buffer + linear_thread_idx * stride + offset` instead of an alloca. Base is loaded once in `entry_block` and reused.
- `LlvmRuntimeExecutor::ensure_adstack_heap(needed_bytes)` grows the per-runtime slab via amortised doubling, publishes the new pointer/size into `runtime->{adstack_heap_buffer, adstack_heap_size}` by caching the two device field-pointer addresses on first grow and writing through them on every subsequent grow.
- Launchers call `ensure_adstack_heap(per_thread_stride * num_threads)` before each task launch. Dynamic-bound range-for tasks resolve `num_threads` by reading `begin`/`end` from `runtime->temporaries` via a host-side DtoH memcpy.
- CUDA graph capture is rejected when any task has `per_thread_stride > 0`, because graph baking precludes the host-side `ensure_adstack_heap` step between dispatches.
- `default_ad_stack_size` exposed via `qd.init()`; raised from 32 → 256 now that the per-thread on-chip / worker-stack budget no longer caps it.

Nothing changes for kernels that don't enable the adstack extension.
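The per-thread slicing in the first bullet can be sketched numerically (names illustrative; a flat heap slab carved into fixed-stride per-thread slices):

```python
# Sketch of the heap-slicing address computation:
# base = heap + linear_thread_idx * per_thread_stride + offset,
# where per_thread_stride is the sum of all adstack sizes in the task
# and offset is one stack's position inside a thread's slice.

def adstack_base(heap_base, thread_idx, per_thread_stride, offset):
    return heap_base + thread_idx * per_thread_stride + offset

# Two stacks of 24 and 40 bytes -> stride 64. Thread 3's second stack:
print(hex(adstack_base(0x1000, 3, 64, 24)))
```

Because the stride and offsets are fixed at codegen time, no thread's slice overlaps another's, which is what makes a single shared slab safe without synchronisation.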
Why
Prior to this PR, a kernel like a reverse-mode articulated-body dynamics step in Genesis hit the 256 KB CPU-stack budget at modest capacities (4 loop-carried f64 variables × 4096 entries × 16 bytes each already crosses it). The two alternatives — ship with `default_ad_stack_size` capped at a value small enough to fit on every worker stack, or ask users to lower `ad_stack_size` per-kernel — either regress correctness on large kernels or force tuning noise on the user. Moving the storage off-stack removes the constraint entirely: per-thread slice size is bounded only by `num_threads * per_thread_stride` and the driver's allocator.

Changes
Codegen (`quadrants/codegen/llvm/codegen_llvm.{h,cpp}`, `llvm_compiled_data.h`)

- `TaskCodeGenLLVM` grows three new per-task fields:
  - `ad_stack_per_thread_stride_` — sum of `AdStackAllocaStmt::size_in_bytes()` (aligned up to 8) for every adstack in the task.
  - `ad_stack_offsets_` — map from each alloca stmt to its offset within the per-thread slice.
  - `ad_stack_heap_base_llvm_` — cached SSA value of the heap base pointer, emitted once in `entry_block`.
- `init_offloaded_task_function` pre-scans the task body before any codegen runs and populates the first two, so that later sibling allocas never shift an earlier alloca's offset out from under a cached SSA pointer.
- `visit(AdStackAllocaStmt)` now emits `base = runtime->adstack_heap_buffer + linear_thread_idx * stride + offset`, where `linear_thread_idx` is the arch-appropriate invocation id (`RuntimeContext::cpu_thread_id` on CPU; `block_idx * block_dim + thread_idx` on CUDA / AMDGPU), matching how `rand_states` is indexed.
- The old 256 KB function-scope budget guard (introduced in the previous PR) is deleted; its `ad_stack_fn_scope_bytes_` accumulator is gone too. Heap-backed storage makes the ceiling irrelevant.
- `OffloadedTask` gains an `AdStackSizingInfo ad_stack` sub-struct that propagates sizing to the host launcher: `per_thread_stride`, `static_num_threads`, `dynamic_gpu_range_for`, plus const values and gtmps byte offsets for range-for `begin`/`end`.

Per-arch codegen tweaks
- `codegen_cpu.cpp` — fills `current_task->ad_stack` with the pre-scanned stride and sets `static_num_threads = cpu_thread_id_range` (CPU thread count is known at compile time).
- `codegen_cuda.cpp` — fills `current_task->ad_stack.static_num_threads = grid_dim * block_dim` for const-bound tasks, and marks `dynamic_gpu_range_for = true` + records `begin_offset_bytes` / `end_offset_bytes` / `begin_const_value` / `end_const_value` for dynamic range-for tasks so the launcher can resolve the actual iteration count at launch time.
- `codegen_amdgpu.cpp` — same as CUDA.

Runtime (`llvm_runtime_executor.{h,cpp}`, `runtime.cpp`)

- `LLVMRuntime` gains two new fields: `Ptr adstack_heap_buffer = nullptr; u64 adstack_heap_size = 0;`. These are read by every adstack-backed task on the device side; the host writes to them through the cached field-pointer addresses.
- `LlvmRuntimeExecutor::ensure_adstack_heap(needed_bytes)`:
  - No-ops when `needed_bytes == 0 || needed_bytes <= adstack_heap_size_`.
  - Picks `new_size = max(needed_bytes, 2 * adstack_heap_size_)` (amortised doubling).
  - Allocates via `llvm_device()->allocate_memory`, wraps the allocation in a `DeviceAllocationGuard`.
  - Publishes the new pointer/size into `runtime->adstack_heap_buffer` and `runtime->adstack_heap_size`, caching the two device field-pointer addresses on first grow and writing through them afterwards: `memcpy_host_to_device` (CUDA / AMDGPU) or plain pointer stores (CPU) against the cached addresses — no per-grow kernel launch.
  - Releases the old slab by dropping its `DeviceAllocationGuard` via move-assignment. Safety of the release (see the detailed block comment in the `.cpp` and the matching field comment in the `.h`): CPU uses `std::free` (trivially safe); CUDA `cuMemFree_v2` synchronises before returning; AMDGPU `dealloc_memory` pools through `CachingAllocator::release` without sync, and cross-launch safety on AMDGPU is provided by the synchronous `hipFree(context_pointer)` at the tail of `amdgpu::KernelLauncher::launch_llvm_kernel` (the latent fix in [AutoDiff] Autodiff 11: Latent adstack-adjacent fixes (AMDGPU hipFree, flush() keeps ctx_buffers_, always-preallocate) #536).
- `get_runtime_temporaries_device_ptr()` — cached lookup of `runtime->temporaries`, used by the GPU launchers to read back dynamic range-for bounds.

Per-arch launchers
- `runtime/cpu/kernel_launcher.{h,cpp}` — `Context` gains a parallel `ad_stack_needed_bytes` vector, precomputed at register time (CPU sizing is static). `launch_offloaded_tasks` calls `ensure_adstack_heap` per task.
- `runtime/cuda/kernel_launcher.cpp` — adds `resolve_num_threads(task)`, which DtoH-memcpys `begin`/`end` from `runtime->temporaries` for dynamic range-for tasks; calls `ensure_adstack_heap` per task.
- `runtime/amdgpu/kernel_launcher.cpp` — same as CUDA.
- `runtime/cuda/graph_manager.cpp` — hard-errors `graph=True` on kernels where any task has `per_thread_stride > 0`. Graph baking precludes host-side intervention between dispatches.

Capacity knob (`compile_config.h`, `python/export_lang.cpp`)

- `default_ad_stack_size` raised from 32 to 256.
- Still exposed as a `qd.init()` kwarg. The comment block is rewritten to reflect the new heap-backing reality.
- `ad_stack_size` (per-stack explicit capacity) is unchanged.
Docs (`docs/source/user_guide/autodiff.md`)

Drops the "SPIR-V on-chip cap" limitation from the known-limitations list (that was about the prior Function-scope SPIR-V path; with the SPIR-V heap landing in the next PR, it's gone too). Adds a "Tuning the capacity" section explaining `default_ad_stack_size` vs `ad_stack_size` and the K+2-pushes-per-iteration rule for picking N.
Tests (`tests/python/test_adstack.py`)

Heap-specific additions:

- `test_adstack_heap_grow_on_demand` — two launches at increasing capacity pinpoint that the amortised-doubling grow path fires and the second launch reuses the bigger slab.
- `test_adstack_heap_backed_exceeds_old_threadstack_budget` — a kernel whose per-thread adstack bytes exceed the pre-PR 256 KB ceiling now compiles and runs correctly.
- `test_adstack_cuda_graph_rejected_with_adstack` — `graph=True` on an adstack kernel raises.

Side-effect audit
- `AdStackSizingInfo` fields are serialised via the existing `OffloadedTask` key hash.
- No new `Stmt` fields (all sizing lives in codegen / `OffloadedTask`).
- CUDA graphs hard-error when any task has `per_thread_stride > 0`.
- AMDGPU release safety rests on the synchronous `hipFree(context_pointer)` tail, documented in the `.cpp` and `.h` comments.
- The `per_thread_stride == 0` path short-circuits the launcher-side `ensure_adstack_heap`.
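The `ensure_adstack_heap` grow policy described in the Runtime section can be modelled minimally (illustrative Python, not the C++ executor; the real version also allocates device memory and publishes the pointer):

```python
# Minimal model of the amortised-doubling grow-on-demand policy:
# no-op if the current slab is big enough, otherwise grow to at least
# double the current size, so total grows stay logarithmic.

class AdstackHeap:
    def __init__(self):
        self.size = 0
        self.grows = 0  # stands in for allocate / publish / free-old

    def ensure(self, needed_bytes):
        if needed_bytes == 0 or needed_bytes <= self.size:
            return  # current slab already covers the request
        self.size = max(needed_bytes, 2 * self.size)  # amortised doubling
        self.grows += 1

h = AdstackHeap()
h.ensure(1000)  # grows to 1000
h.ensure(1500)  # grows to max(1500, 2000) = 2000
h.ensure(1800)  # no-op: 1800 <= 2000
h.ensure(4000)  # grows to 4000
```

The doubling floor is what keeps a sequence of slowly increasing launch sizes from triggering a reallocation per launch.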
Split 2/3 of the former "heap-backed adstack" PR. Based on #536 (latent fixes). Followed by #493 (SPIR-V heap).