Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 86fb79c378
…wn-link-check stops failing
… the AllocaStmt site so they dominate every sibling push / load-top / acc-adjoint block
…lag, symmetric d2h grad gate, IRBuilder::cast skips OpBitcast(T,T), compute_dense_snode_strides refuses multi-child parents, kMaxNodes host-side guard, restore mismatched-lengths fusion regression, scrub remaining external-project references
…me/gfx/adstack_sizer_launch.cpp so runtime.cpp stays focused on the main-kernel record/submit flow
…reter switch as a no-op so -Wswitch stays silent; the LLVM encoder always host-folds FieldLoad leaves, so this arm is unreachable but keeps the enum coverage exhaustive
…iness and widen Supported loop shapes table from the test suite
…k bytecode offsets up to 256B for VUID-02999-safe storage-buffer descriptor bindings, bump kMaxVars to kAdStackSizeExprDeviceMaxBoundVars (32) and expose kAdStackSizerMaxPendingFrames so the host encoder hard-errors before the shader's fixed-size pending-frame stack can OOB, clamp pending_end_arr to begin+(1<<24) so MaxOverRange pathological ranges silently break matching the LLVM interpreter and host evaluator
…AD | WRITE) so a kernel that only non-atomic-writes .grad does not leave allocator garbage on device for d2h to blit back, and update kExternalTensorRead comment in adstack_size_expr_device.h to match the (idx, stride) pair layout the encoder actually emits
… alongside the SPIR-V-scope tests: parametrize the mismatched-lengths test with a peak-past-shape-b case that exercises the cross-ndarray independent-loops fusion path, and add a fld[i]-arr[i] test that guards against the fusion synthesizing a mixed FieldLoad+ExternalTensorRead body the LLVM-GPU encoder cannot lower
…ack sizer dispatch via track_physical_buffer, mirroring what the main-kernel dispatch already does at runtime.cpp's pre-dispatch block; without the useResource: hint the Apple7 GPU family (M1) returns zero for the sizer shader's PSB load of the ndarray pointer through the kernel arg buffer and every MaxOverRange-over-ExternalTensorRead collapses to zero, tripping an Adstack overflow at the next qd.sync(); non-Metal backends no-op the base track_physical_buffer and pay nothing
> **Problem.** Reverse-mode AD through a dynamic loop (one whose trip count is not known at compile time) needs to recover the primal value at each iteration when walking the loop backwards. Without that, the chain-rule steps read a stale value and the gradients come out silently wrong. Static-unrolled (`qd.static(range(...))`) loops are not affected because every iteration becomes its own inlined block at compile time.
> **How Quadrants does it.** Quadrants provides a dedicated compiler pipeline for this, called the *adstack* (short for "(a)uto(d)iff (stack)"). It allocates a per-variable stack alongside each primal that is updated inside the loop. The forward pass pushes one entry per iteration. The reverse pass walks the stack from top down: at each reverse iteration it reads the current top entry as many times as that iteration's chain-rule contributions need (one `read` per downstream use of the primal), then pops the entry once and steps to the iteration underneath. Peeking-and-reusing rather than popping per use matters because the same primal often feeds several chain-rule terms in one iteration - e.g. a `v` that appears in both a `sin(v)` and a subsequent `v * w` needs to be visible to both adjoint terms before it is discarded. Enabling adstack costs extra per-thread memory and compile time, but some kernels need it.
"from top down: at each reverse " => "from top down. At each reverse "
"(one read per downstream use of the primal)"
=> move to a new sentence after the current one, (so we don't need to pusht he current sentence onto the stack whilst reading, then read the parentheses, then pop the ucrrent setnecne back off the stack in our mind, which makes things harder to read I feel)
"Peeking-and-reusing rather than popping per use matters because the same primal often feeds several chain-rule terms in one iteration - e.g. a v that appears in both a sin(v) and a subsequent v * w needs to be visible to both adjoint terms before it is discarded. " => This seems like an implementation detail, not needed by the end-user? => "under the hood" style section
"The flag is compile-time, so it must be set before the offending kernel compiles. " => this seems superfluous? Remove, or move to 'under the hood' (you are justifying your impelmetnation design I feel, rather than telling the user what they need to know in order to use the feature).
> **Workflow.** Enable the pipeline at init time and keep using the normal reverse-mode workflow: `qd.init(..., ad_stack_experimental_enabled=True)`. The flag is compile-time, so it must be set before the offending kernel compiles. Any mainstream modern GPU works on the SPIR-V side (Apple Silicon for Metal, any recent discrete GPU for Vulkan); on the odd legacy device that is missing the hardware features the sizer relies on, the launch errors out with a clear message and you can fall back to the LLVM runtime (CPU / CUDA / AMDGPU).
"Any mainstream modern GPU works on the SPIR-V side (Apple Silicon for Metal, any recent discrete GPU for Vulkan); on the odd legacy device that is missing the hardware features the sizer relies on, the launch errors out with a clear message and you can fall back to the LLVM runtime (CPU / CUDA / AMDGPU)." => this seems like details best left for later? (maybe some 'compatibility' section or similar?)
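A minimal end-to-end sketch of the workflow quoted above (hedged: the `ad_stack_experimental_enabled` flag name comes from the doc; the `quadrants` import alias, field constructors, and kernel body are illustrative assumptions, not a confirmed API surface):

```python
# Minimal sketch of the workflow above; field constructors and kernel body
# are illustrative assumptions, only the init flag comes from the doc.
import quadrants as qd

qd.init(ad_stack_experimental_enabled=True)  # must run before the kernel compiles

x = qd.field(dtype=qd.f32, shape=(), needs_grad=True)
y = qd.field(dtype=qd.f32, shape=(), needs_grad=True)
n = qd.field(dtype=qd.i32, shape=())

@qd.kernel
def compute():
    v = x[None]
    for _ in range(n[None]):      # dynamic trip count: needs the adstack
        v = qd.sin(v)
    y[None] = v

n[None] = 10
x[None] = 0.5
compute()
y.grad[None] = 1.0
compute.grad()                    # normal reverse-mode workflow, unchanged
print(x.grad[None])
```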
> **Note.** Running with adstack enabled when it is not strictly needed is safe, but not the other way around. Running without it when it is needed raises a `QuadrantsCompilationError` in most cases: the autodiff pass rejects a non-static range that would otherwise lose its primal. A few edge-case loop shapes still slip past that rejection and produce silently-wrong gradients; these are tracked and fixed in the autodiff pass as they surface, so if you see wrong-but-non-zero gradients through a dynamic loop with adstack disabled, turn it on and rerun as a sanity check.
"A few edge-case loop shapes still slip past that rejection and produce silently-wrong gradients; these are tracked and fixed in the autodiff pass as they surface" => does this mean the gradients will in fact be correct? or will be silently wrong?
"if you see wrong-but-non-zero gradients through a dynamic loop " => how to detect this?
> how to detect this
There is no way to detect this I'm afraid. Just running some external finite-difference check, or precisely that: turning it on and rerunning as a sanity check. We can just say: "Checking that the value of the gradient is unaffected when turning adstack on is the best way to determine whether adstack is necessary."
"Checking that the value of the gradient is unaffected when turning adstack on is the best way to determine whether adstack is necessary." Yup, if that's what we have, let's run with that. Maybe loosen "the best way" to "is a reasonable way"
> Reverse-mode AD walks the forward kernel in reverse and applies the chain rule at every op. The chain-rule factor at each op is that op's derivative with respect to its input. For *non-linear* ops (`sin`, `cos`, `exp`, `sqrt`, `tanh`, `pow`, ...) that derivative depends on the input's primal, so the reverse pass needs the primal value that was there on the forward pass. For *linear* ops (addition, subtraction, multiplication by a constant) the derivative is itself a constant and no primal is needed. In a dynamic loop the forward pass writes a different primal at each iteration, so the reverse pass cannot simply re-read the latest value - it needs one per iteration. adstack provides exactly that: a per-iteration stash of the primal.
Nice and clear to me 🙌
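A plain-Python illustration of the quoted paragraph (not the Quadrants API): reversing a dynamic loop with only the final, stale primal gives a wrong gradient, while a per-iteration stash - what the adstack stores - gives the right one.

```python
# Illustration of the paragraph above: reversing a dynamic loop with only
# the final (stale) primal is wrong; a per-iteration stash is right.
import math

def grad_with_stash(x, n):
    primals, v = [], x
    for _ in range(n):
        primals.append(v)        # forward pass stashes each primal
        v = math.sin(v)          # non-linear op: derivative needs its input
    dv = 1.0
    for p in reversed(primals):
        dv *= math.cos(p)        # correct chain-rule factor per iteration
    return dv

def grad_stale(x, n):
    v = x
    for _ in range(n):
        v = math.sin(v)
    return math.cos(v) ** n      # re-reads only the latest value: wrong

print(grad_with_stash(0.7, 4), grad_stale(0.7, 4))  # the two values differ
```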
> ### Supported loop shapes
>
> *You do not need to read this section to use reverse-mode AD. If a kernel exceeds its adstack capacity, Quadrants raises a Python exception at the next `qd.sync()` whose message recommends bumping `default_ad_stack_size` - that is usually enough. Read on only if you hit that overflow, want to understand why, or want to cap the memory footprint explicitly.*
>
> The compiler recognises a set of bound shapes:
I think we should just delete this section, or move it to some highly advanced section.
Users will just expect stuff to work I feel, and anything else is a bug.
What's an example of something that does NOT work currently?
> What's an example of something that does NOT work currently?
Actually I would prefer to keep it. We only support a small subset of all the craziness a user may want to do. Here are examples that are not supported:
```python
@qd.kernel
def my_kernel(a):
    for i in range(sqrt(a)):  # any non-linear transformation, plus min/max
        [...]

@qd.kernel
def my_kernel(a, b):
    for i in range(a):
        for j in range(b):
            for k in range(a[i], b[j]):  # non-matching running variables + indices
                [...]

@qd.kernel
def my_kernel(a):
    for i in range(a.shape[0]):
        while a[i] < 10:  # bound that cannot be determined without running the kernel
            a[i] = a[i] + 1
```
Alright, so move to a reference section. And state something like "Note that whilst we support many common loop constructs, many constructs which are allowed in the absence of adstack are not currently handled. See Appendix A for supported loop constructs."
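For contrast with the unsupported shapes listed above, here is a sketch of bounds the thread describes the sizer grammar as covering (hedged: field constructors and names are illustrative, only the loop-bound shapes come from the discussion):

```python
# Bound shapes described elsewhere in this thread as in-grammar (sketch;
# field constructors and names are illustrative).
import quadrants as qd

qd.init(ad_stack_experimental_enabled=True)
n = qd.field(dtype=qd.i32, shape=())
x = qd.field(dtype=qd.f32, shape=(64,), needs_grad=True)

@qd.kernel
def supported():
    for i in range(64):               # static bound: no adstack needed
        for _ in range(n[None]):      # scalar-field-load bound: in-grammar
            x[i] = qd.sin(x[i])
```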
> **Tuning.** Two `qd.init()` knobs control adstack sizing, both measured in slots per adstack (not bytes):
>
> | Bound shape | Example |
this table seems like something to go in a reference section.
"The compiler recognises a set of bound shapes (see 'appendix 123' for authorative list). Any loop shape outside this set is rejected at compile time. The error names the offending source line; typical fixes are to restructure the loop into one of the shapes in the appendix, or to file a bug. "
(but see above comment that we should just delete this seciton altogether, or move the entire section to highly advanced seciton)
Where would you like to put this appendix? Another markdown page?
There was a problem hiding this comment.
At the end of the same markdown page.
> ### Under the hood (advanced)
>
> **Sizing rule.** A `K`-iteration dynamic loop consumes `K + 2` slots in each of its adstacks: one slot per forward iteration, plus two setup slots (one for the initial adjoint, one for the primal's starting value). `default_ad_stack_size` is a per-stack slot count, so size it at the worst-case trip count of the deepest unprovable dynamic loop in the program plus 2.
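A worked footprint example from the `K + 2` rule quoted above (hedged: slot width and thread count are illustrative assumptions, not fixed by the doc):

```python
# Worked example of the "K + 2 slots" rule quoted above. Slot width and
# thread count are illustrative assumptions.
worst_trip_count = 1000                 # deepest unprovable dynamic loop
slots = worst_trip_count + 2            # one per iteration + two setup slots
bytes_per_slot = 8                      # e.g. f32 primal + f32 adjoint
num_threads = 1 << 20                   # parallel-for grid
per_stack_bytes = slots * bytes_per_slot
print(per_stack_bytes * num_threads / 2**30, "GiB per adstack across threads")
```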
> *This section is a glimpse of how the adstack is sized and laid out in memory. You never need to read it to use reverse-mode AD - skip it unless you are curious, debugging an overflow or OOM, or planning to extend the list of supported loop shapes.*
"You never need to read it to use reverse-mode AD " => "never" seems a very strong claim. I'm pretty skeptical of this claim given the constraints on memory, loop structure etc.
I prefer the earlier phrasing. I don't remember what that was.
Oh here we go:
"You do not need to read this section to use reverse-mode AD. Skip past it unless you hit an overflow error on SPIR-V, an out-of-memory error on GPU, or a compile error pointing at an adstack alloca."
so:
- remove the first sentence, superfluous
- remove 'never'
- break second sentence into two
- basically just use the original paragraph I pasted above :)
> | `num_buffers` | Number of adstacks the kernel allocates - one per loop-carried variable plus one per dependent branch flag (see [One adstack per variable](#one-adstack-per-variable)). |
>
> On SPIR-V backends (Metal / Vulkan) the slot layout is trimmed: adstacks whose `T` is an integer type (`i32`, `i64`, ...) only store the primal because the reverse pass does not accumulate integer adjoints, and per-thread on-chip memory is more constrained than on LLVM. So `bytes_per_slot = sizeof(T)` for integer `T` and `bytes_per_slot = 2 * sizeof(T)` for floating-point `T`. SPIR-V has no defined layout for `OpTypeBool`, so booleans are widened to i32 at storage time:
>
> Every adstack slot always stores a *primal* value - the forward-pass value the reverse pass pops to recover the chain-rule step. Floating-point adstacks additionally store an *adjoint* slot where the reverse pass accumulates chain-rule contributions. Integer / boolean adstacks do not need an adjoint slot, but LLVM backends still carry one for codegen uniformity. SPIR-V backends trim it. SPIR-V also widens `bool` to `i32` at storage time, because SPIR-V has no defined layout for `OpTypeBool`.
"but LLVM backends still carry one for codegen uniformity. SPIR-V backends trim it. SPIR-V also widens bool to i32 at storage time, because SPIR-V has no defined layout for OpTypeBool." => I would break this platform-specific stuff into a separate paragraph somehow.
"Platform-specific notes:
- even though integer/boolean adstacks do not need an adjoint slot, LLVM backends still carry one for codegen uniformity
- SPIR-V stores bools using 4 bytes (32 bits)"
"Therefore some examples of the size of adstack slots on each platform are as follows:"
> - Reverse-mode AD does not propagate gradients through integer casts or non-real operations. No error is raised; the gradient simply stops at the cast and silently reads as zero upstream. Cast to `qd.f32` / `qd.f64` before the differentiable section.
> - Backward passes on non-trivial kernels run noticeably slower than the corresponding forward pass, sometimes by an order of magnitude on SPIR-V.
> - **Loop bounds read from a writable ndarray are unsafe.** If a reverse-mode kernel has a loop whose iteration count comes from `n[j]`, the sizer snapshots `n[j]` at backward-dispatch entry. Any scenario that makes that snapshot diverge from the value the kernel body ends up executing against drives the adstack undersized and the gradient silently comes out zero or wrong. Two patterns to avoid: (a) the same kernel writes `n[j]` before the loop reads it (`for i in range(n[j])` after `n[j] = something`); (b) across kernels, where `kernel_A` writes `n` and `kernel_B.grad()` reads it as a loop bound - the per-launch numpy-ndarray upload can overwrite the device copy with host-side state that differs from what `kernel_A` left behind, and for `qd.ndarray` a missing host-side barrier between the two launches has the same effect. Quadrants does not detect either pattern. Workaround: treat ndarrays the reverse-mode kernel reads as loop bounds as read-only within and across the relevant kernel chain - compute iteration counts once, store them in a dedicated scalar field or ndarray that is never written during the backward path, and iterate over that.
> - **When a dynamic-loop reverse kernel fails.** In normal use, enabling the adstack pipeline and running a reverse-mode kernel through a dynamic loop should just work. Two rare situations can make it fail, and it is worth knowing which one you are in:
"In normal use, enabling the adstack pipeline and running a reverse-mode kernel through a dynamic loop should just work. Two rare situations can make it fail, and it is worth knowing which one you are in:" => this is superfluous I feel.
Are these really limitations? It seems like saying "a limitation of numpy tensors is that they cannot be larger than available physical memory."? 🤔
These should probably go in some "Gotchas" or "What can go wrong" or similar kind of section. I'm not sure they are limitations?
Well, the thing is that it can eat memory pretty fast, and what is being allocated and why is not immediately obvious. When you allocate a gigantic numpy array you will notice it, here it is hidden.
I would argue that a 'limitation' is some issue with our implementation, rather than a fundamental characteristic of the underlying maths etc.
As suggested, let's have a section on things that can go wrong, that users can consult.
Analogy:
doc 1:
"Limitations:
- this toaster does not work when it is not plugged into the power"
doc 2:
"Why doesn't my toaster work?
- check: is it plugged into the power?"
> - split the inner section into its own `@qd.kernel`;
> - pass `ad_stack_size=N` to `qd.init()` with `N` large enough to cover the real push count (bypasses the sizer).
> - *The kernel is legitimately too deep for the hardware* - surfaces as an out-of-memory error from the allocator, before the kernel even starts running. A reverse pass through many loop-carried variables at a large ndrange can ask the runtime for more adstack memory than the device can physically back, even when the sizer's number is correct. Remedies are the ones listed under *Avoiding OOM on GPU* above: fewer loop-carried variables, a smaller ndrange, manual checkpointing, or more device-memory headroom.
> - **Compile time scales with loop nesting.** The adstack pipeline trades compile time for generality. Kernels with many loop-carried variables, nested dynamic loops, or large inner-loop bodies produce visibly slow compile times - seconds stretching into minutes, and on SPIR-V backends sometimes into the territory where the driver's shader compiler gives up. Budget compile time accordingly when migrating existing reverse-mode AD workloads.
is this a limitation? Seems more like some kind of performance characteristic?
Where do you want to mention this kind of information if not in limitations?
"Performance characteristics"?
"Performance"?
> - **Backward passes are slower than forward passes.** In particular, SPIR-V backward passes can be an order of magnitude slower than the forward pass.
this isn't really a limitation I feel. In general, in the convnet world at least, backward passes are mathematically twice as long as forward passes. That's not really a limitation. It just comes out of the maths.
Maybe just limit to "SPIR-V backward passes can be an order of magnitude slower than the forward pass" (maybe say why?)
> - **Integer casts stop gradients.** Reverse-mode AD does not propagate gradients through integer casts or non-real operations: integers have no meaningful derivative, so the gradient reads as zero upstream of the cast - no error, just silently-zero gradients. Rules of thumb:
is this really a limitation? This is just a characteristic of differentiation I think? You cannot pass a gradient through a discretization step. (at least, not without using approximations, such as Gumbel-softmax etc).
> - keep differentiable variables in `qd.f32` / `qd.f64` through the full forward chain;
> - casting *to* a float is safe - the downstream float section remains differentiable, and the cast itself contributes a unit factor to the chain rule;
> - casting *back to* an integer stops the gradient at that point, so only do it at integer-indexing sites, after any arithmetic whose gradient you need.
> - **Loop bounds that read an ndarray written to by the same kernel are unsafe.** If a reverse-mode kernel has a loop whose iteration count comes from `n[j]` and the same kernel also writes to `n[j]` before that loop runs (`for i in range(n[j])` after `n[j] = something`), the computed gradient may silently come out zero or wrong. Quadrants does not currently detect this. Workaround: put iteration counts in a separate ndarray (or scalar field) the kernel only reads, or split the write and the differentiable loop into two `@qd.kernel` calls. The same caveat applies across kernels - a kernel that writes an ndarray read as a loop bound by a later `.grad()` call needs the host-side contents of that ndarray to already hold the value the backward kernel will execute against.
it's not clear to me from skim-reading whether this is a limitation of our implementation, or just a fundamental characteristic that falls out of the maths. I'm guessing the former. Can you clarify in reply to this comment please.
I don't know exactly. What I know for sure is that I found it very unexpected when I discovered this limitation and that it is not something uncommon. We need this in Genesis typically. I would be surprised if such a limitation exists in Torch, otherwise more people would compute completely wrong gradients x) The biggest issue is that it fails silently.
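A sketch of the hazard discussed in this thread next to the workaround the doc recommends - keep reverse-mode loop bounds in storage that is never written on the backward path (hedged: field constructors and names are illustrative):

```python
# Sketch of the mutated-loop-bound hazard and the doc's workaround.
# Field constructors and names are illustrative.
import quadrants as qd

qd.init(ad_stack_experimental_enabled=True)
n = qd.field(dtype=qd.i32, shape=(4,))
trip = qd.field(dtype=qd.i32, shape=(4,))   # read-only iteration counts
x = qd.field(dtype=qd.f32, shape=(4,), needs_grad=True)

@qd.kernel
def unsafe(j: qd.i32):
    n[j] = n[j] + 1                  # write...
    for _ in range(n[j]):            # ...then read as a bound: the sizer
        x[j] = qd.sin(x[j])          #    snapshot can diverge, gradient wrong

@qd.kernel
def safe(j: qd.i32):
    for _ in range(trip[j]):         # bound lives in a never-written field
        x[j] = qd.sin(x[j])
```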
…section: split "Known limitations" into "What can go wrong" + "Performance characteristics", move the supported-loop-shapes table to Appendix A with a scoped title, relocate "Integer casts stop gradients" to the main reverse-mode section as a property of differentiation, rewrite the mutated-bound bullet against the verified sizer-reads-twice mechanism with a Torch contrast, strip OpTypeBool/alloca/compile-time-synthesized jargon, anchor SPIR-V GPU compatibility to late-2018 hardware; add test_adstack_sizer_trip_count_qd_ndarray_mutated_by_separate_kernel as a strict xfail pinning the cross-kernel pattern verified failing on cpu/metal/vulkan
```cpp
int32_t compute_max_mor_nesting(const SerializedSizeExpr &expr) {
  std::vector<int32_t> depth(expr.nodes.size(), -1);
  std::function<int32_t(int32_t)> visit = [&](int32_t i) -> int32_t {
    if (i < 0 || static_cast<std::size_t>(i) >= expr.nodes.size())
      return 0;
    if (depth[i] >= 0)
      return depth[i];
    const auto &n = expr.nodes[i];
    int32_t child_max = 0;
    for (int32_t c : {n.operand_a, n.operand_b, n.body_node_idx}) {
      if (c >= 0)
        child_max = std::max(child_max, visit(c));
    }
    int32_t self = static_cast<SizeExpr::Kind>(n.kind) == SizeExpr::Kind::MaxOverRange ? 1 : 0;
    depth[i] = self + child_max;
    return depth[i];
  };
  int32_t max_depth = 0;
  for (std::size_t i = 0; i < expr.nodes.size(); ++i) {
    max_depth = std::max(max_depth, visit(static_cast<int32_t>(i)));
  }
  return max_depth;
}
```
🟡 compute_max_mor_nesting at adstack_size_expr_eval.cpp:330-352 over-counts MOR depth by walking operand_a and operand_b as if they accumulate alongside body_node_idx, but in case_mor of the SPIR-V sizer shader (adstack_sizer_shader.cpp:706-767) begin and end subtrees evaluate fully BEFORE the outer MOR pushes its body frame, so they only contribute transient depth. The correct recurrence is max(visit(operand_a), visit(operand_b), self + visit(body_node_idx)); the current self + max(...) is sound but causes spurious QD_ERROR rejections at the kAdStackSizerMaxPendingFrames=16 cap for kernels whose true peak fits but whose begin/end MOR chains push the apparent depth over. Nit severity: over-conservative bound, not a memory-safety issue.
Extended reasoning...
What the bug is
compute_max_mor_nesting at quadrants/program/adstack_size_expr_eval.cpp:330-352 walks the post-order tree and, for any MaxOverRange node, returns:
```cpp
int32_t self = static_cast<SizeExpr::Kind>(n.kind) == SizeExpr::Kind::MaxOverRange ? 1 : 0;
depth[i] = self + child_max;  // child_max = max(visit(op_a), visit(op_b), visit(body))
```

i.e. `self + max(visit(operand_a), visit(operand_b), visit(body_node_idx))`. The encoder hard-errors at `encode_bytecode_common` when this depth exceeds `spirv::kAdStackSizerMaxPendingFrames = 16` (adstack_size_expr_eval.cpp:711-716).
Why the recurrence does not match the runtime
The SPIR-V sizer shader processes nodes iteratively in post-order. In case_mor (quadrants/codegen/spirv/adstack_sizer_shader.cpp:706-767):
- Lines 709-710 read `begin_i64 = load_values_at(op_a_i32)` and `end_i64 = load_values_at(op_b_i32)` -- already-cached values from `values_arr[]`. Because the bytecode is post-order, the `op_a` and `op_b` subtrees were fully processed BEFORE control reached this MOR node: any nested `MaxOverRange` inside them executed its own `case_mor` (push), walked its body, and hit `done_lbl` (pop), restoring `sp` to whatever it was before that subtree began.
- Lines 743-756 push exactly ONE pending frame for this MOR (the parent), then redirect `current` to `body_start`. The body chain runs under that pushed frame.
So during evaluation of operand_a / operand_b, the outer MOR is NOT yet on the pending stack. Only the body chain runs with the outer frame pushed. The actual peak pending-frame depth is:
peak(MOR) = max(peak(operand_a), peak(operand_b), 1 + peak(body))
NOT 1 + max(peak(operand_a), peak(operand_b), peak(body)) as the helper computes.
Step-by-step counterexample
Take `MOR_outer(operand_a = MOR(MOR(MOR(plain))), operand_b = plain, body = plain)`:
- Linear walk processes the 3-deep nested MOR chain on the `operand_a` side first. Each nested MOR runs `case_mor`: push frame (sp += 1), walk body, hit `done_lbl` (pop, sp -= 1). The chain reaches a transient peak of `sp = 3` while inside the innermost body, then unwinds back to `sp = 0`.
- After `op_a` finishes, `op_b` (a plain leaf) writes to `values[]` without touching the pending stack. `sp` stays at 0.
- Control reaches `MOR_outer`'s `case_mor`. It pushes one frame (sp 0->1) and walks its body (a plain leaf), then pops (sp 1->0).
True peak across the whole evaluation = 3 (attained inside the begin chain, with MOR_outer not yet on the stack).
compute_max_mor_nesting(MOR_outer) returns: 1 + max(visit(begin), visit(end), visit(body)) = 1 + max(3, 0, 0) = 4.
The over-count is exactly +1 per outer MOR whose max(begin_depth, end_depth) > 1 + body_depth. With kAdStackSizerMaxPendingFrames = 16, a kernel whose true peak depth is 16 but whose begin/end MOR chains accumulate to 16+ via this recurrence is rejected at host encoding time even though it would run correctly.
Why existing code does not catch it
The cap check is the only consumer of compute_max_mor_nesting. It treats the helper's output as a worst-case at-runtime depth and rejects trees above 16. There is no runtime cross-check against the actual peak sp in the shader, so the discrepancy surfaces only as a false-positive compile error when the helper's result crosses 16 but the real peak does not. Since the inequality is always helper >= true_peak, the cap of 16 still safely prevents OOB on the shader-side pending_*_arr[16].
Impact
Severity is nit because the over-count is sound -- it only ever rejects kernels that would actually run correctly. Realistic reverse-mode kernels have shallow MOR nesting (verifiers note <=4 distinct bound vars, ~10 tree depth observed); the trigger requires deep MOR chains (>1 nesting) inside begin/end of an outer MOR, which is uncommon today but can compound through nested wrapping in larger kernels. No memory-safety hazard, no UB, no under-counting. The only user-visible effect is a spurious MaxOverRange nesting depth ... exceeds the sizer shader's ... compile error on a kernel that would otherwise fit.
Fix
One-line change at line 339-344:

```cpp
const auto &n = expr.nodes[i];
int32_t a_depth = (n.operand_a >= 0) ? visit(n.operand_a) : 0;
int32_t b_depth = (n.operand_b >= 0) ? visit(n.operand_b) : 0;
int32_t body_depth = (n.body_node_idx >= 0) ? visit(n.body_node_idx) : 0;
int32_t self = static_cast<SizeExpr::Kind>(n.kind) == SizeExpr::Kind::MaxOverRange ? 1 : 0;
depth[i] = std::max({a_depth, b_depth, self + body_depth});
```

This matches the runtime peak exactly: begin and end subtrees contribute their own peaks (taken individually since they are sequential, not stacked), and the body contributes the outer MOR's pushed frame plus the body chain's own depth. The encoder's hard-error guard then fires only when the actual runtime peak would exceed 16.
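A quick plain-Python check of the proposed recurrence on the counterexample above (hedged: this models the review's description of the pending-frame stack, not the actual shader; node encoding is made up for the sketch):

```python
# Model of the proposed peak recurrence on the review's counterexample.
# node = (kind, operand_a, operand_b, body) with kind "mor" or "leaf".
def true_peak(node):
    kind, a, b, body = node
    sp = 0
    # begin/end subtrees evaluate fully before the outer MOR pushes a frame
    for child in (a, b):
        if child is not None:
            sp = max(sp, true_peak(child))
    if kind == "mor":
        sp = max(sp, 1 + (true_peak(body) if body else 0))
    return sp

leaf = ("leaf", None, None, None)
mor = lambda a=None, b=None, body=None: ("mor", a, b, body)
deep = mor(body=mor(body=mor(body=leaf)))   # 3-deep chain in operand_a
outer = mor(a=deep, b=leaf, body=leaf)
print(true_peak(outer))   # 3, where self + max(...) would report 4
```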
```cpp
if (autodiff_mode == AutodiffMode::kReverse && ad_use_stack) {
  irpass::determine_ad_stack_size(ir, config);
  print("Autodiff stack size determined");
}
```
🔴 scalarize drops AdStackAllocaStmt::size_expr (and stack_id) when lowering tensor-typed adstacks to scalar ones. Because this PR moves determine_ad_stack_size from offload_to_executable into compile_to_offloads (lines 160-163), the pre-pass now runs BEFORE scalarize and populates size_expr/max_size=1 on the tensor adstack; scalarize.cpp:719 then creates fresh scalar AdStackAllocaStmts via make_unique(element_type, stmt->max_size) which only copies max_size, leaving size_expr=nullptr on every scalar child. Any reverse-mode kernel with a Vector/Matrix loop-carried variable inside a field- or ndarray-bounded dynamic loop overflows on the first push and surfaces as 'Adstack overflow' at qd.sync(). Fix is one line in scalarize.cpp:719+: copy stmt->size_expr (a shared_ptr; safe to share since SizeExpr is documented immutable post-pre-pass) into each scalar child.
Extended reasoning...
What the bug is
The PR moves determine_ad_stack_size from offload_to_executable (after scalarize) to the end of compile_to_offloads at lines 160-163 (before scalarize). auto_diff.cpp:752 creates tensor-typed AdStackAllocaStmts for Vector/Matrix loop-carried variables (dtype = alloc->ret_type.ptr_removed()). The new pre-pass position means determine_ad_stack_size now sees these tensor-typed adstacks and, for non-const symbolic bounds, populates size_expr and seeds max_size = 1 as a placeholder (determine_ad_stack_size.cpp:1342-1344).
Then scalarize.cpp:712-728 runs (gated on config.real_matrix_scalarize, default true) and creates fresh scalar adstacks:
```cpp
auto scalar_ad_stack = std::make_unique<AdStackAllocaStmt>(element_type, stmt->max_size);
```

The constructor at statements.h:1611 only initializes dt and max_size; size_expr (line 1606) and stack_id (line 1609) keep their default values of nullptr and -1. Each scalar child carries max_size=1 and size_expr=nullptr, then the original tensor adstack is erased (line 727).
Step-by-step proof
Concrete trigger:
```python
@qd.kernel
def k():
    for i in x:
        v = qd.Vector([0.0, 0.0, 0.0])
        for _ in range(n[None]):  # field-bounded dynamic loop
            v = qd.sin(v) + 0.1
        y[i] = v
```

- `auto_diff.cpp:752` makes `AdStackAllocaStmt(TensorType<3 x f32>, 0)` for `v`.
- Bellman-Ford leaves `max_size=0` (positive loop). `determine_ad_stack_size` (now at compile_to_offloads.cpp:161) walks pushes, builds `size_expr = MaxOverRange(... FieldLoad(n))`, sets the `max_size=1` placeholder per the comment at determine_ad_stack_size.cpp:1338-1340.
- `scalarize` runs in `offload_to_executable` at line 302. `visit(AdStackAllocaStmt)` creates 3 scalar adstacks with `make_unique<AdStackAllocaStmt>(element_type, 1)` - `size_expr=nullptr` on each.
- Codegen pre-scan in `init_offloaded_task_function` (codegen_llvm.cpp:1769-1782) assigns `stack_id` and pushes `alloca->size_expr ? alloca->size_expr->serialize() : SerializedSizeExpr{}` - empty for every scalar child. `AdStackAllocaInfo::max_size_compile_time = 1`.
- The codegen assertion `stmt->max_size > 0 || stmt->size_expr` passes (1 > 0). Stack id assigned, heap stride sized to 1 slot.
- Runtime `publish_adstack_metadata` (llvm_runtime_executor.cpp:671) sees `size_exprs[i].nodes.empty()`, falls back to `max_size_compile_time=1`. Per-thread heap holds 1 slot per scalar adstack.
- First push to `v`'s scalar adstack overflows the per-thread slice. Surfaces as `Adstack overflow` at the next `qd.sync()`.
Pre-PR, determine_ad_stack_size ran AFTER scalarize (in offload_to_executable, called at line 287 of the pre-PR file), so scalar adstacks created by scalarize had their max_size resolved by the analyzer directly. The new ordering inverts this: the symbolic bound is computed on the soon-to-be-deleted tensor adstack, then dropped on the floor.
Existing tensor-adstack tests in tests/python/test_adstack.py do not catch this: every Vector/Matrix reverse-mode test either uses ad_stack_size=N (the global override that bakes max_size=N into the alloca and bypasses the pre-pass entirely — the gather predicate at determine_ad_stack_size.cpp filters max_size==0) or uses a static iteration count (where Bellman-Ford resolves max_size directly to a constant > 1, which scalarize correctly copies).
Impact
Silent under-sizing of the per-thread adstack heap on a realistic kernel shape (Vector/Matrix loop-carried variables with field/ndarray-bounded dynamic loops). Surfaces as a runtime overflow at qd.sync() on every backend, with the misleading message that points at restructuring the loop or extending the grammar even though the structural pre-pass DID resolve it correctly. The grammar gap message is wrong for this case — the bound is fine, the propagation is broken.
Fix
One-line change in scalarize.cpp:719+:
```cpp
auto scalar_ad_stack = std::make_unique<AdStackAllocaStmt>(element_type, stmt->max_size);
if (stmt->size_expr) {
  scalar_ad_stack->size_expr = stmt->size_expr;
}
scalar_ad_stack->ret_type = element_type;
scalar_ad_stack->ret_type.set_is_pointer(true);
```

SizeExpr is documented as immutable post-pre-pass (PR description's Side-effect audit row on Stmt clone / serialization), so the shared_ptr copy is cheap and safe to share across scalarized children. The original tensor adstack is being erased anyway, so the parent's shared_ptr ownership transfers naturally.
An alternative would be to re-run determine_ad_stack_size after scalarize (the deleted call site at offload_to_executable.cpp:288), but option (a) is smaller and matches the PR's stated immutability invariant.
Nice doc! Thanks! Checklist:
=> ok to merge
Runtime-evaluated adstack sizing on LLVM backends
TL;DR
Before this PR, the inner `range(n[None])` forced the CFG analyzer into its `default_ad_stack_size=256` fallback, so `n[None] = 512` overflowed at the next `qd.sync()`. After this PR the compiler captures the loop bound as a `SizeExpr`, the launcher evaluates it against `n[None]` at launch time, and the backing heap is grown to exactly `512 + 2` slots.
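The TL;DR (and "Kernels like the one above" in the next section) refers to a kernel whose code block did not survive extraction. A plausible reconstruction (hedged: names, field constructors, and the kernel body are illustrative; only `range(n[None])`, `default_ad_stack_size=256`, and `n[None] = 512` come from the text):

```python
# Reconstructed sketch of the kernel the TL;DR references (illustrative).
import quadrants as qd

qd.init(default_ad_stack_size=256, ad_stack_experimental_enabled=True)
n = qd.field(dtype=qd.i32, shape=())
x = qd.field(dtype=qd.f32, shape=(), needs_grad=True)
y = qd.field(dtype=qd.f32, shape=(), needs_grad=True)

@qd.kernel
def compute():
    v = x[None]
    for _ in range(n[None]):   # the inner range(n[None]) from the TL;DR
        v = qd.sin(v)
    y[None] = v

n[None] = 512                  # pre-PR: overflows the 256-slot fallback
compute()
y.grad[None] = 1.0
compute.grad()                 # post-PR: heap grown to exactly 512 + 2 slots
```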
Why

Reverse-mode AD through a dynamic loop requires an adstack whose depth equals the forward trip count plus two setup pushes. Until now Quadrants had to resolve that depth at compile time, so every loop bound that involved a field load (`n_bodies`, `n_dofs`, an ndarray shape, an enclosing parallel-for index) fell back to the user-configured `default_ad_stack_size`. For realistic reverse-mode kernels this fallback fails in both directions: an `ndrange(n_bodies, n_dofs)` easily needs >256 slots, triggering a runtime `QuadrantsAssertionError: Adstack overflow`, while a globally bumped default multiplies GPU memory by the number of threads.

The traditional workaround - tune `default_ad_stack_size` per program - is not tunable enough: you cannot set a different default for every adstack, and bumping it globally multiplies GPU memory by the number of threads. This PR replaces the static fallback with a symbolic upper-bound expression captured at compile time and reduced per launch against the live field / ndarray state, so each adstack gets exactly the depth that dispatch needs.

Entry point
`LlvmRuntimeExecutor::publish_adstack_metadata(const AdStackSizingInfo &ad_stack, std::size_t num_threads, LaunchContextBuilder *ctx, void *device_runtime_context_ptr = nullptr)` at quadrants/runtime/llvm/llvm_runtime_executor.cpp. Called from cpu/cuda/amdgpu/kernel_launcher.cpp right before every task's dispatch. It:

- reads the task's `size_exprs` vector (post-order `SerializedSizeExpr` trees, one per `AdStackAllocaStmt`);
- evaluates each tree in adstack_size_expr_eval.cpp, reading scalar field values through `SNodeRwAccessorsBank` and launch-context ndarray shapes / scalar reads through the `LaunchContextBuilder`. The AMDGPU path threads the launcher's device-side `RuntimeContext` staging copy (kernel_launcher.cpp:185-186) through `device_runtime_context_ptr` so HIP kernels that resolve `ExternalTensorRead` leaves against ndarray pointers do not fault on a host pointer; CUDA relies on UVA and passes `nullptr`; CPU ignores the argument;
- publishes the results into `LLVMRuntime::adstack_{per_thread_stride,offsets,max_sizes}` via `runtime_get_adstack_metadata_field_ptrs` so generated code reads them at every push / load-top site;
- grows the backing heap with `ensure_adstack_heap(per_thread_stride * dispatched_threads)`.

Shapes the grammar covers are sized exactly. Shapes outside the grammar (e.g. a `range(field[i])` / `range(ndarray[i])` indexed by a loop-carried variable that the structural walker does not fold) fall back to `default_ad_stack_size` and emit a `QD_WARN` naming the source location so the user can either restructure the loop or extend the grammar.

Mechanism end-to-end
1. Compile-time SizeExpr capture

quadrants/transforms/determine_ad_stack_size.cpp runs a Bellman-Ford-first, structural-walker-second pre-pass that attempts to express each `AdStackAllocaStmt`'s max depth as a `SizeExpr` tree. Grammar covered:

| IR pattern at the loop bound | SizeExpr constructor |
| --- | --- |
| compile-time constant | `SizeExpr::make_const` |
| `GlobalLoadStmt` of a scalar `i32`/`i64` SNode | `make_field_load` with the SNode pointer + constant index path |
| `ExternalTensorShapeAlongAxisStmt` | `make_ndarray_shape(arg_id, axis)` |
| external tensor (ndarray) element read | `make_ndarray_read(arg_id, index_path)` |
| `LoopIndexStmt` of an enclosing bounded range-for / parallel-for | `make_loop_index_max` (bounded by the loop's `end` or a `max` over every push site) |
| `AdStackLoadTopStmt` of a stash adstack | `make_stack_load_top_max` (bounded by a `max` over every push site) |
add,sub,mul(const * anything viaAdd-replication),max.expr_sub's MaxOverRange fusion caps the fused iteration range atmin(shape_a, shape_b)(expressed in-grammar asa + b - max(a, b)) so the fused body cannot read the shorter ndarray past its buffer end when a caller passes mismatched-length ndarrays. Anything else falls back todefault_ad_stack_sizeand emits theQD_WARNnoted above.SizeExprobjects are immutable after the pre-pass writes them and are serialised as post-orderSerializedSizeExpr::nodesso offline-cache hits evaluate the same way as fresh compiles.2. Codegen reads stride / offset / max-size from runtime metadata
2. Codegen reads stride / offset / max-size from runtime metadata

| Site | Emitted code |
| --- | --- |
| codegen_llvm.cpp, `AdStackAllocaStmt` | `heap + linear_tid * runtime->adstack_per_thread_stride + runtime->adstack_offsets[stack_id]` (all three loaded at runtime) |
| codegen_llvm.cpp, `AdStackPushStmt` | `stack_push(runtime, stack, runtime->adstack_max_sizes[stack_id], element_size)` (cap is a runtime load, not an IR immediate) |
| codegen_llvm.cpp, `init_offloaded_task_function` pre-scan | assigns `AdStackAllocaStmt::stack_id` in declaration order; `ad_stack_offsets_` became `std::vector<std::size_t>` indexed by `stack_id` |

The heap base address is unchanged; only the per-thread stride and per-alloca offset math was moved from codegen-time immediates to runtime loads, so kernels compiled with the old flow will recompile but do not break semantics.
3. Host evaluator
adstack_size_expr_eval.{h,cpp} walks the flat post-order tree:
- `FieldLoad` leaves go through `SNodeRwAccessorsBank` to read the current scalar value on the host side.
- Ndarray leaves read through the `LaunchContextBuilder`, which captures both the pointer and the shape on kernel launch.
- `LoopIndexMax` and `StackLoadTopMax` fold to whichever launch-time bound they were captured against.
- All arithmetic is performed in `int64_t`.

The evaluator runs strictly on the host, before the kernel dispatches, so there is no device-side evaluator to keep in sync.
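A minimal model of that host evaluator (hedged: plain Python, not the real `SerializedSizeExpr` layout; node encoding and helper names are made up for the sketch):

```python
# Minimal model of the host evaluator above: a flat post-order node list
# reduced with a value stack (sketch; encoding is illustrative).
def eval_size_expr(nodes, leaf_value):
    # nodes: list of ("const", c) | ("leaf", key) | ("add"|"sub"|"mul"|"max",)
    stack = []
    for node in nodes:
        if node[0] == "const":
            stack.append(node[1])
        elif node[0] == "leaf":
            stack.append(leaf_value(node[1]))     # field / ndarray / bound read
        else:
            b, a = stack.pop(), stack.pop()
            op = node[0]
            stack.append(a + b if op == "add" else a - b if op == "sub"
                         else a * b if op == "mul" else max(a, b))
    return stack[-1]

# max(n, 8) + 2 in post-order:
nodes = [("leaf", "n"), ("const", 8), ("max",), ("const", 2), ("add",)]
print(eval_size_expr(nodes, {"n": 512}.__getitem__))   # -> 514
```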
4. Per-kernel metadata bridge
`runtime_get_adstack_metadata_field_ptrs(LLVMRuntime *)` in runtime.cpp returns pointers to `adstack_{per_thread_stride,offsets,max_sizes}`. The host writes into these fields once per dispatch. On CPU the fields are read directly from `LLVMRuntime`; on CUDA / AMDGPU the bridge triggers a `memcpy_host_to_device` so the generated code's subsequent loads see the right values. The memcpy is gated on the task actually carrying an adstack: kernels with no reverse-mode adstacks skip `publish_adstack_metadata` entirely and pay no per-launch cost.
Per-backend coverage matrix

| Backend launcher | Outside-grammar behaviour | Metadata publish |
| --- | --- | --- |
| cpu/kernel_launcher.cpp | `default_ad_stack_size` fallback + `QD_WARN` | direct field writes (no `memcpy_host_to_device`) |
| cuda/kernel_launcher.cpp | same | `memcpy_host_to_device` |
| amdgpu/kernel_launcher.cpp | same | `memcpy_host_to_device` (host pointer routed through the launcher's device-side `RuntimeContext` staging copy) |

Tests
tests/python/test_adstack.py:
- `test_adstack_field_load_bounded_loop_evaluated_per_launch` - pins the core guarantee: a kernel with `for _ in range(n[None])` bound by an i32 scalar field. Launches at `n_iter in (1, 20, 50)` and asserts the heap resized correctly (no overflow at 50, no over-allocation at 1), with the decorator set to `default_ad_stack_size=2` to prove the knob is bypassed when the grammar resolves the bound symbolically.
- `test_adstack_near_capacity` - rewritten to exercise both sides of the old `default_ad_stack_size=32` boundary at `n_iter in (30, 100)` without any knob override. Arch-restricted to `qd.cpu` until the SPIR-V sizer PR lands.
- `test_adstack_structural_pre_pass_fuses_sub_of_max_over_range_with_matching_shape_ends` / `..._with_mismatched_shape_ends` - pin the `expr_sub` MaxOverRange fusion for both the strict-equality and the cross-ndarray same-axis paths.
- `test_adstack_fuses_sub_of_max_over_range_with_mismatched_lengths_is_safe` - pins the `min(shape_a, shape_b)` end cap so the cross-ndarray fusion stays in-bounds when the two ndarrays have different lengths at launch time (pre-fix the fused body read the shorter ndarray past its buffer end).
- `test_adstack_inner_range_bounded_by_ndarray_read_at_outer_index` / `test_adstack_inner_range_bounded_by_multidim_ndarray_read` / `test_adstack_ext_tensor_read_indexed_by_stashed_outer_loop_var` - cover the `ExternalTensorRead` grammar leaves, including multi-axis stride handling and the stash-then-reload pattern.
Side-effect audit
quadrants/analysis/offline_cache_util.cppserialisesSerializedSizeExpr::nodesalongside the existingsize_in_bytesperAdStackAllocaStmtdefault_ad_stack_size.statements.h:AdStackAllocaStmt::size_expris astd::shared_ptr<SizeExpr>cloned via the defaultQD_DEFINE_ACCEPT_AND_CLONESizeExpris immutable post pre-pass; shared pointer is safe to share across clonesir/ir.cppAdStackAllocaStmt::reprprintsstack_idand a shortSizeExprsummaryAdStackAllocaStmt::stack_id == -1falls through to the olddefault_ad_stack_size-backed pathSizeExprstill behave as beforeSizeExpris stored onAdStackAllocaStmt, not emitted as IR operands, so it does not participate in CSEadstack_{per_thread_stride,offsets,max_sizes}initialised to zero / null inruntime_initialize; kernels with no adstacks read nothing from them and skippublish_adstack_metadataentirelymemcpy_host_to_deviceexpr_sub's cross-ndarray fusion caps the fused end atmin(shape_a, shape_b)Runtime-evaluated SizeExpr for SPIR-V adstack sizing
TL;DR
On LLVM backends (#550) the reverse-mode adstack for `v` is already sized per-launch from the evaluated `SizeExpr` tree. This PR extends the same behaviour to the Metal and Vulkan backends: `compute.grad(a)` now sizes the per-thread heap to exactly `n = a[i]` slots per outer iteration on SPIR-V too, where before it either baked the compile-time fallback into the shader or hit a `max_size > 0` codegen assert on dynamic bounds.

Why
#550 landed `SizeExpr`-at-launch-time sizing for LLVM and seeded a `default_ad_stack_size` compile-time fallback into the SPIR-V immediates so the existing SPIR-V codegen kept asserting cleanly. That seed is safe but wasteful: the SPIR-V shader's per-thread heap slice still has to be provisioned at the worst-case `default_ad_stack_size` for every alloca that was not statically const-foldable, even when the live field values would give a much tighter bound at launch. Kernels like the one above allocate `n_outer_iters * default_ad_stack_size * 8` bytes per loop-carried variable on the device; at `default_ad_stack_size=256` with a few dozen allocas that is tens of MB per thread on Metal - crossing the driver's `maxBufferLength` on anything larger than a toy workload.
AdStackAllocaStmt::size_expris captured duringdetermine_ad_stack_size,OffloadedTask::ad_stack.allocas[i].size_exprpersists through the offline cache, andLlvmRuntimeExecutor::publish_adstack_metadataevaluates each tree right before each dispatch. What was missing on SPIR-V was the final per-launch metadata hand-off to the shader. This PR adds it: one newBufferType::AdStackMetadataSSBO, one newTaskAttributes::ad_stackstruct carrying per-alloca metadata plus serializedSizeExpr, oneGfxRuntime::launch_kernelprelude that evaluates the trees and writes the buffer, and codegen rewrites everyAdStackAllocaStmt/Push/LoadTop/AccAdjointsite to read its stride / offset / max_size from that buffer instead of emitting compile-time immediates.This PR also tightens the grammar-unresolved path on both backends: where #550 falls back to
default_ad_stack_sizewith aQD_WARN, this PR flips the fallback to aQD_ERRORso any adstack outside the grammar is a compile-time error naming the source location. Shapes inside the grammar are sized exactly per launch; shapes outside the grammar are rejected; thead_stack_size=Nescape hatch inqd.init()is retained for stress tests and for bypassing the sizer when debugging.Entry point
The adstack pre-pass `determine_ad_stack_size` already runs for all backends (gated on `autodiff_mode == kReverse && ad_use_stack`) and populates `AdStackAllocaStmt::size_expr`; this PR only wires SPIR-V codegen and the `GfxRuntime` launcher to honour that metadata. The relevant files:
- quadrants/codegen/spirv/kernel_utils.h - adds `TaskAttributes::AdStackSizingAttribs` / `AdStackAllocaAttribs` (replaces the scalar `ad_stack_heap_per_thread_stride_{float,int}` fields) and `BufferType::AdStackMetadata`.
- quadrants/codegen/spirv/spirv_codegen.cpp - serialises each alloca's `size_expr` into `task_attribs_.ad_stack.allocas`; emits runtime loads for stride / offset / max_size via `ensure_ad_stack_metadata_loaded`.
- quadrants/codegen/spirv/adstack_sizer_shader.cpp/.h - the on-device compute shader that evaluates each task's `SizeExpr` bytecode tree and writes per-task stride / offset / max_size into the `AdStackMetadata` buffer.
- quadrants/runtime/gfx/runtime.cpp::`GfxRuntime::launch_kernel` - evaluates every task's `SizeExpr` tree upfront (before the cmdlist opens, to serialise against any reentrant `SNodeRwAccessorsBank` reader-kernel launches), then uploads per-task metadata to the `AdStackMetadata` buffer and binds it alongside the existing heap buffers.
- quadrants/runtime/gfx/runtime.h::`GfxRuntime::Params::program_impl` - propagates the `ProgramImpl` back-reference so `evaluate_adstack_size_expr(...)` can reach `ProgramImpl::program` (same pattern the LLVM side uses).
1. Codegen: serialise per-alloca metadata into
TaskAttributescodegen/spirv/kernel_utils.hAdStackAllocaAttribs { heap_kind, offset_in_elems_compile_time, max_size_compile_time, SerializedSizeExpr size_expr }codegen/spirv/kernel_utils.hAdStackSizingAttribs { per_thread_stride_{float,int}_compile_time, allocas[] }(replaces the two scalarad_stack_heap_per_thread_stride_*fields)codegen/spirv/spirv_codegen.cpp::visit(AdStackAllocaStmt)attribs.size_expr = stmt->size_expr->serialize()when the pre-pass populated itThe pre-scan that sums per-type strides stays put; it just seeds the compile-time fallback values that the host uses when
size_expris empty (offline-cache-hit path without a serialised tree, i.e. pre-PR offline caches).2. Codegen: runtime loads at every adstack site
invoc_id * strideatget_ad_stack_heap_thread_base_{float,int}OpIMul(invoc_id, uint_immediate(stride))OpIMul(invoc_id, OpLoad(metadata_buffer[0 or 1]))offsetatad_stack_heap_{float,int}_ptruint_immediate(heap_primal_offset)OpLoad(metadata_buffer[2 + 2*stack_id])(cached per alloca oninfo.offset_val)max_sizeatAdStackPushStmtandad_stack_top_indexuint_immediate(stmt->max_size)/uint_immediate(max_size - 1)OpLoad(metadata_buffer[2 + 2*stack_id + 1])(cached oninfo.max_size_val) + oneOpISubfor themax - 1capuint_immediate(primal + max_size)OpIAdd(info.offset_val, info.max_size_val)- one derived add instead of a second buffer loadEach
AdStackSpirv::infocaches its SSA ids foroffset_val/max_size_val/adjoint_offset_valon first use and reuses them at every push / load / load_top_adj / acc_adjoint site, so each alloca pays one metadata read regardless of how many push/pop sites it has in the shader body.3. Sizer shader capability gate + scope-state safety
build_adstack_sizer_spirvhard-errors through the empty-return path unless the device advertisesspirv_has_physical_storage_buffer,spirv_has_int64,spirv_has_int8, andspirv_has_int16. Theemit_psb_load_i64per-dtype switch callsir.i{8,16}_type()unconditionally - without those capabilities the accessors return default-constructedSType(id=0)andspirv-valrejects the binary - so gating on all four surfaces a clear "legacy device missing a required hardware feature" diagnostic at launcher time instead of an invalid-bytecode rejection at pipeline creation.GfxRuntime::launch_kernelmirrors the gate so the launcher's error message and the shader builder's return agree.The shader body also clears
The shader body also clears `scope[pending_var_id]` to zero on every `MaxOverRange` pop (`done_lbl`). `scope_arr` is a single function-storage alloca zero-initialised once at `main()` entry, but `var_id_counter` resets per alloca, so different stacks in the same task reuse `scope[0]`, `scope[1]`, and so on. Without the pop-path clear, the *next* stack's outer linear pre-order walk reads the previous stack's terminal bound value as an `ExternalTensorRead` index into a potentially smaller ndarray - out of bounds on Metal (hung command buffer) or silently zero on Vulkan with `robustBufferAccess`. The clear preserves the "index 0 is always valid for any non-empty ndarray" invariant the outer walk relies on.
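A host-side sketch of the pop path. The scope array size, the accumulator bookkeeping, and all signatures are simplifying assumptions; only the `scope[pending_var_id] = 0` clear at `done_lbl` is the point:

```cpp
#include <cstdint>

// Host-side model of the shader's function-storage scope array.
constexpr int kMaxVars = 32;

struct SizerScope {
  int64_t scope[kMaxVars] = {};  // zero-initialised once at main() entry
};

// MaxOverRange pop path (the shader's done_lbl): fold the finished
// frame's result into the accumulator, then zero the slot so the next
// stack's pre-order walk (which reuses scope[0], scope[1], ...) never
// sees this stack's terminal bound value as a live index.
void max_over_range_pop(SizerScope &st, int pending_var_id,
                        int64_t frame_result, int64_t &pending_max_accum) {
  if (frame_result > pending_max_accum)
    pending_max_accum = frame_result;
  st.scope[pending_var_id] = 0;
}
```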
**4. Runtime: upfront metadata evaluation + buffer upload**

`GfxRuntime::launch_kernel` evaluates every task's `SizeExpr` trees before `ensure_current_cmdlist()` opens the per-launch cmdlist. This is the load-bearing ordering change: the evaluator can call `SNodeRwAccessorsBank::read_int(...)` to resolve a field-load-bounded expression, which itself launches a reader kernel through `GfxRuntime::launch_kernel` - recursing into this same function with the outer cmdlist half-built would stomp `current_cmdlist_`. Upfront evaluation keeps the reader kernels serialised against any in-flight cmdlist through the normal submit/synchronize path.
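The ordering, sketched with stand-in types. Everything here is illustrative except the evaluate-before-`ensure_current_cmdlist()` sequencing described above; the exact upload point is an assumption:

```cpp
#include <cstdint>
#include <vector>

struct Task {
  std::vector<uint32_t> metadata_words;  // stride / offset / max_size per alloca
  int metadata_buffer = -1;
};

std::vector<uint32_t> evaluate_adstack_size_expr(const Task &) {
  // May call SNodeRwAccessorsBank::read_int, which launches a reader
  // kernel and re-enters launch(); safe only while no cmdlist is open.
  return {};
}
int upload_metadata(const std::vector<uint32_t> &) { return 0; }  // map_range + memcpy + unmap
int ensure_current_cmdlist() { return 0; }
void record_and_submit(int /*cmdlist*/, std::vector<Task> &) {}

void launch(std::vector<Task> &tasks) {
  // 1) Reentrancy-safe phase: evaluate every SizeExpr tree first.
  for (auto &t : tasks)
    t.metadata_words = evaluate_adstack_size_expr(t);
  // 2) One AdStackMetadata buffer per task (no pooling), so a sibling
  //    task's host memcpy cannot overwrite this task's values.
  for (auto &t : tasks)
    t.metadata_buffer = upload_metadata(t.metadata_words);
  // 3) Only now open the per-launch cmdlist and record the dispatches.
  const int cmdlist = ensure_current_cmdlist();
  record_and_submit(cmdlist, tasks);
}
```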
The metadata buffer is allocated per task (not pooled across the launch) so a sibling task's host memcpy cannot race-overwrite this task's values between cmdlist record and cmdlist submit. Each per-task buffer is retired into `ctx_buffers_` so it stays alive for the full submit/sync window. Layout, in u32 words (matching the `OpLoad` indices in the step-2 table):

- word `0` / `1` - per-thread heap stride for the float / int heap
- word `2 + 2*stack_id` - this alloca's heap offset, in elements
- word `2 + 2*stack_id + 1` - this alloca's `max_size` capacity
device_->map_range(...)memcpy; unmap before bind. The shader reads via the sameBufferType::AdStackMetadataaccess-chain plumbing as every other StorageBuffer in the task's bind set.5. Program <-> runtime plumbing
**5. Program <-> runtime plumbing**

`GfxRuntime::Params` gains a `ProgramImpl *program_impl` field, set by `MetalProgramImpl::materialize_runtime` and `VulkanProgramImpl::materialize_runtime` to `this`. The runtime uses it to reach `program_impl_->program` for `evaluate_adstack_size_expr(...)`. `Program::Program` already sets `program_impl_->program = this` post-`materialize_runtime`, so the back-reference is non-null by the time any `launch_kernel` runs.

### Per-backend coverage matrix
The SPIR-V backends also continue to honour the `max_size_compile_time` fallback on the offline-cache-hit path where the symbolic `SizeExpr` tree was not serialised (empty `size_expr.nodes`); the runtime falls back to it verbatim, clamping to `>= 1` like the LLVM counterpart does.
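A sketch of that per-alloca fallback selection, with the evaluator stubbed out (types reuse the illustrative step-1 sketch):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct SerializedSizeExpr {
  std::vector<uint32_t> nodes;  // empty on a pre-PR offline-cache hit
};

uint32_t evaluate_size_expr(const SerializedSizeExpr &) {
  return 0;  // stub for the real host/device evaluator
}

// Prefer the symbolic tree; otherwise take the compile-time value
// clamped to >= 1, matching the LLVM side.
uint32_t resolve_max_size(const SerializedSizeExpr &size_expr,
                          uint32_t max_size_compile_time) {
  if (!size_expr.nodes.empty())
    return evaluate_size_expr(size_expr);
  return std::max<uint32_t>(1u, max_size_compile_time);
}
```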
### Tests

#### New tests
- `test_adstack_bounded_inner_loop_sized_by_structural_prepass` (`tests/python/test_adstack.py`): pins the SPIR-V structural pre-pass correctness - a kernel whose inner range-for trip count is a constant product that the pre-pass must fold while the CFG analyzer would fall back to a pessimistic default. Arch-restricted to Metal / Vulkan because the LLVM backend routes the same kernel through the symbolic-tree path instead.
- `test_adstack_spirv_metadata_per_task_buffer` (`tests/python/test_adstack.py`, `cfg_optimization=False`): pins the per-task `AdStackMetadata` buffer fix - with a shared buffer the second offload's host memcpy overwrites the first offload's metadata before submit, causing an `Adstack overflow (offending stack_id=0)` at `qd.sync()` even though each stack's sizer-computed bound is correct. `cfg_optimization=False` is load-bearing; it keeps the two sibling offloads separate so the record-then-execute race surfaces.
- `AdStackSizerShader.GateReturnsEmptyWhenRequiredCapIsMissing` (`tests/cpp/codegen/adstack_sizer_shader_test.cpp`): pins the capability gate - dropping any one of PSB, Int64, Int8, Int16 from a synthetic caps config flips `build_adstack_sizer_spirv` to the empty-return path.
- `AdStackSizerShader.DumpBinary` (`tests/cpp/codegen/adstack_sizer_shader_test.cpp`): sanity-only smoke test that dumps the compiled binary to `/tmp/adstack_sizer.spv` for local `spirv-val` / `spirv-dis` use.

#### Widened arch coverage
Existing tests that exercised the runtime-evaluated adstack only on LLVM now run on SPIR-V too (`test_adstack_field_load_bounded_loop_evaluated_per_launch`, `test_adstack_inner_range_bounded_by_ndarray_read_at_outer_index`, `test_adstack_ext_tensor_read_indexed_by_stashed_outer_loop_var`, `test_adstack_structural_pre_pass_fuses_sub_of_max_over_range_with_matching_shape_ends`, `test_adstack_structural_pre_pass_fuses_sub_of_max_over_range_with_mismatched_shape_ends`), by dropping the `arch=[qd.cpu, ...]` restriction that scoped them to the LLVM sizer path.

#### Known coverage gaps
There is no dedicated Python regression test for the `scope[pending_var_id]` pop-path clear. The failure mode is hardware-dependent: on Metal / MoltenVK, and on Vulkan with `robustBufferAccess`, the spurious pre-pass out-of-bounds read is either caught silently or returns zero, both of which `pending_max_accum` (which starts at 0 at every `MaxOverRange` push) absorbs without surfacing. A test that fails deterministically would need a Vulkan device without robust buffer access or a simulated evaluator. The fix is mechanism-verified through code review and covered indirectly by the full adstack suite running through the sizer.
**Full suite:** `pytest tests/python/test_adstack.py -n 8` -> 712 passed, 7 xfailed on macOS arm64 (CPU + Metal + Vulkan / SwiftShader); wider backend coverage runs through the Linux AMDGPU box.

### Side-effect audit
- Serialisation: `TaskAttributes::QD_IO_DEF(..., ad_stack)` via the new `AdStackSizingAttribs::QD_IO_DEF(per_thread_stride_{float,int}_compile_time, allocas)` and `AdStackAllocaAttribs::QD_IO_DEF(heap_kind, offset_in_elems_compile_time, max_size_compile_time, size_expr)`.
- Offline cache: the serialised layout follows the `TaskAttributes` change. Pre-PR caches without the new fields are invalidated on first load and recompiled.
- Binding layout: `TaskCodegen::binding_head_` unchanged; `AdStackMetadata` is appended lazily via `get_buffer_value(...)` like every other buffer kind, so existing bindings (Args, Rets, Root, ExtArr, AdStackOverflow, AdStackHeap*) keep their binding indices.
- Non-adstack kernels: `attribs.ad_stack.allocas.empty()` short-circuits both the runtime evaluation loop and the metadata buffer allocation; codegen never calls `get_ad_stack_metadata_buffer()` in that case, so no `BufferType::AdStackMetadata` entry is added to `buffer_binds`.
- `ensure_current_cmdlist()` in `launch_kernel`: `SNodeRwAccessorsBank::read_int` reader-kernel launches now go through the normal submit/sync path; no partially-built cmdlist to stomp.
- `max_size` immediates and compile-time strides: the pre-scan still computes `ad_stack_heap_per_thread_stride_{float,int}_` as a compile-time sum and carries them into `AdStackSizingAttribs::per_thread_stride_{float,int}_compile_time`; the runtime uses those when every alloca's `size_expr.nodes.empty()` (pre-PR offline cache hit).
- `invoc_id * stride`: `get_ad_stack_heap_thread_base_{float,int}` still eagerly emits at the first alloca site, just with an `OpLoad` of the metadata stride instead of a `uint_immediate`.
- Buffer lifetime: each task allocates its own `AdStackMetadata` buffer and retires the previous one into `ctx_buffers_`.
- `scope_arr` reuse: `done_lbl` clears `scope[pending_var_id]` to zero.
- `adstack_sizer_shader.cpp` comments use ASCII arrows (`->`) only.