[AutoDiff] Autodiff 10: Implement adstack for SPIR-V #490
Conversation
description from opus: Summary: Implements the autodiff local-history stack ( What's in the PR: 1. SPIR-V
Could you update the doc to reflect these changes please? (Create a new autodiff doc if we don't already have any; ok for that to be a separate PR, but this PR should follow such a separate PR please. Context: I like using the doc as the 'gold standard' on which to base the test plan etc.)
| The workflow is: | ||
| 1. Allocate an adjoint (`.grad`) buffer next to every primal field gradients are needed for. |
I don't see where this is done in the self-contained example below?
| x = qd.field(qd.f32) | ||
| y = qd.field(qd.f32) | ||
| qd.root.dense(qd.i, 16).place(x, x.grad) | ||
| qd.root.place(y, y.grad) |
Add a comment here like `# 1. allocate an adjoint`
| ### Forward-mode AD via `qd.ad.FwdMode` | ||
| Forward mode propagates a tangent vector alongside the primal in a single forward pass and writes the directional derivative into a `.dual` companion field. The direction (the "seed") is fixed upfront; the result is a Jacobian-vector product. |
this seems to be referring to a specific example?
| Reverse mode returns every input gradient of one scalar output per pass; forward mode returns every output derivative along one input direction per pass. Pick accordingly: | ||
| - Few inputs, many outputs: forward mode. Example: one kinematic parameter of a robot, derivative of every joint with respect to it. | ||
| - Many inputs, one scalar loss: reverse mode. Example: loss over a million network weights. This is the training default. |
inconsistent ordering vs above. Also, isn't this duplicating what you wrote above? maybe cut?
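To make the trade-off concrete, here is a plain-Python model (illustrative only, not the qd API; the derivative formulas are hand-derived for a sum-of-squares loss): one reverse sweep yields every input gradient of the scalar loss, while one forward sweep yields the derivative along a single seed direction.

```python
# Plain-Python model of the reverse/forward trade-off (illustrative, not the qd API).

def loss(xs):
    # scalar loss over many inputs: sum of squares
    return sum(x * x for x in xs)

def reverse_grad(xs):
    # one backward pass yields d(loss)/d(x_i) for EVERY input
    return [2.0 * x for x in xs]  # hand-derived adjoint of sum of squares

def forward_jvp(xs, seed):
    # one forward pass yields the derivative along ONE direction `seed`
    return sum(2.0 * x * s for x, s in zip(xs, seed))

xs = [1.0, 2.0, 3.0]
g = reverse_grad(xs)                     # all gradients in one pass: [2.0, 4.0, 6.0]
jvp = forward_jvp(xs, [1.0, 0.0, 0.0])   # derivative along x_0 only: 2.0
```

With a million inputs and one scalar loss, reverse mode still needs only one pass; forward mode would need a million.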
| ### Forward-mode AD via `qd.ad.FwdMode` | ||
| Forward mode propagates a tangent vector alongside the primal in a single forward pass and writes the directional derivative into a `.dual` companion field. The direction (the "seed") is fixed upfront; the result is a Jacobian-vector product. |
Very unclear to me from this how FwdMode relates to what we were discussing earlier. Could we add some higher-level summary of what challenge FwdMode is trying to solve, and how it solves it.
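For intuition on what forward mode computes, a minimal hand-rolled dual-number sketch (illustrative; qd's FwdMode is a compiler transform, but the arithmetic it propagates is the same):

```python
# Minimal dual-number forward-mode AD (illustrative; qd's FwdMode is a compiler
# transform, but this is the arithmetic it implements).
class Dual:
    def __init__(self, primal, tangent):
        self.primal, self.tangent = primal, tangent

    def __mul__(self, other):
        # product rule carried alongside the primal
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

    def __add__(self, other):
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

# Differentiate y = x*x + x at x = 3 along the seed direction 1.
x = Dual(3.0, 1.0)   # tangent 1.0 is the seed, fixed upfront
y = x * x + x        # single forward pass; y.tangent is the JVP
```

Here `y.primal` is the ordinary result and `y.tangent` is dy/dx at x = 3, i.e. 2*3 + 1.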
| ### Overriding the compiler-generated gradient | ||
| Source-transforming the forward IR is correct by construction but not always desirable: |
What does IR have to do with anything? Seems like a very low level concept, out of place in user facing doc?
I find the doc readable up until about line 83. Then it just becomes like https://github.com/s-macke/Abstruse-Goose-Archive/blob/master/comics/474.md
Could we somehow add some higher level overview of the steps we are walking please?
| update_b() | ||
| ``` | ||
| Under `validation=True`, each `needs_grad=True` scalar field gets a companion single-byte checkbit field (`i32` on Vulkan). The compiler rewrites every forward kernel in the tape so that a `GlobalLoadStmt` sets the checkbit to 1 and every subsequent `GlobalStoreStmt` / `AtomicOpStmt` asserts the checkbit is still 0. Checkbits are cleared on tape entry. A violation raises `QuadrantsAssertionError` with the offending snode name and traceback. Kernels wrapped in `qd.ad.grad_replaced` are skipped; their gradient is the user's responsibility. |
This seems like way too much detail, and should go in an 'under the hood' section?
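The checkbit rule itself can be modeled in a few lines of plain Python (illustrative only; the real mechanism rewrites compiled kernels): a load marks the entry as read, and any later store to the same entry must find the mark clear.

```python
# Plain-Python model of the checkbit rule (illustrative, not the qd implementation):
# a load sets the entry's checkbit; a later store to the same entry asserts it is 0.
class CheckedField:
    def __init__(self, n):
        self.data = [0.0] * n
        self.checkbit = [0] * n   # cleared on "tape entry"

    def load(self, i):
        self.checkbit[i] = 1      # mark: this entry has been read
        return self.data[i]

    def store(self, i, v):
        assert self.checkbit[i] == 0, f"entry {i} written after being read"
        self.data[i] = v

x = CheckedField(4)
x.store(0, 1.0)          # write-before-read: fine
v = x.load(0)            # read sets the checkbit
try:
    x.store(0, v + 1.0)  # read-then-write: the checker fires
    violated = False
except AssertionError:
    violated = True
```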
| Under `validation=True`, each `needs_grad=True` scalar field gets a companion single-byte checkbit field (`i32` on Vulkan). The compiler rewrites every forward kernel in the tape so that a `GlobalLoadStmt` sets the checkbit to 1 and every subsequent `GlobalStoreStmt` / `AtomicOpStmt` asserts the checkbit is still 0. Checkbits are cleared on tape entry. A violation raises `QuadrantsAssertionError` with the offending snode name and traceback. Kernels wrapped in `qd.ad.grad_replaced` are skipped; their gradient is the user's responsibility. | ||
| Validation adds per-access runtime work and extra memory, so it is opt-in and only honored under `debug=True`. Use it while developing a new differentiable kernel; drop it in production. |
what does validation do? why do we want it?
| Automatic differentiation (autodiff) computes the exact gradient of a kernel's output with respect to its inputs, without the user writing the derivative formulas by hand. Gradient-based optimizers then use this gradient to train neural networks, fit physical models to data, drive differentiable simulators, or solve inverse problems. | ||
| Quadrants implements autodiff as a source-to-source IR transform: when `.grad()` is requested, the compiler emits a companion kernel that runs on the same backend as the forward one and writes gradients into the primal fields' `.grad` companions. There is no Python-side tape, no per-op dispatch overhead, and no dependency on an external AD framework - the backward pass is a fused kernel, usually only marginally slower than the forward. |
why does it run slower than forward mode?
| - Checkpointing: re-run part of the forward on the backward pass instead of keeping intermediates. | ||
| - `qd.ad.Tape` needs to drive a section whose gradient is supplied by hand, while auto-differentiating everything around it. | ||
| `qd.ad.grad_replaced` decorates a plain Python function wrapping one or more kernel calls; `qd.ad.grad_for(primal)` decorates the function that plays the role of its gradient. `Tape` runs the replaced forward on entry and the user-supplied gradient on exit, bypassing the auto-generated one. |
I think this reads more like a reference section, and less like a step by step manual
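As a plain-Python model of the override mechanism (the decorator names mirror `qd.ad.grad_replaced` / `qd.ad.grad_for`, but the registry and `backward` driver below are invented for illustration):

```python
# Illustrative model: register a custom gradient for a forward function, and have
# a toy "tape" call the custom gradient instead of an auto-generated one.
_custom_grads = {}

def grad_replaced(fwd):
    return fwd                      # a marker in the real API; identity here

def grad_for(primal):
    def register(grad_fn):
        _custom_grads[primal] = grad_fn
        return grad_fn
    return register

@grad_replaced
def double(x):                      # forward: y = 2x
    return 2.0 * x

@grad_for(double)
def double_grad(dy):                # user-supplied gradient: dx = 2*dy
    return 2.0 * dy

def backward(fwd, dy):
    # tape exit: use the registered custom gradient, bypassing autodiff
    return _custom_grads[fwd](dy)

dx = backward(double, 1.0)
```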
| Constraints: | ||
| - The decorated forward must be a regular Python function, not a `@qd.kernel`. Wrap kernels inside a Python function. | ||
| - Under `validation=True` (see below), custom-gradient sections are exempt from the global data access rule; correctness is the user's responsibility. |
let's not discuss things we haven't introduced clearly yet, i.e. let's avoid forward references such as 'see below'.
| ### Global data access rules and the validation checker | ||
| Source-transformed reverse-mode AD is correct only when the forward kernel obeys two rules on global memory access (primal fields and ndarrays): |
have we defined what 'source-transformed' means? I think we've hinted at it, but I don't think we've explicitly defined it?
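One way to pin the term down with a hand-worked sketch (illustrative; this shows what the transform conceptually emits, not actual qd output): the compiler takes the forward statements and generates a second, standalone backward function in the same language, with adjoint statements in reverse order of the forward ones.

```python
# "Source-transformed" modeled by hand (illustrative): the backward pass is
# itself ordinary code derived from the forward statements, not a runtime tape.
def forward(a, b, c):
    y = a * b
    z = y + c
    return z

def backward(a, b, c, dz):
    # adjoint statements, emitted in reverse order of the forward ones
    dy = dz          # z = y + c  =>  dy = dz, dc = dz
    dc = dz
    da = dy * b      # y = a * b  =>  da = dy*b, db = dy*a
    db = dy * a
    return da, db, dc

da, db, dc = backward(2.0, 3.0, 1.0, 1.0)
```

Because the backward function is generated once at compile time, there is no per-operation dispatch at runtime.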
| Source-transformed reverse-mode AD is correct only when the forward kernel obeys two rules on global memory access (primal fields and ndarrays): | ||
| 1. Within a single kernel launch, a field entry that has been read must not be written to afterward. Overwriting after reading destroys the primal value the reverse pass needs for local partials. |
this says 'field'. What about 'ndarray'?
| Source-transformed reverse-mode AD is correct only when the forward kernel obeys two rules on global memory access (primal fields and ndarrays): | ||
| 1. Within a single kernel launch, a field entry that has been read must not be written to afterward. Overwriting after reading destroys the primal value the reverse pass needs for local partials. | ||
| 2. Different kernel launches may read and write the same field freely; the constraint is strictly per-launch. |
why is this a rule? this seems more like a commentary. The previous rule already said 'within the same kernel launch'.
| 1. Within a single kernel launch, a field entry that has been read must not be written to afterward. Overwriting after reading destroys the primal value the reverse pass needs for local partials. | ||
| 2. Different kernel launches may read and write the same field freely; the constraint is strictly per-launch. | ||
| Most violations follow the "read `x[i]`, then overwrite `x[i]`" pattern, often in the form of an in-place update like `x[i] = x[i] + dt * v[i]` inside a loop that also reads `x[i]` earlier in the body. The fix is typically to split the update across two fields (double-buffer with `x_new`) or across two kernels. |
we haven't introduced this pattern yet. Let's first introduce an example that violates the rule, then name this the 'read `x[i]`, then overwrite `x[i]`' pattern.
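A plain-Python model of why the rule matters (illustrative; the real rule applies to compiled kernels): the reverse pass of `y = x*x` needs the original `x`, so overwriting `x` after reading it makes the recomputed partial wrong, and double-buffering preserves it.

```python
# y = x*x needs dy/dx = 2*x with the ORIGINAL x. If the forward pass overwrites
# x after reading it, a reverse pass that re-reads x sees the clobbered value.
def forward_in_place(buf):
    y = buf[0] * buf[0]   # read x
    buf[0] = y            # overwrite x after reading it (rule-1 violation)
    return y

def forward_double_buffered(buf, buf_new):
    y = buf[0] * buf[0]
    buf_new[0] = y        # write to a second buffer instead
    return y

buf = [3.0]
forward_in_place(buf)
wrong = 2.0 * buf[0]      # reverse pass reads clobbered x: 18.0, not 6.0

buf, buf_new = [3.0], [0.0]
forward_double_buffered(buf, buf_new)
right = 2.0 * buf[0]      # primal preserved: 6.0
```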
| Most violations follow the "read `x[i]`, then overwrite `x[i]`" pattern, often in the form of an in-place update like `x[i] = x[i] + dt * v[i]` inside a loop that also reads `x[i]` earlier in the body. The fix is typically to split the update across two fields (double-buffer with `x_new`) or across two kernels. | ||
| Violating example: |
put this first, above the previous paragraph.
| b[None] += 100 | ||
| ``` | ||
| Fixed: |
Explain the process we used to fix it (separate into two kernels, I guess?).
| update_b() | ||
| ``` | ||
| Quadrants can check rule 1 at runtime. Pass `validation=True` to `qd.ad.Tape`, with `qd.init(debug=True)`: |
"To get Quadrants to validate rule 1 at runtime, pass ..."
| Validation adds per-access runtime work and extra memory, so it is opt-in and only honored under `debug=True`. Use it while developing a new differentiable kernel; drop it in production. | ||
| ## The adstack pipeline |
## Autodiff with dynamic loops
i.e. the problem we are trying to solve, not the implementation.
| ## The adstack pipeline | ||
| Reverse-mode AD through a dynamic loop must remember, for every iteration, the primal values the reverse pass needs to evaluate local partials. Quadrants stores this history in an autodiff stack (adstack): a per-variable stack the forward pass pushes to and the reverse pass pops from. |
"Reverse-mode AD through a dynamic loop must remember the forward outputs for every iteration. To do this, we need to enable a feature called 'adstack'."
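The stack idea can be modeled in plain Python (illustrative; the real adstack is emitted by the compiler): differentiate a loop that repeatedly squares its value, pushing each iteration's primal so the reverse sweep can evaluate its local partial `d(v*v)/dv = 2v` with the value that iteration actually read.

```python
# Forward: repeatedly square x, pushing the pre-squaring value each iteration.
# Reverse: pop in LIFO order; each local partial 2*v needs the primal that
# iteration read, which only the stack still has.
def forward(x, n, stack):
    v = x
    for _ in range(n):
        stack.append(v)   # push the primal the reverse pass will need
        v = v * v
    return v

def backward(n, stack, dy):
    adj = dy
    for _ in range(n):
        v = stack.pop()   # pop restores that iteration's primal
        adj = adj * 2.0 * v
    return adj

stack = []
y = forward(3.0, 2, stack)     # ((3^2)^2) = 81
dx = backward(2, stack, 1.0)   # d(x^4)/dx = 4*x^3 = 108 at x = 3
```

The per-variable stack is exactly the "local history" the forward pass records and the reverse pass consumes.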
@claude review
@claude review
All bugs from my prior review rounds have been fixed; deferring to a human for final sign-off given the size of the autodiff transform and SPIR-V codegen changes.
Extended reasoning...
Overview
PR #490 adds six new SPIR-V visitors for the `AdStack*Stmt` family, wires the `adstack` extension for Metal and Vulkan, extends the GFX runtime to round-trip host grad tensors through side-allocated device buffers, fixes the Metal RHI to surface null-pipeline creation as a `RuntimeError`, adds guards against destructor-side throws, and delivers several autodiff-transform correctness fixes (save/restore, cursor-vs-backup comparison across all visitors, alloca_block leak). A 250-line user-guide doc (`autodiff.md`) is added. 14 files changed.
Security risks
No auth, crypto, or permission-sensitive paths touched. The new SSBO flag is single-u32 and host-polled after ; the used for the overflow write is race-safe. No injection vectors introduced.
Level of scrutiny
High. The changes are backend-agnostic and affect every existing CUDA/CPU/AMDGPU autodiff user, not just the new Metal/Vulkan path. The GFX runtime grad-buffer round-trip is new infrastructure. Both deserve human eyes on correctness and test coverage before the flag de-experiments.
Other factors
All six rounds of bugs I found were fixed by the author with clear explanations. Test surface is extensive ( +292 lines, plus updated ). One design choice — unconditional grad blit in — was declined with sound rationale (the bitmask tracks the data slot only; gating on WRITE would silence backward seeds). The remaining open threads are doc-quality nits from the human reviewer, not correctness issues.
Opus says this is multiple streams of work bundled together: three loosely-coupled streams: (a) port adstack to SPIR-V backends (codegen + runtime + Metal/Vulkan extension flag), (b) fix a half-dozen distinct
User: look at the diff between duburcqa/fix_adstack_perf and origin/duburcqa/llvm_adstack_safety
Assistant: Quick check: your local
User: yes
Assistant: Done. Local is now at
Want me to do the diff between this branch and
User: yes
Assistant: Diff between
Themes
1. Enable adstack on SPIR-V backends (Metal, Vulkan)
2. Reverse-mode AD correctness fixes in
| # | Theme | Test that pins it |
|---|---|---|
| 1 | autodiff.md docs | n/a (docs PR) |
| 2 | AdStackAllocaJudger direct-walker fix (drop or bundle defensive part) | test_adstack_inner_for_bound_is_enclosing_loop_index |
| 3 | MakeAdjoint::visit(RangeForStmt) current_block save/restore | runtime-guarded-if-wraps-for test (in test_adstack.py additions) |
| 4 | MakeAdjoint+MakeDual::visit(StructForStmt) save/restore | bundled because forward-mode may lack a dedicated test |
| 5 | MatrixPtrStmt(AdStackLoadTopStmt) slot-load fix | tensor-subscript-on-stack test in test_adstack.py additions |
| 6 | BackupSSA::visit(Range/StructForStmt) cross-block operand spill | inner-for k in range(j) test |
| 7 | Stack-backed IfStmt cond snapshot | short-circuit-&&-pushes-cond-stack test |
| 8 | Sibling for-loop reverse order in non-IB blocks | needs verification — does the diff include a test specifically for this? |
| 9 | For-bound alloca promotion | bundle with #2 if same test covers both, else dedicated test |
| 10 | SPIR-V port (codegen + runtime + Metal pipeline error + extension flag) | strengthened shader-compile-failure test + near-capacity gradient tests |
| 11 | SPIR-V ndarray grad buffer plumbing | test_ad_ndarray_torch.py arch-list cleanup |
That's 11 if everything separates cleanly with its own test, 7–8 if some need to be bundled because their tests overlap or one of the fixes lacks dedicated coverage.
What I'd verify before finalizing the split
Three things to spot-check:
- `MakeDual::visit(StructForStmt)` — is there a forward-mode test in the diff? If yes, split it; if no, bundle with #4.
- Sibling for-loop reverse order (commit 2ea9514f3) — is there a test specifically for it? If not, either add one or fold into whichever existing test happens to exercise the kernel shape.
- For-bound alloca promotion (46f71ed40) — does it have its own test or share one with the judger fix? If shared, bundle into one PR; if separate, split.
That spot-check would settle whether the count is closer to 7 or to 11.
Process implication
The "each PR has its own test" constraint is a healthy filter — it forces you to confront cases where the diff contains code without coverage. In this PR there are at least two such cases (the defensive walker recursion and possibly the MakeDual struct-for fix). Under the constraint they either need test coverage added before splitting, or they get bundled with adjacent code that does have a test. Both are better outcomes than splitting them out as orphans.
Please split into the following PRs:
(that sounds like a lot, so if you want to argue for fewer, then please go ahead; but let's split it somewhat at least please).
I'm not ok with splitting and leaving the code in a completely unusable state. I will try to split, but I don't think it is possible to ship something that works while doing this.
checklist:
=> ok to merge
…ot only handles bare AdStackLoadTopStmt and leaves compound conds to BackupSSA's spill branch
| // from every stmt nested inside the for-loop's body that references an outer-block stmt as a | ||
| // free variable. The body-use gather is what catches the case where the later sibling | ||
| // for-loop consumes a non-loop outer-block stmt `S` inside its body (e.g. `for_B: body reads | ||
| // S`) rather than through `for_B`'s range bound: `RangeForStmt::get_operands()` returns only | ||
| // `{begin, end}`, so without walking the body `S` would miss `must_hoist`, the pairwise swap | ||
| // would place `for_B` ahead of `S`, and the IR verifier would reject the SSA violation. | ||
| for (int fi : for_indices) { | ||
| for (Stmt *op : block->statements[fi]->get_operands()) { | ||
| push_if_internal(op); | ||
| } | ||
| Stmt *for_stmt = block->statements[fi].get(); | ||
| irpass::analysis::gather_statements(for_stmt, [&](Stmt *body_stmt) { | ||
| for (Stmt *op : body_stmt->get_operands()) { | ||
| push_if_internal(op); | ||
| } | ||
| return false; | ||
| }); | ||
| } | ||
| while (!stack.empty()) { | ||
| Stmt *s = stack.back(); | ||
| stack.pop_back(); | ||
| for (Stmt *op : s->get_operands()) { | ||
| push_if_internal(op); | ||
| } | ||
| } |
There was a problem hiding this comment.
🔴 The must_hoist walker in reverse_for_loop_order_in_place (auto_diff.cpp:1058-1069) only invokes the lambda on non-container body statements when walking each sibling for-loop. When a non-loop stmt S sits between two sibling for-loops and is referenced as the begin/end operand of a nested for inside the later sibling's body (e.g. for_outer_2 body contains for_inner_2(end=trip_load), where trip_load is the between-stmt and the inner body does not itself reference trip_load), the walker never calls get_operands() on the nested for, so S is absent from must_hoist. The pairwise swap then lands for_outer_2 ahead of S, creating an SSA dominance violation that the IR verifier rejects at compile time (or, if verification is skipped, causes the shader to read an uninitialised register). Fix: also invoke push_if_internal on container stmts' begin/end/cond operands inside the gather walk - e.g. override preprocess_container_stmt on a dedicated visitor so the lambda fires on the container itself before recursing into its body.
Extended reasoning...
What the bug is
reverse_for_loop_order_in_place seeds its must_hoist frontier at auto_diff.cpp:1058-1069 by calling get_operands() directly on each sibling for-loop and by walking each sibling's body via irpass::analysis::gather_statements(for_stmt, lambda) to discover in-block producers the body consumes. gather_statements uses StmtSearcher (analysis/gather_statements.cpp:7-30), which inherits from BasicStmtVisitor and only overrides visit(Stmt *). BasicStmtVisitor's overrides for RangeForStmt / StructForStmt / IfStmt (basic_stmt_visitor.cpp:20-41) call only preprocess_container_stmt (the default is a no-op) and recurse into the body block - they never dispatch to visit(Stmt *). StmtSearcher does not override preprocess_container_stmt, and although it sets invoke_default_visitor = true, the DEFINE_VISIT-generated default path fires only when visit(T *) is not overridden; BasicStmtVisitor's explicit container overrides take precedence.
Net effect: when gather_statements walks a sibling for-loop's body, the test_ callback (and therefore the push_if_internal seed) is invoked for every plain (non-container) body statement but never for a nested RangeForStmt / StructForStmt / IfStmt. Those containers' own operands (begin, end, cond) are never consulted by the seed walk.
Concrete trigger
Layout inside a non-IB container block:
```
[for_outer_1,                        // sibling 1
 trip_load = GlobalLoad,             // between-stmt at position 1
 for_outer_2 {                       // sibling 2
   body: {
     for_inner_2(end = trip_load) {  // nested-for: end operand points at trip_load
       body: uses only its own induction var, no ref to trip_load
     }
   }
 }]
```
for_indices = [0, 2], first_for = 0. Seeding from for_outer_2->get_operands() yields only {begin_2, end_2} (typically ConstStmt range bounds defined before first_for and skipped by push_if_internal's position filter). gather_statements(for_outer_2, lambda) then walks for_outer_2->body: it encounters for_inner_2, dispatches to BasicStmtVisitor::visit(RangeForStmt) which skips the lambda on the container itself and only recurses into for_inner_2->body. The lambda is then called on body stmts that do not reference trip_load. trip_load is therefore absent from must_hoist.
The hoist phase moves nothing. The pairwise swap lands for_outer_2 at position 0 while trip_load stays at position 1, producing [for_outer_2, trip_load, for_outer_1]. for_inner_2 (inside for_outer_2's body) now references trip_load before trip_load is defined in block order.
Why existing code does not catch it
BackupSSA::generic_visit only spills cross-block operands (op->parent outside the using stmt's leaf-to-root ancestor chain). trip_load and for_outer_2 share the same parent block, so the container block is in for_inner_2's leaf-to-root chain and trip_load is classified as in-scope - no spill is emitted. irpass::analysis::verify running immediately after the pass (auto_diff.cpp:2585) rejects the IR with an SSA dominance violation; on backends that skip verification, SPIR-V codegen's ir_->query_value(trip_load->raw_name()) fails because trip_load has not been registered when the for_inner_2 range visitor fires.
Why existing tests miss it
- `test_ad_sibling_for_loops_with_dynamic_trip_count_between_them` puts the between-stmt as the direct range bound of a top-level sibling for-loop - caught by the first seed loop at auto_diff.cpp:1059-1060 that walks `get_operands()` of the sibling itself.
- `test_ad_sibling_for_loops_with_body_use_of_between_stmt` puts the between-stmt as a free variable inside a sibling-for body directly (a `BinaryOp` body stmt referencing it) - caught by the `gather_statements` walk because the `BinaryOp` is a plain (non-container) stmt so `test_` is called on it.
- No existing test places the between-stmt as the `begin`/`end` operand of a nested container inside a sibling's body.
Step-by-step proof
1. Container block statements (ordered): s0=for_outer_1, s1=trip_load, s2=for_outer_2 (body = [for_inner_2]), where for_inner_2->end = trip_load.
2. `for_indices = [0, 2]`, `first_for = 0`, `pos_of[trip_load] = 1`.
3. Seed: `for_outer_1->get_operands()` = {begin_1, end_1} - both ConstStmts at positions < first_for, `push_if_internal` skips both. `for_outer_2->get_operands()` = {begin_2, end_2} - same, skipped.
4. `gather_statements(for_outer_1, lambda)` walks its body. Assume it references nothing at positions >= first_for. `must_hoist` still empty.
5. `gather_statements(for_outer_2, lambda)` walks its body. The only top-level statement is `for_inner_2`. `StmtSearcher` dispatches to `BasicStmtVisitor::visit(RangeForStmt)` for it, which calls `preprocess_container_stmt` (no-op) and recurses into `for_inner_2->body`. The lambda is never called with `for_inner_2` as `body_stmt`, so `for_inner_2->get_operands()` (= {begin_inner, trip_load}) is never queried. Assume the body stmts inside `for_inner_2` do not reference `trip_load` directly. `must_hoist` stays empty.
6. Transitive `while (!stack.empty())` loop does nothing (stack empty).
7. `must_hoist` is empty. Hoist phase moves nothing. Suffix = [for_outer_1, trip_load, for_outer_2]; `suffix_for_positions = [0, 2]`. Pairwise swap exchanges positions 0 and 2 -> [for_outer_2, trip_load, for_outer_1]. `for_inner_2` (inside `for_outer_2`'s body at new position 0) references `trip_load` at position 1. SSA dominance violated; verify rejects or codegen faults.
Fix
Replace the gather_statements seed walk with a dedicated visitor that invokes the callback on container stmts as well (e.g. override preprocess_container_stmt to call the lambda on the container before recursing), so RangeForStmt / StructForStmt / IfStmt inside a sibling body contribute their own begin / end / cond operands to must_hoist. A minimal patch adds those operands explicitly for every nested for / if encountered during the body walk.
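The shape of the fix can be modeled on a toy statement tree (illustrative; the `Stmt` class and walkers below are invented, not the quadrants IR): a walker whose callback fires only on non-container statements never sees a nested loop's own operands, while one that also visits containers does.

```python
# Toy IR: a statement has operands and (for containers) a body of child stmts.
class Stmt:
    def __init__(self, name, operands=(), body=()):
        self.name, self.operands, self.body = name, list(operands), list(body)

def gather_leaf_only(stmt, found):
    # models the buggy walk: the callback fires only on non-container stmts,
    # so a nested for-loop's own begin/end operands are never consulted
    for child in stmt.body:
        if child.body:
            gather_leaf_only(child, found)   # recurse, skip the container itself
        else:
            found.update(child.operands)

def gather_with_containers(stmt, found):
    # models the fix: container stmts contribute their own operands too
    for child in stmt.body:
        found.update(child.operands)
        if child.body:
            gather_with_containers(child, found)

trip_load = Stmt("trip_load")
inner = Stmt("for_inner", operands=[trip_load],   # end = trip_load
             body=[Stmt("use_induction_var")])
outer = Stmt("for_outer", body=[inner])

buggy, fixed = set(), set()
gather_leaf_only(outer, buggy)        # misses trip_load
gather_with_containers(outer, fixed)  # finds trip_load
```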
Implement adstack for SPIR-V (Metal, Vulkan)
TL;DR
Each visitor materialises the corresponding inlined SPIR-V. Semantics mirror the LLVM runtime helpers in `runtime.cpp` one-for-one (push increments `count_var`, pop decrements, load-top reads `[count_var - 1]`, acc-adjoint adds into `adjoint_arr[count_var - 1]`), with no runtime call — everything is emitted as plain SPIR-V.
Why
Until now, `MakeAdjoint` produced `AdStackAllocaStmt` / `AdStackPushStmt` / etc. for any adstack-opted-in kernel, and the LLVM backends lowered them to runtime-helper calls. SPIR-V had no visitors for these statements — the `kernel_compiler.cpp` hardcoded `ad_use_stack = false` so `MakeAdjoint` would route around them on SPIR-V, and every user who tried to run the adstack extension on Metal or Vulkan got a kernel-level failure or silently-wrong gradients. Bringing feature parity to SPIR-V is the goal.
Mechanism
`quadrants/codegen/spirv/kernel_compiler.cpp`
Single-line flip: now `MakeAdjoint` emits `AdStack*Stmt` on SPIR-V when the user opts into the extension, and the new visitors below handle them.
`quadrants/codegen/spirv/spirv_codegen.cpp` (new visitors)
- `visit(AdStackAllocaStmt)`: materialises three Function-scope variables per stack — a `u32` `count_var`, a `primal_arr`, and an `adjoint_arr`, both of type `Array<elem, max_size>`. Stores them in `ad_stacks_` keyed by the alloca stmt. `count_var` is zeroed at the allocation site.
- `visit(AdStackPushStmt)`: loads `count_var`, stores the new primal at `primal_arr[count]`, zeroes the matching `adjoint_arr[count]`, stores `count + 1` back into `count_var`.
- `visit(AdStackPopStmt)`: loads `count_var`, stores `count - 1` back.
- `visit(AdStackLoadTopStmt)` / `visit(AdStackLoadTopAdjStmt)`: loads `count_var`, reads `primal_arr[count - 1]` / `adjoint_arr[count - 1]`.
- `visit(AdStackAccAdjointStmt)`: loads `count_var`, reads `adjoint_arr[count - 1]`, adds the new adjoint, writes back.
The on-chip `Array<T, max_size>` design is the same shape as the LLVM Function-scope path. On Metal, Apple's MSL translator caps per-thread Function-scope memory at a few dozen to a few hundred kilobytes depending on model; that cap is the reason Autodiff 12 moves the storage to a shared heap.
`quadrants/program/extension.cpp`
Adds `Extension::adstack` to both the Metal and Vulkan supported-extension sets. Previously both were empty, so any user code that declared `require=qd.extension.adstack` and tried to run on SPIR-V was skipped silently at the test decorator level.
`quadrants/program/compile_config.h`
The comment above `default_ad_stack_size` is rewritten to reflect the new SPIR-V on-chip reality: on SPIR-V the allocation lives in per-thread on-chip memory which the driver caps at a few kilobytes, so the fallback default stays small.
`quadrants/runtime/llvm/runtime_module/runtime.cpp`
The existing `stack_push` runtime helper previously wrapped past `max_num_elements` by a bare `n++`. This PR replaces that with a hard `QD_ASSERT(...)` so the LLVM path loudly surfaces the same condition that SPIR-V overflow makes silent (due to no bounds-checked GLSL/MSL path). Autodiff 8 subsequently refines this into a catchable Python exception — here we just make sure the pre-existing silent wrap is at least loud.
Tests
`tests/python/test_adstack.py`
Every Autodiff 1-6 test now runs on Metal and Vulkan as well: the tests are decorated with `require=qd.extension.adstack`, and before this PR those arches were silently skipped because the extension was not registered. After this PR, the whole existing adstack test matrix exercises SPIR-V too.
New SPIR-V-specific test: `test_adstack_shader_compile_failure_raises` — at `ad_stack_size=65536` with four loop-carried f32 variables, Apple's MSL translator rejects the pipeline with `XPC_ERROR_CONNECTION_INTERRUPTED`. The test asserts the error surfaces as a `RuntimeError` matching "Failed to create pipeline" instead of either crashing the process or silently launching a null pipeline. Scoped to Metal only because Vulkan drivers vary widely on what per-thread Function-scope footprint they will accept (calibrating a single threshold that every CI Vulkan driver rejects is brittle).
`tests/python/test_ad_if.py`
Existing nested-if tests that were skipped on SPIR-V (no `require=adstack` gate or an explicit SPIR-V exclude) now run on Metal and Vulkan with the extension enabled.
`tests/python/test_intrinsics.py`
Unrelated small adjustment (parameter change on an existing test) to accommodate a tangentially-related codegen change; does not affect this PR's feature.
Side-effect audit
`extension.cpp` only affects the SPIR-V supported set. `compile_config.ad_stack_experimental_enabled` gates everything. The `stack_push` wrap on LLVM: `QD_ASSERT` aborts loudly; further refined into a Python exception in Autodiff 8.
Stack
Autodiff 10 of 13. Based on #495 (budget guard). Followed by #536 (latent adstack fixes).