💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ee743e91b5
This PR adds a structural pre-pass to `irpass::determine_ad_stack_size` and guards the Bellman-Ford analyzer against clobbering pre-resolved stack sizes — non-trivial autodiff compiler logic that warrants a human expert reviewing the IR parent-chain walk assumptions and the correctness of the pre-pass/Bellman-Ford interaction.
Overview
The PR touches two files in the autodiff pipeline: `determine_ad_stack_size.cpp` (new structural pre-pass) and `control_flow_graph.cpp` (guard to skip pre-resolved stacks in the Bellman-Ford analyzer). A test file is updated to keep an existing test on the Bellman-Ford fallback path, and a new regression test is added for SPIR-V backends.
Security risks
None. This is a compiler optimization pass with no security-sensitive surfaces.
Level of scrutiny
The change is correctness-critical: an off-by-one in the pre-pass size bound, or a misclassification of a parent container statement, could silently under-allocate an adstack and produce wrong gradients or runtime overflow crashes. The key assumption is that after offloading the outer parallel loop appears as an `OffloadedStmt` (not a `RangeForStmt`) in the IR parent chain, so the walk correctly accumulates only the inner bounded-range multipliers. This assumption is backend-specific, and the PR notes that the LLVM path already differs (`LoopIndexStmt` rewrite). A reviewer familiar with the IR lowering pipeline and the CFG-analysis interaction should verify this assumption holds across all targeted backends and pipeline configurations.
Other factors
No bugs were found by the automated hunting system. The change is well-motivated (production Metal OOM), well-commented, includes overflow guards, and the test suite is carefully updated. The rewrite to use a runtime field is thoughtful — it preserves the test's purpose of pinning the Bellman-Ford fallback knob. Given the algorithmic complexity and backend-specific IR structure assumptions, this merits a human reviewer with autodiff/CFG expertise.
Checklist: ok to merge
Structural adstack sizing: bounded inner loops skip the `default_ad_stack_size` fallback

TL;DR
Before this PR, the CFG Bellman-Ford analyzer flagged every push inside the `range(5)` as a "positive loop" and capped the adstack at `default_ad_stack_size=1`, overflowing at runtime. After, the pre-pass proves the per-thread stack depth is bounded by the product of the enclosing static range trip counts and sets `max_size = 5` directly; the default is only consulted for unbounded shapes (`while` loops, `range(field_load)`).

Why
The heap-backed SPIR-V adstack introduced in #493 sizes its per-dispatch `StorageBuffer` as `ad_stack_heap_per_thread_stride * dispatched_threads`. On wide-ndrange reverse-mode kernels with a few hundred local variables promoted to adstacks, every adstack whose push lives inside a `range(N)` inner loop previously fell back to `default_ad_stack_size = 256` via Bellman-Ford's positive-loop detection. Concrete numbers on a 600 000-thread reverse-mode grid op: 256 of 563 f32 adstacks plus 64 of 408 i32/u1 adstacks landed on the 256-cap fallback, bloating `ad_stack_heap_per_thread_stride_float` to 132 574 slots and pushing the required buffer past 41 GB - hard-failing Metal's `maxBufferLength = 28 GB` cap with `Failed to allocate adstack heap int buffer (size=41305523712)` before the dispatch could run.

Bellman-Ford's pessimism is by design: it counts pushes and pops per CFG block and flags any cycle whose net push count is positive. It has no loop-bound awareness, so a `range(27)` stencil sweep reads the same as a `while True` loop - both are "positive loop, use the configured default". The traditional workaround, raising `ad_stack_size` or `default_ad_stack_size` via `qd.init(...)`, goes the wrong direction - the problem is that 256 is too big for these kernels, not too small.

This PR teaches `irpass::determine_ad_stack_size` to resolve the bounded-loop case up front, without needing Bellman-Ford at all.

Surface API
No user-visible API changes. The existing `CompileConfig::default_ad_stack_size` knob still exists and still behaves the same way for the unbounded fallback path. The new logic lives entirely in `irpass::determine_ad_stack_size` (`quadrants/transforms/determine_ad_stack_size.cpp`) and runs automatically as part of every reverse-mode kernel's `offload_to_executable` pipeline (same call site as before).

Mechanism end-to-end
1. Structural pre-pass in `determine_ad_stack_size.cpp`

`irpass::determine_ad_stack_size` runs before the CFG Bellman-Ford analyzer. For each `AdStackAllocaStmt` with `max_size == 0` (adaptive):

- Collect every `AdStackPushStmt` whose `stack == alloca` via `irpass::analysis::gather_statements`.
- Walk the parent chain, starting at `push->parent->parent_stmt()` and moving outward.
- `RangeForStmt` -> call `try_eval_const_i32(begin)` and `try_eval_const_i32(end)`. If both fold, multiply `multiplier *= (end - begin)`. If either fails to fold, mark the alloca as unbounded and stop.
- `StructForStmt`, `WhileStmt` -> mark unbounded and stop (compile-time trip count is not available).
- Non-loop containers (`IfStmt`, `MeshForStmt` body, etc.) -> pass through without affecting the multiplier. These do not iterate, so one enclosing execution = one push.
- If every push resolves, write the accumulated bound to `alloca->max_size`.
- Otherwise leave `max_size = 0` so Bellman-Ford can still try.

The bound is pessimistic with respect to mutually exclusive if-branches (summing pushes across both arms rather than taking the max), which is safe for heap sizing: mutually exclusive pushes waste slots but never under-allocate. Overflow checks guard both the multiply and the per-push accumulate before writing the final value.
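The walk described above can be sketched as a small Python model. The classes below are illustrative stand-ins for the IR statements named in the list, not the actual quadrants C++:

```python
# Toy model of the structural pre-pass walk: starting from a push, walk
# outward through enclosing statements and multiply the trip counts of
# constant-bounded range loops. All class names are stand-ins.
class RangeFor:
    def __init__(self, begin, end, parent=None):
        self.begin, self.end, self.parent = begin, end, parent

class WhileLoop:  # stands in for WhileStmt / StructForStmt: unbounded
    def __init__(self, parent=None):
        self.parent = parent

class IfBranch:   # non-loop container: passes through unchanged
    def __init__(self, parent=None):
        self.parent = parent

def bound_for_push(node):
    """Return the per-thread push bound, or None if unbounded."""
    multiplier = 1
    while node is not None:
        if isinstance(node, RangeFor):
            # both bounds must fold to compile-time constants
            if not (isinstance(node.begin, int) and isinstance(node.end, int)):
                return None
            multiplier *= node.end - node.begin
        elif isinstance(node, WhileLoop):
            return None  # no static trip count available
        # IfBranch etc.: one enclosing execution = one push
        node = node.parent
    return multiplier

# push inside an `if` inside range(5) inside range(3) -> bound 15
outer = RangeFor(0, 3)
inner = RangeFor(0, 5, parent=outer)
assert bound_for_push(IfBranch(parent=inner)) == 15
# push under a while loop -> unbounded, falls through to Bellman-Ford
assert bound_for_push(WhileLoop(parent=outer)) is None
```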
2. `try_eval_const_i32` - a small arithmetic constant folder

`RangeForStmt::begin` and `::end` are `Stmt *` pointers, so they may be a `ConstStmt` directly or a shallow `BinaryOpStmt(add/sub/mul/div, ConstStmt, ConstStmt)` - the LLVM pipeline in particular sometimes leaves inner-range bounds as `BinaryOpStmt(add, ConstStmt(0), ConstStmt(N))` because `full_simplify`'s constant-fold is not guaranteed to have run in every configuration. Rather than depending on `full_simplify`'s state, the pre-pass carries a local recursive evaluator that folds `ConstStmt` leaves and the four arithmetic `BinaryOpType`s (add, sub, mul, div-with-nonzero-rhs). Any other shape (including `UnaryOpStmt`, `LocalLoadStmt`, `GlobalLoadStmt`) returns false and the enclosing range is treated as unbounded.
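A hedged Python rendering of this folder, with tuple-encoded statements as a stand-in for the real `ConstStmt`/`BinaryOpStmt` classes:

```python
# Model of the try_eval_const_i32 idea: fold a constant leaf or a
# (possibly nested) add/sub/mul/div over constant operands, and refuse
# everything else. The tuple encoding is an illustration only.
def try_eval_const_i32(node):
    """Return (True, value) if `node` folds to an i32 constant, else (False, 0)."""
    if isinstance(node, int):                # ConstStmt leaf
        return True, node
    if isinstance(node, tuple) and len(node) == 3:
        op, lhs, rhs = node                  # BinaryOpStmt(op, lhs, rhs)
        ok_l, a = try_eval_const_i32(lhs)
        ok_r, b = try_eval_const_i32(rhs)
        if ok_l and ok_r:
            if op == "add":
                return True, a + b
            if op == "sub":
                return True, a - b
            if op == "mul":
                return True, a * b
            if op == "div" and b != 0:       # div-with-nonzero-rhs only
                return True, a // b
    return False, 0                          # loads, unary ops, anything else

# the LLVM-style shape BinaryOp(add, Const(0), Const(27)) folds to 27
assert try_eval_const_i32(("add", 0, 27)) == (True, 27)
# a runtime load (modelled here as a string) does not fold
assert try_eval_const_i32(("add", 0, "field_load")) == (False, 0)
```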
3. Handoff to the existing CFG Bellman-Ford analyzer

After the pre-pass, the existing analyzer runs on the remainder: it resolves any adstack whose push/pop pattern it can walk, and falls back to `default_ad_stack_size` for the genuine positive-loop cases the pre-pass marked unbounded.

4. Bellman-Ford skip guard in `control_flow_graph.cpp`

The existing analyzer had a latent bug that only surfaced once the pre-pass started pre-populating `max_size` values: at the end of its per-stack loop it unconditionally wrote `stack->max_size = max_size`, with `max_size` being a local Bellman-Ford variable that stayed 0 for any stack whose push/pop it did not see. That clobbered the pre-pass's results to 0, and codegen would then trip `Adaptive autodiff stack's size should have been determined`.

Fix: at the collection step in `ControlFlowGraph::determine_ad_stack_size`, skip adstacks whose `max_size` is already non-zero. They are not added to `all_stacks`, not walked by Bellman-Ford, not overwritten at the end.

This also fixes the "Unused autodiff stack should have been eliminated" warning path where Bellman-Ford would previously overwrite a non-zero value with 0 on an alloca whose pushes had been DCE'd.
Per-backend coverage matrix

- Metal/Vulkan (heap-backed SPIR-V adstack): the buffer required for the int heap dropped from 41.3 GB to well under Metal's 28 GB cap; the test that previously failed with a `Failed to allocate adstack heap int buffer` error now runs end-to-end.
- LLVM (CPU/arm64): oversized `max_size` values were cheap there, so the user-visible effect is a compile-time-only reduction in worker-stack footprint. No behavioural change expected. LLVM's inner-range bounds sometimes involve `LoopIndexStmt` rather than plain `Const`/`BinaryOp`, so the pre-pass resolves fewer adstacks on CPU than on SPIR-V; whatever it does not resolve flows through to the same Bellman-Ford fallback as before.
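To make the sizing arithmetic concrete, here is a back-of-the-envelope sketch with hypothetical per-stack counts (not the exact 563/408 breakdown of the production kernel):

```python
# The per-dispatch heap is sized as stride_slots * threads * bytes_per_slot,
# so every adstack that falls back to default_ad_stack_size=256 instead of
# a tight bound like 5 contributes 256 slots per thread to the stride.
def heap_bytes(per_stack_sizes, threads, bytes_per_slot=4):
    stride_slots = sum(per_stack_sizes)  # models ad_stack_heap_per_thread_stride
    return stride_slots * threads * bytes_per_slot

threads = 600_000
# 256 adstacks on the 256-cap fallback vs. the same stacks resolved to 5
fallback = heap_bytes([256] * 256, threads)
resolved = heap_bytes([5] * 256, threads)
assert fallback == 256 * 256 * threads * 4  # ~157 GB for these stacks alone
assert resolved == 5 * 256 * threads * 4    # ~3 GB
assert fallback // resolved == 51           # roughly a 51x reduction
```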
tests/python/test_adstack.pytest_adstack_bounded_inner_loop_not_capped_by_default_ad_stack_size(new)Scoped to
arch=[qd.vulkan, qd.metal]withdefault_ad_stack_size=1andoffline_cache=Falseon the decorator. Kernel shape:for i in x: v = x[i]; for _ in range(5): v = qd.sin(v) + 0.1; y[None] += v. Asserts the reverse-mode gradient matches the analytically deriveddv/dxwithinrel=1e-4.Expected-fail control: on the parent branch (
heap_backed_adstack, pre-this-PR), the adaptive adstack forvfalls back todefault_ad_stack_size=1via Bellman-Ford, the innerrange(5)pushes overflow that capacity on the second iteration, andcompute.grad()raisesAdstack overflow: a reverse-mode autodiff kernel pushed more elements than the adstack capacity allows. On this branch the pre-pass resolvesmax_size = 5and the gradient comes out correctly.LLVM/arm64 is deliberately excluded from the parametrization: the LLVM pipeline sometimes rewrites inner-range bounds through
LoopIndexStmt, which the pre-pass does not fold. The user-visible out-of-memory failure this PR targets only surfaces on the heap-backed SPIR-V path, so gating the test on SPIR-V backends keeps the failure-mode coverage tight without asserting a shape the LLVM path does not currently produce.test_adstack_near_capacity(updated)Rewritten to load the inner loop's trip count from a runtime field (
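The gradient identity the new test asserts can be checked in plain Python, modelling only the kernel's math (`qd` itself is not involved); the five `cos` factors are exactly the intermediate values reverse mode pops from the depth-5 adstack:

```python
import math

def forward(x, iters=5):
    # models the kernel body: v = x; repeat: v = sin(v) + 0.1
    v = x
    for _ in range(iters):
        v = math.sin(v) + 0.1
    return v

def analytic_grad(x, iters=5):
    # chain rule: each step contributes d(sin(v)+0.1)/dv = cos(v),
    # evaluated at the pre-sin value of that iteration
    v, g = x, 1.0
    for _ in range(iters):
        g *= math.cos(v)
        v = math.sin(v) + 0.1
    return g

x0 = 0.3
fd = (forward(x0 + 1e-6) - forward(x0 - 1e-6)) / 2e-6  # central difference
assert abs(fd - analytic_grad(x0)) / abs(analytic_grad(x0)) < 1e-4  # rel tol, as in the test
```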
`test_adstack_near_capacity` (updated)

Rewritten to load the inner loop's trip count from a runtime field (`n_iter_fld[None]`) instead of a Python int constant. A constant `range(n_iter)` would now be folded by the new pre-pass, which would resolve `max_size = n_iter` directly and bypass the `default_ad_stack_size = 32` cap the test is there to pin. With a runtime trip count the adstack stays in the adaptive path, Bellman-Ford falls back to the 32-cap, and the K=30 (no overflow) vs K=31 (overflow) boundary remains observable. The test still has the same 6 parametrize variants (3 arches x 2 K values) and all pass on this branch; the runtime trip count is the minimal change to keep the test's intent intact after the pre-pass lands.

Side-effect audit

- `control_flow_graph.cpp` per-stack loop end: previously wrote `stack->max_size = max_size` on every stack. Fixed by the skip-if-already-set guard. The `QD_WARN_IF(max_size == 0, ...)` warning still fires for genuinely unused ones.
- `analysis/gen_offline_cache_key.cpp`, AST-level keys: `AdStackAllocaStmt::max_size` is derived by this pass post-offload; the AST-level cache key predates its assignment and is independent. Validated indirectly by `test_adstack_bounded_inner_loop_not_capped_by_default_ad_stack_size` with `offline_cache=False` so the pre-pass runs every time.
- `max_size` inheritance under scalarization: `transforms/scalarize.cpp:702` splits tensor adstacks into scalar ones via `make_unique<AdStackAllocaStmt>(element_type, stmt->max_size)`; scalarization runs after `determine_ad_stack_size` in `compile_to_offloads.cpp`, so the scalar children inherit the pre-pass's resolved `max_size` rather than the pre-offload 0.
- `compute_bounded_adstack_size` early-return path: the `result.max_size == 0 && bounded` branch sets `max_size = 1` to match Bellman-Ford's previous non-overwrite semantics for unused stacks. Codegen never sees `max_size = 0` for a walked alloca.
- `compute_bounded_adstack_size` per-push inner loop: both `multiplier *= trip` and the outer accumulate check `std::numeric_limits<std::size_t>::max()` bounds and fall back to unbounded on overflow.
- `LoopIndexStmt`-bounded range kernels: behave like `WhileStmt` and `StructForStmt`; `try_eval_const_i32` returns false on unsupported shapes, so they fall back to `default_ad_stack_size` as before. No new failure mode on those kernels.
- `transforms/ir_printer.cpp`: `AdStackAllocaStmt::max_size` is already printed; no new fields to plumb through.
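The overflow guards mentioned in the audit follow the usual `size_t` pre-check pattern. A Python model (Python ints do not overflow, so the bound is checked explicitly against an assumed 64-bit `SIZE_MAX`):

```python
# Model of the C++ overflow guards: bail out to "unbounded" instead of
# wrapping when multiplying trip counts or accumulating per-push bounds.
SIZE_MAX = 2**64 - 1  # models std::numeric_limits<std::size_t>::max()

def checked_mul(a, b):
    """Return a*b, or None (treat as unbounded) if it would exceed SIZE_MAX."""
    if a != 0 and b > SIZE_MAX // a:
        return None
    return a * b

def checked_add(a, b):
    """Return a+b, or None (treat as unbounded) if it would exceed SIZE_MAX."""
    if b > SIZE_MAX - a:
        return None
    return a + b

assert checked_mul(1 << 40, 1 << 20) == 1 << 60
assert checked_mul(1 << 40, 1 << 40) is None   # would overflow: unbounded
assert checked_add(SIZE_MAX - 1, 1) == SIZE_MAX
assert checked_add(SIZE_MAX, 1) is None
```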
control_flow_graph.cppper-stack loop endstack->max_size = max_sizeon every stack. Fixed by the skip-if-already-set guard.QD_WARN_IF(max_size == 0, ...)warning still fires for genuinely unused ones.analysis/gen_offline_cache_key.cpp, AST-level keysAdStackAllocaStmt::max_sizeis derived by this pass post-offload; the AST-level cache key predates its assignment and is independent. Validated indirectly bytest_adstack_bounded_inner_loop_not_capped_by_default_ad_stack_sizewithoffline_cache=Falseso the pre-pass runs every time.max_sizetransforms/scalarize.cpp:702splits tensor adstacks into scalar ones viamake_unique<AdStackAllocaStmt>(element_type, stmt->max_size)determine_ad_stack_sizeincompile_to_offloads.cpp, so the scalar children inherit the pre-pass's resolvedmax_sizerather than the pre-offload 0.compute_bounded_adstack_sizeearly-return pathresult.max_size == 0 && boundedbranch setsmax_size = 1to match Bellman-Ford's previous non-overwrite semantics for unused stacks. Codegen never seesmax_size = 0for a walked alloca.compute_bounded_adstack_sizeper-push inner loopmultiplier *= tripand the outer accumulate checkstd::numeric_limits<std::size_t>::max()bounds and fall back to unbounded on overflow.LoopIndexStmt-bounded range kernelsWhileStmtandStructForStmt;try_eval_const_i32returns false on unsupported shapesdefault_ad_stack_sizeas before. No new failure mode on those kernels.transforms/ir_printer.cppAdStackAllocaStmt::max_sizeis already printed; no new fields to plumb through.