
[AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback#539

Merged
duburcqa merged 6 commits into main from duburcqa/adstack_bounded_loop_sizing
Apr 24, 2026

Conversation

duburcqa (Contributor) commented Apr 21, 2026

Structural adstack sizing: bounded inner loops skip the default_ad_stack_size fallback

One commit, no behaviour change for kernels whose adstacks the existing CFG analyzer already resolves. Adds a structural pre-pass in front of the Bellman-Ford adstack-size analyzer that recovers a precise max_size for every adstack whose push sites sit only inside statically bounded RangeForStmts. Reverse-mode kernels with hundreds of loop-carried values in a range(N) inner loop stop inheriting the coarse default_ad_stack_size cap and the SPIR-V heap-backed dispatch stride collapses by orders of magnitude.

TL;DR

qd.init(arch=qd.vulkan, ad_stack_experimental_enabled=True, default_ad_stack_size=1)

# x: 1-D f32 field, y: 0-D f32 field (declarations omitted in the PR snippet)
@qd.kernel
def compute():
    for i in x:
        v = x[i]
        for _ in range(5):                 # bounded range -> max_size resolves to 5
            v = qd.sin(v) + 0.1
        y[None] += v

compute()
compute.grad()                             # no overflow despite default_ad_stack_size=1

Before this PR, the CFG Bellman-Ford analyzer flagged every push inside the range(5) as a "positive loop" and capped the adstack at default_ad_stack_size=1, overflowing at runtime. After, the pre-pass proves the per-thread stack depth is bounded by the product of the enclosing static range trip counts and sets max_size = 5 directly; the default is only consulted for unbounded shapes (while loops, range(field_load)).

Why

The heap-backed SPIR-V adstack introduced in #493 sizes its per-dispatch StorageBuffer as ad_stack_heap_per_thread_stride * dispatched_threads. On wide-ndrange reverse-mode kernels with a few hundred local variables promoted to adstacks, every adstack whose push lives inside a range(N) inner loop previously fell back to default_ad_stack_size = 256 via Bellman-Ford's positive-loop detection. Concrete numbers on a 600 000-thread reverse-mode grid op: 256 of 563 f32 adstacks plus 64 of 408 i32/u1 adstacks landed on the 256-cap fallback, bloating ad_stack_heap_per_thread_stride_float to 132 574 slots and pushing the required buffer past 41 GB. That hard-failed Metal's maxBufferLength = 28 GB cap with Failed to allocate adstack heap int buffer (size=41305523712) before the dispatch could run.
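As a sanity check on the magnitudes above, the sizing formula can be sketched in a few lines of Python. This is only an illustration: the 4-byte f32 slot size and the 28 GB cap are assumptions lifted from the PR text, and adstack_heap_bytes is a hypothetical helper, not the engine's API.

```python
def adstack_heap_bytes(per_thread_stride_slots: int, slot_bytes: int,
                       dispatched_threads: int) -> int:
    # Per-dispatch StorageBuffer size = per-thread stride * thread count.
    return per_thread_stride_slots * slot_bytes * dispatched_threads

# With the 256-slot fallback bloating the f32 stride to 132 574 slots, a
# 600 000-thread dispatch blows well past a 28 GB maxBufferLength cap.
METAL_MAX_BUFFER_LENGTH = 28 * 1024**3
needed = adstack_heap_bytes(132_574, 4, 600_000)
assert needed > METAL_MAX_BUFFER_LENGTH
```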

Bellman-Ford's pessimism is by design: it counts pushes and pops per CFG block and flags any cycle whose net push count is positive. It has no loop-bound awareness, so a range(27) stencil sweep reads the same as a while True loop - both are "positive loop, use the configured default". The traditional workaround, raising ad_stack_size or default_ad_stack_size via qd.init(...), goes the wrong direction - the problem is that 256 is too big for these kernels, not too small.
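For intuition, the positive-loop detection can be simulated on a toy CFG. This is a simplified sketch, not the engine's ControlFlowGraph code: each block carries a net push count, and a longest-path relaxation that still changes after num_blocks rounds signals a cycle with positive net pushes.

```python
def has_positive_push_cycle(num_blocks, edges, net_push):
    """Bellman-Ford-style check on a toy CFG.

    edges: list of (u, v) block-index pairs; net_push[v]: pushes minus pops
    in block v.  If relaxation has not converged after num_blocks rounds,
    some reachable cycle has positive net pushes -> depth is unbounded.
    """
    depth = [0] * num_blocks
    for _ in range(num_blocks):
        changed = False
        for u, v in edges:
            if depth[u] + net_push[v] > depth[v]:
                depth[v] = depth[u] + net_push[v]
                changed = True
        if not changed:
            return False
    return True

# A range(5) body with one push per iteration and a back edge looks exactly
# like `while True:` to this analysis: net push +1 on a cycle.
assert has_positive_push_cycle(2, [(0, 1), (1, 0)], [0, 1]) is True
# A straight-line, push/pop-balanced kernel converges.
assert has_positive_push_cycle(2, [(0, 1)], [0, 0]) is False
```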

This PR teaches irpass::determine_ad_stack_size to resolve the bounded-loop case up front, without needing Bellman-Ford at all.

Surface API

No user-visible API changes. The existing CompileConfig::default_ad_stack_size knob still exists and still behaves the same way for the unbounded fallback path. The new logic lives entirely in irpass::determine_ad_stack_size (quadrants/transforms/determine_ad_stack_size.cpp) and runs automatically as part of every reverse-mode kernel's offload_to_executable pipeline (same call site as before).

Mechanism end-to-end

1. Structural pre-pass in determine_ad_stack_size.cpp

irpass::determine_ad_stack_size runs before the CFG Bellman-Ford analyzer. For each AdStackAllocaStmt with max_size == 0 (adaptive):

  1. Gather every AdStackPushStmt whose stack == alloca via irpass::analysis::gather_statements.
  2. For each push, walk the IR parent chain starting at push->parent->parent_stmt() and moving outward.
  3. At each hop:
    • RangeForStmt -> call try_eval_const_i32(begin) and try_eval_const_i32(end). If both fold, multiply multiplier *= (end - begin). If either fails to fold, mark the alloca as unbounded and stop.
    • StructForStmt, WhileStmt -> mark unbounded and stop (compile-time trip count is not available).
    • Any other container (IfStmt, MeshForStmt body, etc.) -> pass through without affecting the multiplier. These do not iterate, so one enclosing execution = one push.
  4. Sum the per-push multipliers across all push sites -> safe upper bound on the concurrent stack depth. Assign to alloca->max_size.
  5. If a push reaches the alloca through an unbounded enclosure, leave max_size = 0 so Bellman-Ford can still try.

The bound is pessimistic with respect to mutually exclusive if-branches (summing pushes across both arms rather than taking the max), which is safe for heap sizing: mutually exclusive pushes waste slots but never under-allocate. Overflow checks guard both the multiply and the per-push accumulate before writing the final value.
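The walk above can be sketched as a small Python simulation. This is a hedged toy model of the pre-pass, not the C++ implementation: each push site is represented by its chain of enclosing containers walked outward, with a range given as ('range', begin, end) (None for a bound that failed to constant-fold) and the other containers as plain strings.

```python
def per_push_multiplier(enclosures):
    """Trip-count product over one push's parent chain, or None if unbounded."""
    multiplier = 1
    for node in enclosures:
        if node in ('while', 'struct_for'):
            return None                    # no compile-time trip count
        if node == 'if':
            continue                       # does not iterate: pass through
        kind, begin, end = node            # ('range', begin, end)
        if begin is None or end is None:   # bound failed to constant-fold
            return None
        multiplier *= end - begin
    return multiplier

def bounded_max_size(push_sites):
    """Sum per-push multipliers across all push sites; None means
    'leave max_size = 0 and let Bellman-Ford try'."""
    total = 0
    for enclosures in push_sites:
        m = per_push_multiplier(enclosures)
        if m is None:
            return None
        total += m
    return total

# TL;DR kernel: one push inside range(5) -> max_size resolves to 5.
assert bounded_max_size([[('range', 0, 5)]]) == 5
# Pushes in mutually exclusive if-arms are summed (pessimistic but safe).
assert bounded_max_size([['if', ('range', 0, 5)],
                         ['if', ('range', 0, 5)]]) == 10
# Any while in the chain defers the whole alloca to Bellman-Ford.
assert bounded_max_size([[('range', 0, 5)], ['while']]) is None
```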

2. try_eval_const_i32 - small arithmetic constant folder

RangeForStmt::begin and ::end are Stmt * pointers, so they may be ConstStmt directly or a shallow BinaryOpStmt(add/sub/mul/div, ConstStmt, ConstStmt) - the LLVM pipeline in particular sometimes leaves inner-range bounds as BinaryOpStmt(add, ConstStmt(0), ConstStmt(N)) because full_simplify's constant-fold is not guaranteed to have run in every configuration. Rather than depending on full_simplify's state, the pre-pass carries a local recursive evaluator that folds ConstStmt leaves and the four arithmetic BinaryOpTypes (add, sub, mul, div-with-nonzero-rhs). Any other shape (including UnaryOpStmt, LocalLoadStmt, GlobalLoadStmt) returns false and the enclosing range is treated as unbounded.
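The shape of that evaluator can be mirrored in Python with a tuple mini-IR: ('const', n) for a ConstStmt and (op, lhs, rhs) for a BinaryOpStmt. This is an illustrative sketch under those assumptions, not the actual C++ helper.

```python
def try_eval_const_i32(stmt):
    """Fold ('const', n) leaves and shallow add/sub/mul/div trees.

    Any other shape (loads, unary ops, ...) returns None, meaning
    'not a compile-time constant' -> the enclosing range is unbounded.
    """
    if not isinstance(stmt, tuple):
        return None
    op = stmt[0]
    if op == 'const':
        return stmt[1]
    if op in ('add', 'sub', 'mul', 'div'):
        lhs = try_eval_const_i32(stmt[1])
        rhs = try_eval_const_i32(stmt[2])
        if lhs is None or rhs is None or (op == 'div' and rhs == 0):
            return None
        if op == 'add':
            return lhs + rhs
        if op == 'sub':
            return lhs - rhs
        if op == 'mul':
            return lhs * rhs
        return int(lhs / rhs)   # truncate toward zero, like C++ i32 division
    return None                 # LocalLoadStmt, GlobalLoadStmt, UnaryOpStmt, ...

# The LLVM pipeline sometimes leaves a bound as add(const 0, const N):
assert try_eval_const_i32(('add', ('const', 0), ('const', 5))) == 5
# A load is not foldable, so the enclosing range is treated as unbounded:
assert try_eval_const_i32(('load', 'n_iter')) is None
```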

3. Handoff to the existing CFG Bellman-Ford analyzer

After the pre-pass, the existing analyzer runs on the remainder:

auto cfg = analysis::build_cfg(root);
cfg->simplify_graph();
cfg->determine_ad_stack_size(config.default_ad_stack_size);

The analyzer resolves any adstack whose push/pop pattern it can walk, and falls back to default_ad_stack_size for the genuine positive-loop cases the pre-pass marked unbounded.

4. Bellman-Ford skip guard in control_flow_graph.cpp

The existing analyzer had a latent bug that only surfaced once the pre-pass started pre-populating max_size values: at the end of its per-stack loop it unconditionally wrote stack->max_size = max_size, with max_size being a local Bellman-Ford variable that stayed 0 for any stack whose push/pop it did not see. That clobbered the pre-pass's results to 0, and codegen would then trip Adaptive autodiff stack's size should have been determined.

Fix: at the collection step in ControlFlowGraph::determine_ad_stack_size, skip adstacks whose max_size is already non-zero. They are not added to all_stacks, not walked by Bellman-Ford, not overwritten at the end.

if (stack->max_size != 0) {
  continue;
}
all_stacks.insert(stack);

This also fixes the "Unused autodiff stack should have been eliminated" warning path where Bellman-Ford would previously overwrite a non-zero value with 0 on an alloca whose pushes had been DCE'd.
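The effect of the guard can be modelled with a minimal Python stand-in for the collection step (hypothetical names; the real code operates on AdStackAllocaStmt pointers, not dicts):

```python
def collect_adaptive_stacks(stacks):
    """Mirror of the fixed collection step: stacks whose max_size the
    pre-pass already resolved are never entered into all_stacks, so
    Bellman-Ford can neither walk them nor clobber their size at the end."""
    all_stacks = set()
    for stack in stacks:
        if stack['max_size'] != 0:
            continue              # pre-resolved by the structural pre-pass
        all_stacks.add(stack['name'])
    return all_stacks

stacks = [{'name': 'v_stack', 'max_size': 5},   # resolved by the pre-pass
          {'name': 'w_stack', 'max_size': 0}]   # still adaptive
assert collect_adaptive_stacks(stacks) == {'w_stack'}
```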

Per-backend coverage matrix

  • Vulkan / Metal (SPIR-V): the heap-backed adstack stride collapses by whatever factor the pre-pass recovers. On the motivating 600 000-thread reverse-mode grid op, the required int heap dropped from 41.3 GB to well under Metal's 28 GB cap; the test that previously failed with a Failed to allocate adstack heap int buffer error now runs end-to-end.
  • CPU (LLVM): function-scope allocas are stack-allocated per worker thread; over-sized max_size values were cheap there, so the user-visible effect is a compile-time-only reduction in worker-stack footprint. No behavioural change expected. LLVM's inner-range bounds sometimes involve LoopIndexStmt rather than plain Const/BinaryOp, so the pre-pass resolves fewer adstacks on CPU than on SPIR-V; whatever it does not resolve flows through to the same Bellman-Ford fallback as before.
  • CUDA / AMDGPU (LLVM): same as CPU. Per-thread GPU local memory is sized by the driver, not by the heap allocator, so this PR has no measurable impact on those backends in isolation.

Tests - tests/python/test_adstack.py

test_adstack_bounded_inner_loop_not_capped_by_default_ad_stack_size (new)

Scoped to arch=[qd.vulkan, qd.metal] with default_ad_stack_size=1 and offline_cache=False on the decorator. Kernel shape: for i in x: v = x[i]; for _ in range(5): v = qd.sin(v) + 0.1; y[None] += v. Asserts the reverse-mode gradient matches the analytically derived dv/dx within rel=1e-4.

Expected-fail control: on the parent branch (heap_backed_adstack, pre-this-PR), the adaptive adstack for v falls back to default_ad_stack_size=1 via Bellman-Ford, the inner range(5) pushes overflow that capacity on the second iteration, and compute.grad() raises Adstack overflow: a reverse-mode autodiff kernel pushed more elements than the adstack capacity allows. On this branch the pre-pass resolves max_size = 5 and the gradient comes out correctly.

LLVM/arm64 is deliberately excluded from the parametrization: the LLVM pipeline sometimes rewrites inner-range bounds through LoopIndexStmt, which the pre-pass does not fold. The user-visible out-of-memory failure this PR targets only surfaces on the heap-backed SPIR-V path, so gating the test on SPIR-V backends keeps the failure-mode coverage tight without asserting a shape the LLVM path does not currently produce.

test_adstack_near_capacity (updated)

Rewritten to load the inner loop's trip count from a runtime field (n_iter_fld[None]) instead of a Python int constant. A constant range(n_iter) would now be folded by the new pre-pass, which would resolve max_size = n_iter directly and bypass the default_ad_stack_size = 32 cap the test is there to pin. With a runtime trip count the adstack stays in the adaptive path, Bellman-Ford falls back to the 32-cap, and the K=30 (no overflow) vs K=31 (overflow) boundary remains observable. Test still has the same 6 parametrize variants (3 arches x 2 K values) and all pass on this branch; runtime trip count is the minimal change to keep the test's intent intact after the pre-pass lands.

Side-effect audit

  • Bellman-Ford silently clobbering pre-pass results (checked: control_flow_graph.cpp per-stack loop end). Was broken: Bellman-Ford unconditionally wrote stack->max_size = max_size on every stack. Fixed by the skip-if-already-set guard.
  • "Unused autodiff stack" warning path (checked: same per-stack loop). Now a no-op for pre-resolved stacks; the QD_WARN_IF(max_size == 0, ...) warning still fires for genuinely unused ones.
  • Offline cache key (checked: analysis/gen_offline_cache_key.cpp, AST-level keys). Not affected: AdStackAllocaStmt::max_size is derived by this pass post-offload; the AST-level cache key predates its assignment and is independent. Validated indirectly by test_adstack_bounded_inner_loop_not_capped_by_default_ad_stack_size with offline_cache=False, so the pre-pass runs every time.
  • Scalarize pass inheriting max_size (checked: transforms/scalarize.cpp:702, which splits tensor adstacks into scalar ones via make_unique<AdStackAllocaStmt>(element_type, stmt->max_size)). Scalarize runs after determine_ad_stack_size in compile_to_offloads.cpp, so the scalar children inherit the pre-pass's resolved max_size rather than the pre-offload 0.
  • Stacks with no pushes, the "unused" case (checked: compute_bounded_adstack_size early-return path). An explicit result.max_size == 0 && bounded branch sets max_size = 1 to match Bellman-Ford's previous non-overwrite semantics for unused stacks. Codegen never sees max_size = 0 for a walked alloca.
  • Overflow when multiplying trip counts (checked: compute_bounded_adstack_size per-push inner loop). Both multiplier *= trip and the outer accumulate check std::numeric_limits<std::size_t>::max() bounds and fall back to unbounded on overflow.
  • While-loop / LoopIndexStmt-bounded range kernels (checked: pre-pass early-return on WhileStmt and StructForStmt; try_eval_const_i32 returns false on unsupported shapes). Handed off to Bellman-Ford unchanged, falling back to default_ad_stack_size as before. No new failure mode on those kernels.
  • IR printer / dumper (checked: transforms/ir_printer.cpp). AdStackAllocaStmt::max_size is already printed; no new fields to plumb through.
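The overflow guard in the audit above amounts to checking the multiply against the size_t maximum before performing it. A minimal Python sketch (sys.maxsize standing in for std::numeric_limits<std::size_t>::max(); checked_mul is a hypothetical name):

```python
import sys

UNBOUNDED = None
LIMIT = sys.maxsize   # stand-in for std::numeric_limits<size_t>::max()

def checked_mul(acc, trip):
    """Guarded multiplier update: bail out to 'unbounded' instead of
    wrapping, mirroring the pre-pass's fallback on overflow."""
    if trip != 0 and acc > LIMIT // trip:
        return UNBOUNDED
    return acc * trip

assert checked_mul(1_000, 27) == 27_000
assert checked_mul(LIMIT, 2) is UNBOUNDED
```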


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ee743e91b5


Comment thread quadrants/transforms/determine_ad_stack_size.cpp Outdated

@claude claude Bot left a comment


This PR adds a structural pre-pass and guards the Bellman-Ford analyzer against clobbering pre-resolved stack sizes: non-trivial autodiff compiler logic that warrants a human expert reviewing the IR parent-chain walk assumptions and the correctness of the pre-pass/Bellman-Ford interaction.


Overview

The PR touches two files in the autodiff pipeline: a new structural pre-pass, and a guard that keeps the Bellman-Ford analyzer from clobbering pre-resolved stacks. A test file is updated to keep an existing test on the Bellman-Ford fallback path, and a new regression test is added for SPIR-V backends.

Security risks

None. This is a compiler optimization pass with no security-sensitive surfaces.

Level of scrutiny

The change is correctness-critical: an off-by-one in the pre-pass size bound, or a misclassification of a parent container statement, could silently under-allocate an adstack and produce wrong gradients or runtime overflow crashes. The key assumption is that after offloading the outer parallel loop no longer appears as a bounded RangeForStmt in the IR parent chain, so the walk correctly accumulates only the inner bounded-range multipliers. This assumption is backend-specific, and the PR notes that the LLVM path already differs (the LoopIndexStmt rewrite). A reviewer familiar with the IR lowering pipeline and the CFG-analysis interaction should verify this assumption holds across all targeted backends and pipeline configurations.

Other factors

No bugs were found by the automated hunting system. The change is well-motivated (production Metal OOM), well-commented, includes overflow guards, and the test suite is carefully updated. The rewrite to use a runtime field is thoughtful — it preserves the test's purpose of pinning the Bellman-Ford fallback knob. Given the algorithmic complexity and backend-specific IR structure assumptions, this merits a human reviewer with autodiff/CFG expertise.

@hughperkins
Copy link
Copy Markdown
Collaborator

Checklist:

  • doesn't affect user front-end usage, just bug-fixes
    • => no doc changes needed
  • most changes localized to new 'determine_ad_stack_size.cpp' file, rather than modifying existing code 🙌

=> ok to merge

Base automatically changed from duburcqa/heap_backed_adstack to main April 24, 2026 11:08
@duburcqa duburcqa merged commit aeea5f9 into main Apr 24, 2026
46 of 47 checks passed
@duburcqa duburcqa deleted the duburcqa/adstack_bounded_loop_sizing branch April 24, 2026 13:52