
[AutoDiff] Autodiff 14: Resolve bounded-inner-loop adstacks without default_ad_stack_size fallback#539

Merged
duburcqa merged 6 commits into main from duburcqa/adstack_bounded_loop_sizing
Apr 24, 2026

Conversation

duburcqa (Contributor) commented Apr 21, 2026

Structural adstack sizing: bounded inner loops skip the default_ad_stack_size fallback

One commit, no behaviour change for kernels whose adstacks the existing CFG analyzer already resolves. Adds a structural pre-pass in front of the Bellman-Ford adstack-size analyzer that recovers a precise max_size for every adstack whose push sites sit only inside statically bounded RangeForStmts. Reverse-mode kernels with hundreds of loop-carried values in a range(N) inner loop stop inheriting the coarse default_ad_stack_size cap and the SPIR-V heap-backed dispatch stride collapses by orders of magnitude.

TL;DR

qd.init(arch=qd.vulkan, ad_stack_experimental_enabled=True, default_ad_stack_size=1)

# x: 1-D f32 field, y: 0-D f32 field (declarations omitted in the PR snippet)
@qd.kernel
def compute():
    for i in x:
        v = x[i]
        for _ in range(5):                 # bounded range -> max_size resolves to 5
            v = qd.sin(v) + 0.1
        y[None] += v

compute()
compute.grad()                             # no overflow despite default_ad_stack_size=1

Before this PR, the CFG Bellman-Ford analyzer flagged every push inside the range(5) as a "positive loop" and capped the adstack at default_ad_stack_size=1, overflowing at runtime. After, the pre-pass proves the per-thread stack depth is bounded by the product of the enclosing static range trip counts and sets max_size = 5 directly; the default is only consulted for unbounded shapes (while loops, range(field_load)).

Why

The heap-backed SPIR-V adstack introduced in #493 sizes its per-dispatch StorageBuffer as ad_stack_heap_per_thread_stride * dispatched_threads. On wide-ndrange reverse-mode kernels with a few hundred local variables promoted to adstacks, every adstack whose push lives inside a range(N) inner loop previously fell back to default_ad_stack_size = 256 via Bellman-Ford's positive-loop detection. Concrete numbers on a 600 000-thread reverse-mode grid op: 256 of 563 f32 adstacks plus 64 of 408 i32/u1 adstacks landed on the 256-cap fallback, bloating ad_stack_heap_per_thread_stride_float to 132 574 slots and pushing the required buffer past 41 GB. That hard-failed Metal's maxBufferLength = 28 GB cap with Failed to allocate adstack heap int buffer (size=41305523712) before the dispatch could run.
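As a sanity check on the magnitudes above, the sizing formula can be sketched in a few lines of Python. This is only an illustration: the 4-byte f32 slot size and the 28 GB cap are assumptions lifted from the PR text, and adstack_heap_bytes is a hypothetical helper, not the engine's API.

```python
def adstack_heap_bytes(per_thread_stride_slots: int, slot_bytes: int,
                       dispatched_threads: int) -> int:
    # Per-dispatch StorageBuffer size = per-thread stride * thread count.
    return per_thread_stride_slots * slot_bytes * dispatched_threads

# With the 256-slot fallback bloating the f32 stride to 132 574 slots, a
# 600 000-thread dispatch blows well past a 28 GB maxBufferLength cap.
METAL_MAX_BUFFER_LENGTH = 28 * 1024**3
needed = adstack_heap_bytes(132_574, 4, 600_000)
assert needed > METAL_MAX_BUFFER_LENGTH
```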

Bellman-Ford's pessimism is by design: it counts pushes and pops per CFG block and flags any cycle whose net push count is positive. It has no loop-bound awareness, so a range(27) stencil sweep reads the same as a while True loop - both are "positive loop, use the configured default". The traditional workaround, raising ad_stack_size or default_ad_stack_size via qd.init(...), goes the wrong direction - the problem is that 256 is too big for these kernels, not too small.
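For intuition, the positive-loop detection can be simulated on a toy CFG. This is a simplified sketch, not the engine's ControlFlowGraph code: each block carries a net push count, and a longest-path relaxation that still changes after num_blocks rounds signals a cycle with positive net pushes.

```python
def has_positive_push_cycle(num_blocks, edges, net_push):
    """Bellman-Ford-style check on a toy CFG.

    edges: list of (u, v) block-index pairs; net_push[v]: pushes minus pops
    in block v.  If relaxation has not converged after num_blocks rounds,
    some reachable cycle has positive net pushes -> depth is unbounded.
    """
    depth = [0] * num_blocks
    for _ in range(num_blocks):
        changed = False
        for u, v in edges:
            if depth[u] + net_push[v] > depth[v]:
                depth[v] = depth[u] + net_push[v]
                changed = True
        if not changed:
            return False
    return True

# A range(5) body with one push per iteration and a back edge looks exactly
# like `while True:` to this analysis: net push +1 on a cycle.
assert has_positive_push_cycle(2, [(0, 1), (1, 0)], [0, 1]) is True
# A straight-line, push/pop-balanced kernel converges.
assert has_positive_push_cycle(2, [(0, 1)], [0, 0]) is False
```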

This PR teaches irpass::determine_ad_stack_size to resolve the bounded-loop case up front, without needing Bellman-Ford at all.

Surface API

No user-visible API changes. The existing CompileConfig::default_ad_stack_size knob still exists and still behaves the same way for the unbounded fallback path. The new logic lives entirely in irpass::determine_ad_stack_size (quadrants/transforms/determine_ad_stack_size.cpp) and runs automatically as part of every reverse-mode kernel's offload_to_executable pipeline (same call site as before).

Mechanism end-to-end

1. Structural pre-pass in determine_ad_stack_size.cpp

irpass::determine_ad_stack_size runs before the CFG Bellman-Ford analyzer. For each AdStackAllocaStmt with max_size == 0 (adaptive):

  1. Gather every AdStackPushStmt whose stack == alloca via irpass::analysis::gather_statements.
  2. For each push, walk the IR parent chain starting at push->parent->parent_stmt() and moving outward.
  3. At each hop:
    • RangeForStmt -> call try_eval_const_i32(begin) and try_eval_const_i32(end). If both fold, multiply multiplier *= (end - begin). If either fails to fold, mark the alloca as unbounded and stop.
    • StructForStmt, WhileStmt -> mark unbounded and stop (compile-time trip count is not available).
    • Any other container (IfStmt, MeshForStmt body, etc.) -> pass through without affecting the multiplier. These do not iterate, so one enclosing execution = one push.
  4. Sum the per-push multipliers across all push sites -> safe upper bound on the concurrent stack depth. Assign to alloca->max_size.
  5. If a push reaches the alloca through an unbounded enclosure, leave max_size = 0 so Bellman-Ford can still try.

The bound is pessimistic with respect to mutually exclusive if-branches (summing pushes across both arms rather than taking the max), which is safe for heap sizing: mutually exclusive pushes waste slots but never under-allocate. Overflow checks guard both the multiply and the per-push accumulate before writing the final value.
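The walk above can be sketched as a small Python simulation. This is a hedged toy model of the pre-pass, not the C++ implementation: each push site is represented by its chain of enclosing containers walked outward, with a range given as ('range', begin, end) (None for a bound that failed to constant-fold) and the other containers as plain strings.

```python
def per_push_multiplier(enclosures):
    """Trip-count product over one push's parent chain, or None if unbounded."""
    multiplier = 1
    for node in enclosures:
        if node in ('while', 'struct_for'):
            return None                    # no compile-time trip count
        if node == 'if':
            continue                       # does not iterate: pass through
        kind, begin, end = node            # ('range', begin, end)
        if begin is None or end is None:   # bound failed to constant-fold
            return None
        multiplier *= end - begin
    return multiplier

def bounded_max_size(push_sites):
    """Sum per-push multipliers across all push sites; None means
    'leave max_size = 0 and let Bellman-Ford try'."""
    total = 0
    for enclosures in push_sites:
        m = per_push_multiplier(enclosures)
        if m is None:
            return None
        total += m
    return total

# TL;DR kernel: one push inside range(5) -> max_size resolves to 5.
assert bounded_max_size([[('range', 0, 5)]]) == 5
# Pushes in mutually exclusive if-arms are summed (pessimistic but safe).
assert bounded_max_size([['if', ('range', 0, 5)],
                         ['if', ('range', 0, 5)]]) == 10
# Any while in the chain defers the whole alloca to Bellman-Ford.
assert bounded_max_size([[('range', 0, 5)], ['while']]) is None
```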

2. try_eval_const_i32 - small arithmetic constant folder

RangeForStmt::begin and ::end are Stmt * pointers, so they may be ConstStmt directly or a shallow BinaryOpStmt(add/sub/mul/div, ConstStmt, ConstStmt) - the LLVM pipeline in particular sometimes leaves inner-range bounds as BinaryOpStmt(add, ConstStmt(0), ConstStmt(N)) because full_simplify's constant-fold is not guaranteed to have run in every configuration. Rather than depending on full_simplify's state, the pre-pass carries a local recursive evaluator that folds ConstStmt leaves and the four arithmetic BinaryOpTypes (add, sub, mul, div-with-nonzero-rhs). Any other shape (including UnaryOpStmt, LocalLoadStmt, GlobalLoadStmt) returns false and the enclosing range is treated as unbounded.
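The shape of that evaluator can be mirrored in Python with a tuple mini-IR: ('const', n) for a ConstStmt and (op, lhs, rhs) for a BinaryOpStmt. This is an illustrative sketch under those assumptions, not the actual C++ helper.

```python
def try_eval_const_i32(stmt):
    """Fold ('const', n) leaves and shallow add/sub/mul/div trees.

    Any other shape (loads, unary ops, ...) returns None, meaning
    'not a compile-time constant' -> the enclosing range is unbounded.
    """
    if not isinstance(stmt, tuple):
        return None
    op = stmt[0]
    if op == 'const':
        return stmt[1]
    if op in ('add', 'sub', 'mul', 'div'):
        lhs = try_eval_const_i32(stmt[1])
        rhs = try_eval_const_i32(stmt[2])
        if lhs is None or rhs is None or (op == 'div' and rhs == 0):
            return None
        if op == 'add':
            return lhs + rhs
        if op == 'sub':
            return lhs - rhs
        if op == 'mul':
            return lhs * rhs
        return int(lhs / rhs)   # truncate toward zero, like C++ i32 division
    return None                 # LocalLoadStmt, GlobalLoadStmt, UnaryOpStmt, ...

# The LLVM pipeline sometimes leaves a bound as add(const 0, const N):
assert try_eval_const_i32(('add', ('const', 0), ('const', 5))) == 5
# A load is not foldable, so the enclosing range is treated as unbounded:
assert try_eval_const_i32(('load', 'n_iter')) is None
```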

3. Handoff to the existing CFG Bellman-Ford analyzer

After the pre-pass, the existing analyzer runs on the remainder:

auto cfg = analysis::build_cfg(root);
cfg->simplify_graph();
cfg->determine_ad_stack_size(config.default_ad_stack_size);

The analyzer resolves any adstack whose push/pop pattern it can walk, and falls back to default_ad_stack_size for the genuine positive-loop cases the pre-pass marked unbounded.

4. Bellman-Ford skip guard in control_flow_graph.cpp

The existing analyzer had a latent bug that only surfaced once the pre-pass started pre-populating max_size values: at the end of its per-stack loop it unconditionally wrote stack->max_size = max_size, with max_size being a local Bellman-Ford variable that stayed 0 for any stack whose push/pop it did not see. That clobbered the pre-pass's results to 0, and codegen would then trip Adaptive autodiff stack's size should have been determined.

Fix: at the collection step in ControlFlowGraph::determine_ad_stack_size, skip adstacks whose max_size is already non-zero. They are not added to all_stacks, not walked by Bellman-Ford, not overwritten at the end.

if (stack->max_size != 0) {
  continue;
}
all_stacks.insert(stack);

This also fixes the "Unused autodiff stack should have been eliminated" warning path where Bellman-Ford would previously overwrite a non-zero value with 0 on an alloca whose pushes had been DCE'd.
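The effect of the guard can be modelled with a minimal Python stand-in for the collection step (hypothetical names; the real code operates on AdStackAllocaStmt pointers, not dicts):

```python
def collect_adaptive_stacks(stacks):
    """Mirror of the fixed collection step: stacks whose max_size the
    pre-pass already resolved are never entered into all_stacks, so
    Bellman-Ford can neither walk them nor clobber their size at the end."""
    all_stacks = set()
    for stack in stacks:
        if stack['max_size'] != 0:
            continue              # pre-resolved by the structural pre-pass
        all_stacks.add(stack['name'])
    return all_stacks

stacks = [{'name': 'v_stack', 'max_size': 5},   # resolved by the pre-pass
          {'name': 'w_stack', 'max_size': 0}]   # still adaptive
assert collect_adaptive_stacks(stacks) == {'w_stack'}
```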

Per-backend coverage matrix

  • Vulkan / Metal (SPIR-V): the heap-backed adstack stride collapses by whatever factor the pre-pass recovers. On the motivating 600 000-thread reverse-mode grid op, the required int heap dropped from 41.3 GB to well under Metal's 28 GB cap; the test that previously failed with a Failed to allocate adstack heap int buffer error now runs end-to-end.
  • CPU (LLVM): function-scope allocas are stack-allocated per worker thread; over-sized max_size values were cheap there, so the user-visible effect is a compile-time-only reduction in worker-stack footprint. No behavioural change expected. LLVM's inner-range bounds sometimes involve LoopIndexStmt rather than plain Const/BinaryOp, so the pre-pass resolves fewer adstacks on CPU than on SPIR-V; whatever it does not resolve flows through to the same Bellman-Ford fallback as before.
  • CUDA / AMDGPU (LLVM): same as CPU. Per-thread GPU local memory is sized by the driver, not by the heap allocator, so this PR has no measurable impact on those backends in isolation.

Tests - tests/python/test_adstack.py

test_adstack_bounded_inner_loop_not_capped_by_default_ad_stack_size (new)

Scoped to arch=[qd.vulkan, qd.metal] with default_ad_stack_size=1 and offline_cache=False on the decorator. Kernel shape: for i in x: v = x[i]; for _ in range(5): v = qd.sin(v) + 0.1; y[None] += v. Asserts the reverse-mode gradient matches the analytically derived dv/dx within rel=1e-4.

Expected-fail control: on the parent branch (heap_backed_adstack, pre-this-PR), the adaptive adstack for v falls back to default_ad_stack_size=1 via Bellman-Ford, the inner range(5) pushes overflow that capacity on the second iteration, and compute.grad() raises Adstack overflow: a reverse-mode autodiff kernel pushed more elements than the adstack capacity allows. On this branch the pre-pass resolves max_size = 5 and the gradient comes out correctly.

LLVM/arm64 is deliberately excluded from the parametrization: the LLVM pipeline sometimes rewrites inner-range bounds through LoopIndexStmt, which the pre-pass does not fold. The user-visible out-of-memory failure this PR targets only surfaces on the heap-backed SPIR-V path, so gating the test on SPIR-V backends keeps the failure-mode coverage tight without asserting a shape the LLVM path does not currently produce.

test_adstack_near_capacity (updated)

Rewritten to load the inner loop's trip count from a runtime field (n_iter_fld[None]) instead of a Python int constant. A constant range(n_iter) would now be folded by the new pre-pass, which would resolve max_size = n_iter directly and bypass the default_ad_stack_size = 32 cap the test is there to pin. With a runtime trip count the adstack stays in the adaptive path, Bellman-Ford falls back to the 32-cap, and the K=30 (no overflow) vs K=31 (overflow) boundary remains observable. Test still has the same 6 parametrize variants (3 arches x 2 K values) and all pass on this branch; runtime trip count is the minimal change to keep the test's intent intact after the pre-pass lands.

Side-effect audit

  • Bellman-Ford silently clobbering pre-pass results (checked: control_flow_graph.cpp per-stack loop end). Was broken: Bellman-Ford unconditionally wrote stack->max_size = max_size on every stack. Fixed by the skip-if-already-set guard.
  • "Unused autodiff stack" warning path (checked: same per-stack loop). Now a no-op for pre-resolved stacks; the QD_WARN_IF(max_size == 0, ...) warning still fires for genuinely unused ones.
  • Offline cache key (checked: analysis/gen_offline_cache_key.cpp, AST-level keys). Not affected: AdStackAllocaStmt::max_size is derived by this pass post-offload; the AST-level cache key predates its assignment and is independent. Validated indirectly by test_adstack_bounded_inner_loop_not_capped_by_default_ad_stack_size with offline_cache=False, so the pre-pass runs every time.
  • Scalarize pass inheriting max_size (checked: transforms/scalarize.cpp:702, which splits tensor adstacks into scalar ones via make_unique<AdStackAllocaStmt>(element_type, stmt->max_size)). Scalarize runs after determine_ad_stack_size in compile_to_offloads.cpp, so the scalar children inherit the pre-pass's resolved max_size rather than the pre-offload 0.
  • Stacks with no pushes, the "unused" case (checked: compute_bounded_adstack_size early-return path). An explicit result.max_size == 0 && bounded branch sets max_size = 1 to match Bellman-Ford's previous non-overwrite semantics for unused stacks. Codegen never sees max_size = 0 for a walked alloca.
  • Overflow when multiplying trip counts (checked: compute_bounded_adstack_size per-push inner loop). Both multiplier *= trip and the outer accumulate check std::numeric_limits<std::size_t>::max() bounds and fall back to unbounded on overflow.
  • While-loop / LoopIndexStmt-bounded range kernels (checked: pre-pass early-return on WhileStmt and StructForStmt; try_eval_const_i32 returns false on unsupported shapes). Handed off to Bellman-Ford unchanged, falling back to default_ad_stack_size as before. No new failure mode on those kernels.
  • IR printer / dumper (checked: transforms/ir_printer.cpp). AdStackAllocaStmt::max_size is already printed; no new fields to plumb through.
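The overflow guard in the audit above amounts to checking the multiply against the size_t maximum before performing it. A minimal Python sketch (sys.maxsize standing in for std::numeric_limits<std::size_t>::max(); checked_mul is a hypothetical name):

```python
import sys

UNBOUNDED = None
LIMIT = sys.maxsize   # stand-in for std::numeric_limits<size_t>::max()

def checked_mul(acc, trip):
    """Guarded multiplier update: bail out to 'unbounded' instead of
    wrapping, mirroring the pre-pass's fallback on overflow."""
    if trip != 0 and acc > LIMIT // trip:
        return UNBOUNDED
    return acc * trip

assert checked_mul(1_000, 27) == 27_000
assert checked_mul(LIMIT, 2) is UNBOUNDED
```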


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ee743e91b5


Comment thread quadrants/transforms/determine_ad_stack_size.cpp Outdated

@claude claude Bot left a comment


This PR adds a structural pre-pass and guards the Bellman-Ford analyzer against clobbering pre-resolved stack sizes: non-trivial autodiff compiler logic that warrants a human expert reviewing the IR parent-chain walk assumptions and the correctness of the pre-pass/Bellman-Ford interaction.


Overview

The PR touches two files in the autodiff pipeline: a new structural pre-pass, and a guard that keeps the Bellman-Ford analyzer from clobbering pre-resolved stacks. A test file is updated to keep an existing test on the Bellman-Ford fallback path, and a new regression test is added for SPIR-V backends.

Security risks

None. This is a compiler optimization pass with no security-sensitive surfaces.

Level of scrutiny

The change is correctness-critical: an off-by-one in the pre-pass size bound, or a misclassification of a parent container statement, could silently under-allocate an adstack and produce wrong gradients or runtime overflow crashes. The key assumption is that after offloading the outer parallel loop no longer appears as a bounded RangeForStmt in the IR parent chain, so the walk correctly accumulates only the inner bounded-range multipliers. This assumption is backend-specific, and the PR notes that the LLVM path already differs (the LoopIndexStmt rewrite). A reviewer familiar with the IR lowering pipeline and the CFG-analysis interaction should verify this assumption holds across all targeted backends and pipeline configurations.

Other factors

No bugs were found by the automated hunting system. The change is well-motivated (production Metal OOM), well-commented, includes overflow guards, and the test suite is carefully updated. The rewrite to use a runtime field is thoughtful — it preserves the test's purpose of pinning the Bellman-Ford fallback knob. Given the algorithmic complexity and backend-specific IR structure assumptions, this merits a human reviewer with autodiff/CFG expertise.

@hughperkins
Copy link
Copy Markdown
Collaborator

Checklist:

  • doesn't affect user front-end usage, just bug-fixes
    • => no doc changes needed
  • most changes localized to new 'determine_ad_stack_size.cpp' file, rather than modifying existing code 🙌

=> ok to merge

Base automatically changed from duburcqa/heap_backed_adstack to main April 24, 2026 11:08
@duburcqa duburcqa merged commit aeea5f9 into main Apr 24, 2026
46 of 47 checks passed
@duburcqa duburcqa deleted the duburcqa/adstack_bounded_loop_sizing branch April 24, 2026 13:52