Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bc3bf3f5bb
… drop exhaustive arch lists

With the AMDGPU device-context fix from PR #550 in the stack, the runtime-eval sizer path now runs cleanly on every backend. Two tests introduced in this PR (`bounded_inner_loop_sized_by_structural_prepass` and `spirv_metadata_per_task_buffer`) get their arch lists extended to include AMDGPU, and the few `[qd.cpu, qd.cuda, qd.amdgpu, qd.metal, qd.vulkan]` lists already inherited from the rebase (after AMDGPU landed in PR #550 and Metal/Vulkan landed in `2d9f6328d`) collapse to the default (no `arch=` arg), since that enumerates every supported backend anyway.

Also drop the redundant `ad_stack_experimental_enabled=True` on `spirv_metadata_per_task_buffer`: `test_utils.test` sets that option unconditionally whenever `qd.extension.adstack` is in `require`, and carrying it on the decorator confuses readers into thinking the test needs it specifically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Adstack overflow does not exist on LLVM backends for kernels the pre-pass can bound (every shape in the grammar listed under "Under the hood: adstack capacity and memory"). On SPIR-V backends the compile-time-sized capacity can still be exceeded at runtime; overflow is reported asynchronously as a Python-level exception: the offending kernel writes to a host-polled SSBO flag during execution, and the next `qd.sync()` (explicit, or implicit via a host read like `to_numpy()` / `to_torch()`) reads the flag and raises. This follows the same pattern as CUDA async errors, so launches do not pay a per-launch sync. If you want the exception to land exactly at the offending kernel rather than at the next sync, call `qd.sync()` right after the kernel.
- Adstack trades compile time for generality. Kernels with many loop-carried variables, nested dynamic loops, or large inner-loop bodies produce visibly slow compile times - seconds stretching into minutes, and on SPIR-V backends sometimes into the territory where the driver's shader compiler gives up. Budget compile time accordingly when migrating existing reverse-mode AD workloads.
- Reverse-mode AD does not propagate gradients through integer casts or non-real operations. No error is raised; the gradient simply stops at the cast and silently reads as zero upstream. Cast to `qd.f32` / `qd.f64` before the differentiable section.
- Backward passes on non-trivial kernels run noticeably slower than the corresponding forward pass, sometimes by an order of magnitude on SPIR-V.
- **Loop bounds read from a writable ndarray are unsafe.** If a reverse-mode kernel has a loop whose iteration count comes from `n[j]`, the sizer snapshots `n[j]` at backward-dispatch entry. Any scenario that makes that snapshot diverge from the value the kernel body ends up executing against leaves the adstack undersized, and the gradient silently comes out zero or wrong. Two patterns to avoid: (a) the same kernel writes `n[j]` before the loop reads it (`for i in range(n[j])` after `n[j] = something`); (b) across kernels, where `kernel_A` writes `n` and `kernel_B.grad()` reads it as a loop bound - the per-launch numpy-ndarray upload can overwrite the device copy with host-side state that differs from what `kernel_A` left behind, and for `qd.ndarray` a missing host-side barrier between the two launches has the same effect. Quadrants does not detect either pattern. Workaround: treat ndarrays the reverse-mode kernel reads as loop bounds as read-only within and across the relevant kernel chain - compute iteration counts once, store them in a dedicated scalar field or ndarray that is never written during the backward path, and iterate over that (see the sketch after this list).
- **Loop bounds that read an ndarray the same kernel writes are unsafe.** If a reverse-mode kernel has a loop whose iteration count comes from `n[j]` and the same kernel also writes to `n[j]` before that loop runs (`for i in range(n[j])` after `n[j] = something`), the computed gradient may silently come out zero or wrong. Quadrants does not currently detect this. Workaround: put iteration counts in a separate ndarray (or scalar field) the kernel only reads, or split the write and the differentiable loop into two `@qd.kernel` calls. The same caveat applies across kernels - a kernel that writes an ndarray read as a loop bound by a later `.grad()` call needs the host-side contents of that ndarray to already hold the value the backward kernel will execute against.
- **Deeply nested dynamic-range loops can still trip an adstack overflow at runtime even when every loop bound is inside the pre-pass grammar.** The compiler builds a symbolic upper-bound expression for each adstack and evaluates it at dispatch time against the live ndarray contents; on kernels with many levels of nested `for i in range(arr[...])` feeding cumulative-index patterns the expression can be mathematically tighter than the actual push count. The symptom is `Adstack overflow (offending stack_id=N)` on the next `qd.sync()`. Workaround until the pre-pass grammar is extended: shorten the reverse kernel's loop nest, precompute the worst-case inner trip count into a scalar field the kernel only reads, or split the inner section into its own `@qd.kernel`.
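To make the read-only loop-bound workaround from the two "unsafe loop bounds" bullets concrete, here is a minimal sketch in the `qd` kernel style used elsewhere in this PR. The names `m`, `trip`, `acc`, `w`, and `x` are hypothetical, and the decorator / field-declaration signatures are assumed from the surrounding docs rather than taken from the real API:

```python
# Unsafe: the kernel writes n_arr[j] and then reads it as a loop bound, so the
# sizer's backward-dispatch snapshot of n_arr[j] can diverge from the value the
# body executes against, silently under-sizing the adstack.
@qd.kernel
def unsafe(n_arr: qd.types.ndarray()):   # annotation syntax assumed
    for j in range(m):
        n_arr[j] = j + 1                 # write ...
        for i in range(n_arr[j]):        # ... then read as a bound: snapshot hazard
            acc[j] += w[i] * x[j]

# Safe: iteration counts live in a dedicated field the backward path never
# writes; compute them once, in a separate kernel, before calling .grad().
@qd.kernel
def set_counts():
    for j in range(m):
        trip[j] = j + 1

@qd.kernel
def body():
    for j in range(m):
        for i in range(trip[j]):         # trip is read-only across the kernel chain
            acc[j] += w[i] * x[j]

set_counts()
body()
body.grad()
```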
🟡 The Known Limitations section at docs/source/user_guide/autodiff.md:313 ("Adstack overflow does not exist on LLVM backends for kernels the pre-pass can bound (every shape in the grammar listed under "Under the hood: adstack capacity and memory")") directly contradicts line 318 in the same bullet list ("Deeply nested dynamic-range loops can still trip an adstack overflow at runtime even when every loop bound is inside the pre-pass grammar") — both explicitly reference shapes inside the grammar so they cannot both hold. Two supporting claims earlier in the same doc are also false relative to quadrants/transforms/determine_ad_stack_size.cpp:1370-1376, which only emits QD_WARN and falls back to alloca->max_size = config.default_ad_stack_size for unresolved grammar shapes: line 256 ("Loop shapes outside this grammar raise a compile error ... silent over-allocation is not possible") and line 258 ("default_ad_stack_size and ad_stack_size are ignored on LLVM"). Doc-only, low blast radius but user-misleading; the author already committed to a docs fix in resolved comment #3137465122 and the active hughperkins threads (#3138031879 / #3138006046 / #3138009751) are going to touch this section anyway, so it's worth rewriting the four lines to describe the actual warn-and-fallback semantics rather than shipping an internal contradiction.
Extended reasoning
What the bug is
docs/source/user_guide/autodiff.md contains an internal contradiction inside the Known Limitations bullet list that ships with this PR:
- Line 313: "Adstack overflow does not exist on LLVM backends for kernels the pre-pass can bound (every shape in the grammar listed under "Under the hood: adstack capacity and memory")."
- Line 318: "Deeply nested dynamic-range loops can still trip an adstack overflow at runtime even when every loop bound is inside the pre-pass grammar."
Both bullets explicitly reference shapes inside the grammar. They cannot both be true.
Two other load-bearing claims earlier in the document are also false relative to the code this PR ships:
- Line 256: "Loop shapes outside this grammar raise a compile error naming the source location of the offending adstack alloca; silent over-allocation is not possible."
- Line 258: "`default_ad_stack_size` and `ad_stack_size` are ignored on LLVM."
The code path
`quadrants/transforms/determine_ad_stack_size.cpp` runs a Bellman-Ford + structural `SizeExpr` pre-pass; any alloca still unresolved afterwards falls into the branch at lines 1362-1376:

```cpp
const int fallback = std::max<int>(config.default_ad_stack_size, 1);
QD_WARN(
    "adstack bound at {} is unresolved after Bellman-Ford + structural pre-pass{}; falling back to"
    " default_ad_stack_size={}.{} ...",
    alloca->get_last_tb(), cycle_note, fallback, dump_hint);
alloca->max_size = static_cast<std::size_t>(fallback);
alloca->size_expr = SizeExpr::make_const(static_cast<int64_t>(fallback));
```

That is `QD_WARN` (not `QD_ERROR`) and `max_size = default_ad_stack_size`. The in-code comment at lines 1346-1354 explicitly says hard-erroring was rolled back because it broke `test_double_for_loops` / `test_double_for_loops_more_nests`, and that the runtime overflow check still fires at `qd.sync()` if the fallback is exceeded. So:

- Line 256 is wrong: there is no compile error, and silent over-allocation is exactly what the fallback does.
- Line 258 is wrong: `default_ad_stack_size` is the load-bearing knob in that fallback path on LLVM.
- Line 313 is wrong (or, equivalently, line 318 is wrong): the grammar-covered-but-pre-pass-gave-up case exists and is a runtime overflow path.
Step-by-step proof

- User writes a reverse-mode kernel with a shape the grammar lists as supported in the "Under the hood" section (e.g. a FieldLoad-bounded loop like `for j in range(field[i])` where `i` is a loop-carried variable — the comment at determine_ad_stack_size.cpp:1353 calls out exactly this pattern as "the grammar does not (yet) cover those shapes").
- Bellman-Ford leaves `alloca->max_size == 0` (positive loop).
- Structural `build_value_expr` returns `nullptr` (a grammar gap inside a shape the user-facing docs called supported).
- The fallback loop at lines 1362-1376 fires: `QD_WARN` is emitted (trivial to miss), `alloca->max_size = 256` (the default), and the compile succeeds.
- At launch, `publish_adstack_metadata` writes `256` into `adstack_max_sizes[i]`. The actual kernel depth exceeds 256, trips `runtime->adstack_overflow_flag`, and `qd.sync()` raises `QuadrantsAssertionError`.
- This manifests the behavior line 318 admits. Line 313 promised it could not.
Impact
Docs-only, low blast radius for existing users (no code break), but actively misleading for any reader trying to reason about whether they need `default_ad_stack_size` on LLVM. A user reading line 258 will not set the knob; a user reading line 313 will not account for overflow paths that actually exist in the same kernels. The internal 313-vs-318 contradiction is also a self-refuting artifact in the same bullet list: reading top-to-bottom, the user cannot tell which bullet to trust.
Addressing the duplicate refutation
One verifier flagged this as a duplicate of the earlier inline-comment 3137066389 ("previous_5"). That earlier comment was marked resolved at #3137465122 with the author saying the docs were updated to match the warn-and-fallback semantics. Reading the updated text at lines 246/256/258/313/318 in this PR's current diff, the update was incomplete: lines 256 / 258 still carry the old "compile error" / "ignored on LLVM" claims, and the 313 / 318 pair is a new internal contradiction introduced by the same rewrite. So this is not a no-op re-submission — the three docs-vs-code mismatches are still live and the self-contradiction is specific to this PR's text. Re-reporting as nit rather than normal because the direction of the fix (doc rewrite) is already accepted per the author's own resolution comment, and because hughperkins' active review threads (#3138031879 flags exactly line 313, #3138009751 asks to move SizeExpr/grammar details out of user-facing docs, #3138006046 flags the whole advanced section) are going to touch these lines anyway — the report is a specific concrete contradiction to resolve when that rewrite lands, not a fresh direction to take.
How to fix
Rewrite lines 246 / 256 / 258 / 313 to describe the actual warn-and-fallback semantics, matching the intent of line 318:

- Line 246: drop the "no runtime overflow path to hit" claim — the runtime overflow path exists for grammar gaps.
- Line 256: replace "raise a compile error ... silent over-allocation is not possible" with "emit a `QD_WARN` at compile time naming the source location and fall back to `default_ad_stack_size`; runtime overflow still surfaces at the next `qd.sync()` if the fallback is exceeded".
- Line 258: change "ignored on LLVM" to "still used on LLVM for adstacks whose bound is outside the `SizeExpr` grammar (the warn-and-fallback path)".
- Line 313: either remove it entirely or rephrase it to match line 318 (runtime overflow exists for the grammar-gap fallback path; it does not exist for shapes the pre-pass can symbolically bound).
Alternatively, per hughperkins' review comments, move the SizeExpr grammar / sizing detail out of the user-facing guide and into an internal README, which makes the precision question moot at the user-facing surface.
🔬 also observed by verify-runtime
```cpp
// Two `GlobalLoadStmt` source shapes matter: before `lower_access` the source is a `GlobalPtrStmt` that carries
// the snode and the raw index stmts; after `lower_access` it becomes a `GetChStmt` rooted in a chain of
// `SNodeLookupStmt` / `GetChStmt` / `GetRootStmt`. `determine_ad_stack_size` runs after `lower_access` in the
// normal pipeline (the structural pre-pass used to accept only the pre-lowering shape and therefore gave up on
// every scalar field load), so we walk both forms here.
```
🟡 Two stale comments and a now-unreachable branch from this PR's earlier reorder of the adstack pre-pass. (1) In quadrants/transforms/determine_ad_stack_size.cpp:593-597, the GlobalLoadStmt comment says the pre-pass "runs after lower_access in the normal pipeline" — the opposite is true now that it runs at compile_to_offloads.cpp:161 (before offload_to_executable's lower_access at line 268), and the GetChStmt sub-branch at lines 746-858 is dead code in the normal pipeline since GetChStmt is only produced by lower_access. (2) In tests/python/test_adstack.py:1651-1655, the "CPU leg is load-bearing for the chunk-loop guard" rationale is stale for the same reason — make_cpu_multithreaded_range_for runs at compile_to_offloads.cpp:210 (inside offload_to_executable), strictly after the pre-pass. Nit-level: no functional impact today, but the comment is actively misleading and the dead branch will bit-rot; fix by deleting the GetChStmt branch, updating the GlobalLoadStmt comment to state the pre-pass operates only on pre-lower_access IR, and dropping the chunk-loop rationale from the test comment (the alloca_scope_chain guard in compute_bounded_adstack_size is still justified by the stack_init per-entry reset invariant).
Extended reasoning
What the stale comments and dead code claim vs. what the pipeline actually does
This PR (the commit adding `determine_ad_stack_size` to `compile_to_offloads`) moved the adstack pre-pass from `offload_to_executable` to the end of `compile_to_offloads`. Three artifacts of that move were left behind:
1. Stale in-code comment at `determine_ad_stack_size.cpp:593-597`:

```cpp
// Two `GlobalLoadStmt` source shapes matter: before `lower_access` the source is a `GlobalPtrStmt` ...
// after `lower_access` it becomes a `GetChStmt` ... `determine_ad_stack_size` runs after `lower_access`
// in the normal pipeline ... so we walk both forms here.
```

The factual claim "`determine_ad_stack_size` runs after `lower_access` in the normal pipeline" is now false. Evidence:

- `compile_to_offloads.cpp:161` (inside `compile_to_offloads`) calls `irpass::determine_ad_stack_size(ir, config)`.
- `compile_to_offloads.cpp:268` (inside `offload_to_executable`) calls `irpass::lower_access(...)`.
- `compile_to_executable` runs `compile_to_offloads` first and `offload_to_executable` second (via `KernelCodeGen::compile_kernel_to_module`'s sequencing).

So on the normal pipeline, the pre-pass sees pre-`lower_access` IR — the file's own comment at `compile_to_offloads.cpp:288-292` explicitly acknowledges `(void)determine_ad_stack_size;` because the old call site inside `offload_to_executable` was removed.
2. Dead `GetChStmt` sub-branch at `determine_ad_stack_size.cpp:746-858`:

`GetChStmt` nodes are only produced by `lower_access` (e.g. the `push_back<GetChStmt>` sites in `lower_access.cpp`). Since the pre-pass now runs strictly before `lower_access` in every production code path, `load->src->cast<GetChStmt>()` at line 746 always returns `nullptr` and the entire ~112-line branch is unreachable. The C++ unit tests in `tests/cpp/transforms/determine_ad_stack_size_test.cpp` construct IR via `IRBuilder` directly and never invoke `lower_access`, so this branch has zero test coverage.
3. Stale test comment at `tests/python/test_adstack.py:1651-1655`:

```python
# The CPU leg is load-bearing for the chunk-loop guard: the grad-kernel's CPU-multithreaded outer
# `range_for` wraps the user body with `min(N, (thread_idx << k) + chunk_size)`-style bounds,
# and the structural pre-pass must stop walking at the `AdStackAllocaStmt`'s own scope ...
# or the outer chunk loop's unparseable bounds would leak into the multiplier and flip the size to unresolved.
```

`make_cpu_multithreaded_range_for` runs at `compile_to_offloads.cpp:210` (inside `offload_to_executable`), which executes strictly after `compile_to_offloads`'s pre-pass call. So at the point `determine_ad_stack_size` runs, the `min(N, (thread_idx << k) + chunk_size)` chunk wrapper the comment describes does not yet exist in the IR. The file itself spells this out in the same pass at `determine_ad_stack_size.cpp:427-429` — "the user's original RangeForStmt bounds are still visible here and the chunk wrapper's min(end_stmt, bit_shl-add) shape never appears in this pre-pass's input IR."
Why this is worth filing despite no functional impact
Severity is clearly nit — all six verifiers agreed on that; none of the three items affects correctness today. But:

- The in-code comment at 595-596 is load-bearing for a future reader's mental model of when the pass runs relative to `lower_access`. A maintainer debugging the pass will consult this comment before any external docs, and the current wording sends them looking in the wrong direction.
- The `GetChStmt` branch is 112 lines of conditionally-compiled code that (a) adds maintenance burden to any future refactor of the `GlobalLoadStmt` handler, (b) has no test coverage, and (c) will bit-rot silently because no CI exercise touches it. If the pass is ever re-added to a post-`lower_access` position, the branch should be re-derived against the current IR shape rather than restored verbatim from a drifting copy.
- The test comment compounds the misleading documentation: a reader seeing the CPU gate thinks the test is specifically covering the chunk-loop interaction, when in fact it is covering the field-load sizing, and the CPU gate is just a follow-up-PR scoping restriction.
Step-by-step proof

- Open `compile_to_offloads.cpp`. Line 161 (inside `compile_to_offloads`): `irpass::determine_ad_stack_size(ir, config);`. Line 210 (inside `offload_to_executable`): `irpass::make_cpu_multithreaded_range_for(ir, config);`. Line 268 (inside `offload_to_executable`): `irpass::lower_access(...);`.
- Search `KernelCodeGen::compile_kernel_to_module` / `compile_to_executable`: `compile_to_offloads` is invoked before `offload_to_executable`.
- Therefore when `determine_ad_stack_size` runs, neither `lower_access` nor `make_cpu_multithreaded_range_for` has executed yet.
- `GetChStmt` is only produced by `lower_access` (grep `push_back<GetChStmt>` — hits are in `lower_access.cpp`).
- Therefore `GlobalLoadStmt::src` is never `GetChStmt` during this pre-pass — the branch at `determine_ad_stack_size.cpp:746-858` is unreachable.
- Therefore the outer chunk-loop wrapper the test comment describes does not exist at pre-pass time — that specific rationale for the `alloca_scope_chain` guard no longer applies (though the guard itself is still sound for the `stack_init` reset reason).
How to fix
- `quadrants/transforms/determine_ad_stack_size.cpp:593-597`: rewrite the comment to state that the pre-pass operates only on pre-`lower_access` IR (so `GlobalLoadStmt::src` is always `GlobalPtrStmt` or `ExternalPtrStmt` / `GlobalTemporaryStmt`).
- `quadrants/transforms/determine_ad_stack_size.cpp:746-858`: delete the `GetChStmt` branch entirely; if the pass is ever re-added post-`lower_access`, re-derive it against the then-current IR shape.
- `tests/python/test_adstack.py:1651-1655`: drop the chunk-loop rationale. The `alloca_scope_chain` guard in `compute_bounded_adstack_size` is still justified by the `AdStackAllocaStmt::stack_init` reset-per-entry invariant (the walker must stop at the alloca's own scope so outer loops do not compound per-iteration depth).
```cpp
// Publish the per-task adstack metadata into the LLVMRuntime struct and size the heap. The codegen path loads
// stride / offset / max_size from these fields at every `AdStack*` site (see `ensure_ad_stack_metadata_llvm` in
// codegen_llvm.cpp), so we must write them before every launch even for tasks where the compile-time and
// launch-time bounds agree. `evaluate_adstack_size_expr` is called only when the symbolic tree is available; the
// offline cache does not currently serialize `SizeExpr`, so cache hits fall back to `max_size_compile_time`.
```
🟡 The comment at llvm_runtime_executor.cpp:564-568 claims the offline cache does not serialize SizeExpr and cache-hit kernels therefore fall back to max_size_compile_time, but the cache actually round-trips the symbolic tree end-to-end (AdStackSizingInfo::QD_IO_DEF at llvm_compiled_data.h:54-62 includes size_exprs, and both SerializedSizeExpr and SerializedSizeExprNode carry full QD_IO_DEFs). A matching stale comment at llvm_compiled_data.h:28-32 ("after offline-cache load where the symbolic tree is not serialized") directly contradicts the accurate sibling comment at lines 49-52 in the same file. Nit: update both comments to say that SizeExpr round-trips through the cache and that max_size_compile_time is only the fallback for an empty tree (Bellman-Ford-resolved or unresolved kernels).
Extended reasoning
What the bug is
Two stale in-code comments claim that the offline cache does not serialize `SizeExpr`, and that a cache hit therefore routes every adstack's sizing back through `max_size_compile_time`. Neither is true under this PR.

- `quadrants/runtime/llvm/llvm_runtime_executor.cpp:564-568`:

```cpp
// `evaluate_adstack_size_expr` is called only when the symbolic tree is available; the
// offline cache does not currently serialize `SizeExpr`, so cache hits fall back to
// `max_size_compile_time`.
```

- `quadrants/codegen/llvm/llvm_compiled_data.h:28-32`: "`max_size_compile_time` is the compile-time bound (used when `size_expr` is absent, e.g. after offline-cache load where the symbolic tree is not serialized)".

Both sites carry the same stale claim. The sibling comment on `AdStackSizingInfo::size_exprs` at `llvm_compiled_data.h:49-52` already states the correct behaviour ("The flat post-order form survives the offline cache (an empty nodes vector means 'no symbolic bound captured'..."), so the file is internally self-contradictory.
Verification that SizeExpr does survive the cache

The cache path serializes every field via `QD_IO_DEF`:

- `AdStackSizingInfo::QD_IO_DEF` at `llvm_compiled_data.h:54-62` explicitly lists `size_exprs` alongside `allocas` and `per_thread_stride`.
- `SerializedSizeExpr` at `quadrants/ir/adstack_size_expr.h:56` is `QD_IO_DEF(nodes)`.
- `SerializedSizeExprNode` at `adstack_size_expr.h:39` is `QD_IO_DEF(kind, const_value, snode_id, indices, operand_a, operand_b, var_id, body_node_idx, arg_id_path, arg_shape_axis)` — every field of the flat node.
- `OffloadedTask::ad_stack` participates in `QD_IO_DEF` (via the enclosing task serialization), so the `AdStackSizingInfo` on each task rehydrates fully from cache.

A cache-hit kernel therefore reloads its `size_exprs` vector, and `publish_adstack_metadata` runs `evaluate_adstack_size_expr` on it exactly like a fresh compile. `max_size_compile_time` is only read when the tree is empty (`expr->nodes.empty()`), which happens for kernels whose capacity the Bellman-Ford phase resolved structurally without emitting a symbolic tree, or the warn-and-fallback path. That is a tree-emptiness gate, not a cache-hit gate.
Impact
No runtime behaviour is affected — `SizeExpr` really does flow through the cache and really is evaluated on cache-hit launches. The defect is purely documentation: a maintainer auditing the cache-hit path (e.g. wondering why a cached kernel still works when bumping `default_ad_stack_size`, or debugging a per-launch sizer failure they suspect is cache-triggered) will read either stale comment and form an incorrect mental model of what runs when. The two stale sites share copy-pasted language, so the next reader hitting both sees a consistent but wrong picture.
Step-by-step proof
CompiledKernelDataserialises viaQD_IO_DEF(compiled_data)→LLVMCompiledKernel::QD_IO_DEF(tasks)→OffloadedTask::QD_IO_DEF(..., ad_stack)→AdStackSizingInfo::QD_IO_DEF(..., size_exprs).- On cache load,
size_exprs[i]comes back non-empty whenever the pre-pass captured a symbolic tree. publish_adstack_metadatapicks theSerializedSizeExpr *exprout ofad_stack.size_exprs[i], checks!expr->nodes.empty(), and callsevaluate_adstack_size_expr(CPU) orencode_adstack_size_expr_device_bytecode+runtime_eval_adstack_size_expr(GPU).max_size_compile_timeis read only via theif (v < 0)branch (CPU) orroot_node_idx < 0(device), both of which correspond to empty-tree fallback — independent of cache state.
How to fix
Rewrite the `llvm_runtime_executor.cpp:564-568` comment to describe the actual gate (tree-emptiness, not cache-hit), and rewrite the `llvm_compiled_data.h:28-32` comment to match the accurate sibling at lines 49-52. Two small one-line rewrites; no code change needed.
…-evaluated SizeExpr on CPU / CUDA / AMDGPU
…ia same-axis ExternalTensorShape match
… tests to cover CUDA
…er kernel on AMDGPU

The runtime-eval sizer kernel dereferences `ctx->arg_buffer` on device to resolve `ExternalTensorRead` leaves against ndarray pointers. CUDA survived via UVA (the host `&ctx->get_context()` resolves transparently on the device); AMDGPU/HIP has no UVA for non-pinned host allocations, so the host pointer faulted inside the kernel and the next DtoH sync raised `hipErrorIllegalAddress ... while calling memcpy_device_to_host` on every reverse-mode kernel whose adstack has a non-empty `size_exprs` tree.

The AMDGPU launcher already stages a device-side copy of the `RuntimeContext` at `kernel_launcher.cpp:185-186`; thread that `context_pointer` through `publish_adstack_metadata` and use it as the sizer kernel's `ctx` argument. CUDA / CPU pass `nullptr` and the function falls back to `&ctx->get_context()`, preserving the existing UVA / in-process paths there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… tests to cover AMDGPU
Follow-up to the AMDGPU device-context fix: extend the five CUDA-covered
runtime-eval tests (`field_load_bounded_loop_evaluated_per_launch`,
`inner_range_bounded_by_ndarray_read_at_outer_index`,
`ext_tensor_read_indexed_by_stashed_outer_loop_var`,
`structural_pre_pass_fuses_sub_of_max_over_range_with_{matching,mismatched}_shape_ends`)
to AMDGPU now that the sizer kernel receives a device-side
`RuntimeContext` pointer and runs cleanly there.
Also drop the redundant `ad_stack_experimental_enabled=True` on
`test_adstack_triangular_ndrange_self_referential_push_idempotency`:
`test_utils.test` already sets that option unconditionally whenever
`qd.extension.adstack` is in `require`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e_a, shape_b) so the fused body cannot read the shorter ndarray past its buffer end
…tmp hardcoded dump path from adstack SizeExpr comments
…_struct_arg_host so CUDA/AMDGPU launchers no longer deref a swapped device arg_buffer from host
…M-iter ranges and hoist os import in test_adstack
…prDeviceMaxBoundVars) at encode time so the device interpreter's fixed-size scope array stays in bounds regardless of the host-side monotonic var_id counter
…ainst max_size_compile_time (it is only a conservative seed, not a proven bound) and precompute per-axis strides in host evaluate_external_tensor_read so CPU multi-dim ndarray reads resolve to the correct cell
…n: parametrize the mismatched-lengths test with a peak-past-shape-b case that silently under-sizes the adstack, and add a fld[i]-arr[i] test that the fusion must not synthesize a mixed FieldLoad+ExtRead body the LLVM-GPU encoder cannot lower
…h_shape_same_axis widening (cross-ndarray independent loops no longer fabricate an index pairing that drops arr_a peaks past shape_b); decline fusion when the synthesized body would mix a FieldLoad indexed by a free bound variable with an ExternalTensorRead (the LLVM device interpreter has no on-device SNode access); fall back to MaxOverRange_a alone as a sound upper bound on Sub(MaxOverRange, MaxOverRange) whenever fusion cannot fire
… SizeExpr sizer kernel when the CUDA driver + kernel do not expose HMM / system-allocated memory (queried via CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS), mirroring the AMDGPU launcher's existing pattern so the sizer's ctx->arg_buffer PSB read no longer faults with CUDA_ERROR_ILLEGAL_ADDRESS on Turing and older (T4, sm_75) or on any Windows / pre-535 Linux host; HMM-equipped setups keep the zero-staging host-pointer fast path and forward-only launches short-circuit before allocating
Runtime-evaluated adstack sizing on LLVM backends
TL;DR
Before this PR, the inner `range(n[None])` forced the CFG analyzer into its `default_ad_stack_size=256` fallback, so `n[None] = 512` overflowed at the next `qd.sync()`. After this PR the compiler captures the loop bound as a `SizeExpr`, the launcher evaluates it against `n[None]` at launch time, and the backing heap is grown to exactly `512 + 2` slots.
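A minimal sketch of the kernel shape the TL;DR describes, assuming a Taichi-style `qd` frontend surface (`qd.init`, `qd.field`, `needs_grad`, `.grad()`) inferred from the tests in this PR; exact signatures may differ:

```python
import quadrants as qd  # import name assumed; the PR's tests use the `qd` prefix

qd.init(arch=qd.cpu)

n = qd.field(qd.i32, shape=())                    # scalar i32 field: in-grammar loop bound
x = qd.field(qd.f32, shape=(), needs_grad=True)
loss = qd.field(qd.f32, shape=(), needs_grad=True)

@qd.kernel
def f():
    s = 0.0
    for _ in range(n[None]):   # dynamic trip count read from a field
        s += x[None] * x[None]
    loss[None] = s

n[None] = 512
loss.grad[None] = 1.0
f()
f.grad()   # adstack sized per launch: the SizeExpr evaluates to 512 + 2 slots
qd.sync()  # an overflow, if any, would surface here
```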
Why

Reverse-mode AD through a dynamic loop requires an adstack whose depth equals the forward trip count plus two setup pushes. Until now Quadrants had to resolve that depth at compile time, so every loop bound that involved a field load (`n_bodies`, `n_dofs`, an ndarray shape, an enclosing parallel-for index) fell back to the user-configured `default_ad_stack_size`. For realistic reverse-mode kernels this falls short: `ndrange(n_bodies, n_dofs)` easily needs >256 slots, triggering a runtime `QuadrantsAssertionError: Adstack overflow`.

The traditional workaround - tuning `default_ad_stack_size` per program - is too coarse: you cannot set a different default for every adstack, and bumping it globally multiplies GPU memory by the number of threads. This PR replaces the static fallback with a symbolic upper-bound expression captured at compile time and reduced per launch against the live field / ndarray state, so each adstack gets exactly the depth that dispatch needs.
Entry point

`LlvmRuntimeExecutor::publish_adstack_metadata(const AdStackSizingInfo &ad_stack, std::size_t num_threads, LaunchContextBuilder *ctx, void *device_runtime_context_ptr = nullptr)` at `quadrants/runtime/llvm/llvm_runtime_executor.cpp`. Called from `cpu`/`cuda`/`amdgpu` `kernel_launcher.cpp` right before every task's dispatch. It:

- reads the task's `size_exprs` vector (post-order `SerializedSizeExpr` trees, one per `AdStackAllocaStmt`);
- evaluates each tree via `adstack_size_expr_eval.cpp`, reading scalar field values through `SNodeRwAccessorsBank` and launch-context ndarray shapes / scalar reads through the `LaunchContextBuilder`. The AMDGPU path threads the launcher's device-side `RuntimeContext` staging copy (`kernel_launcher.cpp:185-186`) through `device_runtime_context_ptr` so HIP kernels that resolve `ExternalTensorRead` leaves against ndarray pointers do not fault on a host pointer; CUDA relies on UVA and passes `nullptr`; CPU ignores the argument;
- publishes the results into `LLVMRuntime::adstack_{per_thread_stride,offsets,max_sizes}` via `runtime_get_adstack_metadata_field_ptrs` so generated code reads them at every push / load-top site;
- grows the heap via `ensure_adstack_heap(per_thread_stride * dispatched_threads)`.
range(field[i])/range(ndarray[i])indexed by a loop-carried variable that the structural walker does not fold) fall back todefault_ad_stack_sizeand emit aQD_WARNnaming the source location so the user can either restructure the loop or extend the grammar.Mechanism end-to-end
1. Compile-time `SizeExpr` capture

`quadrants/transforms/determine_ad_stack_size.cpp` runs a Bellman-Ford-first, structural-walker-second pre-pass that attempts to express each `AdStackAllocaStmt`'s max depth as a `SizeExpr` tree. Grammar covered:

- constant loop bounds → `SizeExpr::make_const`
- `GlobalLoadStmt` of a scalar `i32`/`i64` `SNode` → `make_field_load` with the SNode pointer + constant index path
- `ExternalTensorShapeAlongAxisStmt` → `make_ndarray_shape(arg_id, axis)`
- ndarray element reads → `make_ndarray_read(arg_id, index_path)`
- `LoopIndexStmt` of an enclosing bounded range-for / parallel-for → `make_loop_index_max` (bounded by the loop's `end` or a `max` over every push site)
- `AdStackLoadTopStmt` of a stash adstack → `make_stack_load_top_max` (bounded by a `max` over every push site)
add,sub,mul(const * anything viaAdd-replication),max.expr_sub's MaxOverRange fusion caps the fused iteration range atmin(shape_a, shape_b)(expressed in-grammar asa + b - max(a, b)) so the fused body cannot read the shorter ndarray past its buffer end when a caller passes mismatched-length ndarrays. Anything else falls back todefault_ad_stack_sizeand emits theQD_WARNnoted above.SizeExprobjects are immutable after the pre-pass writes them and are serialised as post-orderSerializedSizeExpr::nodesso offline-cache hits evaluate the same way as fresh compiles.2. Codegen reads stride / offset / max-size from runtime metadata
2. Codegen reads stride / offset / max-size from runtime metadata

- `codegen_llvm.cpp` `AdStackAllocaStmt`: the stack base is `heap + linear_tid * runtime->adstack_per_thread_stride + runtime->adstack_offsets[stack_id]` (all three loaded at runtime).
- `codegen_llvm.cpp` `AdStackPushStmt`: `stack_push(runtime, stack, runtime->adstack_max_sizes[stack_id], element_size)` (the cap is a runtime load, not an IR immediate).
- `init_offloaded_task_function` pre-scan: assigns `AdStackAllocaStmt::stack_id` in declaration order; `ad_stack_offsets_` became a `std::vector<std::size_t>` indexed by `stack_id`.

The heap base address is unchanged; only the per-thread stride and per-alloca offset math was moved from codegen-time immediates to runtime loads, so kernels compiled with the old flow will recompile but do not break semantics.
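A hedged Python model of the addressing scheme above: assuming per-alloca offsets are a running prefix sum of each stack's `max_size * element_size` and the per-thread stride is their total (a plausible layout; the runtime's actual packing may differ), a stack's base is `heap + linear_tid * stride + offsets[stack_id]`. Function and variable names are illustrative:

```python
def layout_adstacks(stacks):
    """stacks: list of (max_size, element_size) indexed by stack_id.
    Returns (per_thread_stride, offsets) mirroring the runtime metadata."""
    offsets, cursor = [], 0
    for max_size, element_size in stacks:
        offsets.append(cursor)
        cursor += max_size * element_size
    return cursor, offsets

def stack_base(heap, linear_tid, per_thread_stride, offsets, stack_id):
    # Mirrors the codegen expression:
    # heap + linear_tid * runtime->adstack_per_thread_stride
    #      + runtime->adstack_offsets[stack_id]
    return heap + linear_tid * per_thread_stride + offsets[stack_id]

# stack 0: 512 + 2 slots of 4 bytes; stack 1: 32 slots of 8 bytes
stride, offs = layout_adstacks([(512 + 2, 4), (32, 8)])
assert (stride, offs) == (2312, [0, 2056])
assert stack_base(0, 3, stride, offs, 1) == 3 * 2312 + 2056  # == 8992
```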
3. Host evaluator
`adstack_size_expr_eval.{h,cpp}` walks the flat post-order tree:

- Field-load leaves go through `SNodeRwAccessorsBank` to read the current scalar value on the host side.
- Ndarray shape / read leaves resolve against the `LaunchContextBuilder`, which captures both the pointer and the shape on kernel launch.
- `LoopIndexMax` and `StackLoadTopMax` fold to whichever launch-time bound they were captured against.
- All arithmetic is carried out in `int64_t`.

The evaluator runs strictly on the host, before the kernel dispatches, so there is no device-side evaluator to keep in sync.
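A hedged Python model of evaluating a flat post-order `SerializedSizeExpr::nodes` array, in the spirit of the host evaluator described above; the node kinds and callback names are simplified stand-ins for the real structs, not the actual API:

```python
def eval_size_expr(nodes, read_field, ndarray_shape):
    """nodes: post-order list of (kind, payload) tuples.
    read_field / ndarray_shape are host-side callbacks standing in for
    SNodeRwAccessorsBank and LaunchContextBuilder lookups."""
    stack = []
    for kind, payload in nodes:
        if kind == "const":
            stack.append(int(payload))
        elif kind == "field_load":          # scalar field read on the host
            stack.append(int(read_field(payload)))
        elif kind == "ndarray_shape":       # (arg_id, axis) captured at launch
            stack.append(int(ndarray_shape(*payload)))
        elif kind in ("add", "sub", "mul", "max"):
            b, a = stack.pop(), stack.pop()
            stack.append({"add": a + b, "sub": a - b,
                          "mul": a * b, "max": max(a, b)}[kind])
        else:
            raise ValueError(f"unsupported node kind: {kind}")
    return stack[-1]  # root of the post-order tree

# n[None] + 2 with n[None] == 512 at launch time:
nodes = [("field_load", "n"), ("const", 2), ("add", None)]
assert eval_size_expr(nodes, read_field=lambda _: 512,
                      ndarray_shape=lambda arg, axis: 0) == 514
```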
4. Per-kernel metadata bridge
`runtime_get_adstack_metadata_field_ptrs(LLVMRuntime *)` in `runtime.cpp` returns pointers to `adstack_{per_thread_stride,offsets,max_sizes}`. The host writes into these fields once per dispatch. On CPU the fields are read directly from `LLVMRuntime`; on CUDA / AMDGPU the bridge triggers a `memcpy_host_to_device` so the generated code's subsequent loads see the right values. The memcpy is gated on the task actually carrying an adstack: kernels with no reverse-mode adstacks skip `publish_adstack_metadata` entirely and pay no per-launch cost.
Per-backend coverage matrix

- CPU (`cpu/kernel_launcher.cpp`): in-grammar bounds evaluated on the host; `default_ad_stack_size` fallback + `QD_WARN` outside the grammar; metadata fields read directly from `LLVMRuntime`, no memcpy.
- CUDA (`cuda/kernel_launcher.cpp`): same sizing semantics; metadata published via `memcpy_host_to_device`.
- AMDGPU (`amdgpu/kernel_launcher.cpp`): same sizing semantics; metadata published via `memcpy_host_to_device`, with the host pointer routed through the launcher's device-side `RuntimeContext` staging copy.

Tests
`tests/python/test_adstack.py`:

- `test_adstack_field_load_bounded_loop_evaluated_per_launch` - pins the core guarantee: a kernel with `for _ in range(n[None])` bound by an i32 scalar field. Launches at `n_iter in (1, 20, 50)` and asserts the heap resized correctly (no overflow at 50, no over-allocation at 1), with the decorator set to `default_ad_stack_size=2` to prove the knob is bypassed when the grammar resolves the bound symbolically.
- `test_adstack_near_capacity` - rewritten to exercise both sides of the old `default_ad_stack_size=32` boundary at `n_iter in (30, 100)` without any knob override. Arch-restricted to `qd.cpu` until the SPIR-V sizer PR lands.
- `test_adstack_structural_pre_pass_fuses_sub_of_max_over_range_with_matching_shape_ends` / `..._with_mismatched_shape_ends` - pin the `expr_sub` MaxOverRange fusion for both the strict-equality and the cross-ndarray same-axis paths.
- `test_adstack_fuses_sub_of_max_over_range_with_mismatched_lengths_is_safe` - pins the `min(shape_a, shape_b)` end cap so the cross-ndarray fusion stays in-bounds when the two ndarrays have different lengths at launch time (pre-fix, the fused body read the shorter ndarray past its buffer end).
- `test_adstack_inner_range_bounded_by_ndarray_read_at_outer_index` / `test_adstack_inner_range_bounded_by_multidim_ndarray_read` / `test_adstack_ext_tensor_read_indexed_by_stashed_outer_loop_var` - cover the `ExternalTensorRead` grammar leaves, including multi-axis stride handling and the stash-then-reload pattern.

`tests/cpp/transforms/determine_ad_stack_size_test.cpp`:

- `SizeExprBuildFromRangeFor` - const bounds fold to a `Const` leaf.
- `SizeExprBuildFromFieldLoad` - a scalar i32 field load builds a `FieldLoad` leaf with the right SNode + index path.
- `SizeExprBuildFromLoopIndex` - an inner `for j in range(i)` under an enclosing bounded loop folds to `LoopIndexMax`.
- `SizeExprRejectsUnsupportedOp` - negative control that an unsupported operator routes through the warn-and-fallback path.

Deliberate coverage gap: no SPIR-V cross-check here - the SPIR-V runtime metadata path lives in the stacked SPIR-V sizer PR and its tests cover the shader-side reading.
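A hedged sketch of the per-launch sizing pattern the first test pins, assuming the `test_utils.test` decorator semantics the commit messages describe (`require=qd.extension.adstack` implies the experimental flag) and Taichi-style gradient seeding; the body is illustrative, not the test's actual code:

```python
@test_utils.test(require=qd.extension.adstack, default_ad_stack_size=2)
def test_field_load_bounded_loop_evaluated_per_launch():
    n = qd.field(qd.i32, shape=())
    x = qd.field(qd.f32, shape=(), needs_grad=True)
    loss = qd.field(qd.f32, shape=(), needs_grad=True)

    @qd.kernel
    def f():
        s = 0.0
        for _ in range(n[None]):
            s += x[None]
        loss[None] = s

    # default_ad_stack_size=2 would overflow immediately if the knob were
    # still load-bearing; the SizeExpr path must bypass it per launch.
    for n_iter in (1, 20, 50):
        n[None] = n_iter
        loss.grad[None] = 1.0
        f()
        f.grad()
        qd.sync()   # an overflow would surface here if the heap were under-sized
```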
Side-effect audit
- `quadrants/analysis/offline_cache_util.cpp` serialises `SerializedSizeExpr::nodes` alongside the existing `size_in_bytes` per `AdStackAllocaStmt`; adstacks whose bound stays unresolved keep falling back to `default_ad_stack_size`.
- `statements.h`: `AdStackAllocaStmt::size_expr` is a `std::shared_ptr<SizeExpr>` cloned via the default `QD_DEFINE_ACCEPT_AND_CLONE`; `SizeExpr` is immutable post pre-pass, so the shared pointer is safe to share across clones.
- `ir/ir.cpp`: `AdStackAllocaStmt::repr` prints `stack_id` and a short `SizeExpr` summary.
- `AdStackAllocaStmt::stack_id == -1` falls through to the old `default_ad_stack_size`-backed path, so kernels without a `SizeExpr` still behave as before.
- `SizeExpr` is stored on `AdStackAllocaStmt`, not emitted as IR operands, so it does not participate in CSE.
- `adstack_{per_thread_stride,offsets,max_sizes}` are initialised to zero / null in `runtime_initialize`; kernels with no adstacks read nothing from them, skip `publish_adstack_metadata` entirely, and trigger no `memcpy_host_to_device`.
- `expr_sub`'s cross-ndarray fusion caps the fused end at `min(shape_a, shape_b)`.