
[AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr on LLVM backends (CPU/CUDA/AMDGPU)#550

Closed
duburcqa wants to merge 15 commits into main from duburcqa/adstack_llvm_runtime_size

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented Apr 23, 2026

Runtime-evaluated adstack sizing on LLVM backends

Makes reverse-mode adstack capacity automatic on CPU / CUDA / AMDGPU: the compiler captures a symbolic upper-bound expression per adstack alloca and the host launcher resizes the per-thread heap just before each dispatch. No user-facing API change; default_ad_stack_size / ad_stack_size remain accepted in qd.init() and the runtime falls back to default_ad_stack_size (with a QD_WARN) when an adstack's enclosing loop shape is outside the SizeExpr grammar. Shapes inside the grammar are sized exactly at launch time.

TL;DR

```python
import quadrants as qd

qd.init(arch=qd.cpu, ad_stack_experimental_enabled=True)

n = qd.field(qd.i32, shape=())
x = qd.field(qd.f32, shape=16, needs_grad=True)
y = qd.field(qd.f32, shape=(), needs_grad=True)

@qd.kernel
def compute():
    for i in x:
        v = x[i]
        for _ in range(n[None]):  # trip count known only at launch time
            v = v * 0.95 + 0.01
        y[None] += v

n[None] = 512
compute()
y.grad[None] = 1.0
compute.grad()  # no QuadrantsAssertionError: adstack sized to 512 + 2
```

Before this PR, the inner range(n[None]) forced the CFG analyzer into its default_ad_stack_size=256 fallback, so n[None] = 512 overflowed at the next qd.sync(). After this PR the compiler captures the loop bound as a SizeExpr, the launcher evaluates it against n[None] at launch time, and the backing heap is grown to exactly 512 + 2 slots.

Why

Reverse-mode AD through a dynamic loop requires an adstack whose depth equals the forward trip count plus two setup pushes. Until now Quadrants had to resolve that depth at compile time, so every loop bound that involved a field load (n_bodies, n_dofs, an ndarray shape, an enclosing parallel-for index) fell back to the user-configured default_ad_stack_size. For realistic reverse-mode kernels this is both:

  • Wrong-too-small: a 14-dof mechanism stepping through an ndrange(n_bodies, n_dofs) easily needs >256 slots, triggering a runtime QuadrantsAssertionError: Adstack overflow.
  • Wrong-too-big: bumping the default to cover worst case inflates every adstack in the program, including the ones the compiler could prove were tiny, and scales linearly with the dispatched ndrange on GPU.

The traditional workaround - tune default_ad_stack_size per program - is too coarse: you cannot set a different default for every adstack, and bumping it globally multiplies GPU memory by the number of threads. This PR replaces the static fallback with a symbolic upper-bound expression captured at compile time and reduced per launch against the live field / ndarray state, so each adstack gets exactly the depth that dispatch needs.
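The per-thread scaling is easy to see with back-of-envelope arithmetic (the numbers below are illustrative, not taken from the PR): the adstack heap is allocated per thread, so a globally bumped default multiplies with the dispatched thread count, while an exactly sized stack pays only its true depth.

```python
# Illustrative cost of a global default vs. exact per-launch sizing.
# The heap is per-thread: total bytes = slots * element_bytes * threads.
element_bytes, threads = 4, 1_000_000        # hypothetical f32 adstack, 1M GPU threads

global_default = 4096                        # default bumped to cover the worst case
exact_depth = 16                             # depth the compiler could prove for this stack

heap_default = global_default * element_bytes * threads  # ~16.4 GB
heap_exact = exact_depth * element_bytes * threads       # 64 MB
```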

Entry point

LlvmRuntimeExecutor::publish_adstack_metadata(const AdStackSizingInfo &ad_stack, std::size_t num_threads, LaunchContextBuilder *ctx, void *device_runtime_context_ptr = nullptr) at quadrants/runtime/llvm/llvm_runtime_executor.cpp. Called from cpu/cuda/amdgpu/kernel_launcher.cpp right before every task's dispatch. It:

  1. Walks the task's size_exprs vector (post-order SerializedSizeExpr trees, one per AdStackAllocaStmt).
  2. Evaluates each tree via adstack_size_expr_eval.cpp, reading scalar field values through SNodeRwAccessorsBank and launch-context ndarray shapes / scalar reads through the LaunchContextBuilder. The AMDGPU path threads the launcher's device-side RuntimeContext staging copy (kernel_launcher.cpp:185-186) through device_runtime_context_ptr so HIP kernels that resolve ExternalTensorRead leaves against ndarray pointers do not fault on a host pointer; CUDA relies on UVA and passes nullptr; CPU ignores the argument.
  3. Computes per-thread stride, per-alloca offsets, and per-alloca max-sizes.
  4. Writes these into LLVMRuntime::adstack_{per_thread_stride,offsets,max_sizes} via runtime_get_adstack_metadata_field_ptrs so generated code reads them at every push / load-top site.
  5. Grows the per-thread heap backing buffer via ensure_adstack_heap(per_thread_stride * dispatched_threads).

Shapes the grammar covers are sized exactly. Shapes outside the grammar (e.g. a range(field[i]) / range(ndarray[i]) indexed by a loop-carried variable that the structural walker does not fold) fall back to default_ad_stack_size and emit a QD_WARN naming the source location so the user can either restructure the loop or extend the grammar.
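The stride / offset / heap-size arithmetic in steps 3-5 can be modelled in a few lines of Python (a sketch only; the function and variable names here are hypothetical, and the real logic lives in `publish_adstack_metadata`):

```python
def layout_adstacks(max_depths, element_sizes, num_threads):
    """Lay each thread's adstacks out contiguously (indexed by stack_id).

    Returns (per_thread_stride, per-alloca offsets, total heap bytes),
    mirroring what the host writes into the runtime metadata fields.
    """
    offsets, stride = [], 0
    for depth, elem in zip(max_depths, element_sizes):
        offsets.append(stride)   # this stack starts where the previous one ended
        stride += depth * elem   # reserve `depth` slots of `elem` bytes each
    return stride, offsets, stride * num_threads

# Two adstacks: one sized 512 + 2 at launch time, one provably tiny (depth 8).
stride, offsets, heap_bytes = layout_adstacks([514, 8], [4, 4], num_threads=1024)
```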

Mechanism end-to-end

1. Compile-time SizeExpr capture

quadrants/transforms/determine_ad_stack_size.cpp runs a Bellman-Ford-first, structural-walker-second pre-pass that attempts to express each AdStackAllocaStmt's max depth as a SizeExpr tree. Grammar covered:

| Leaf | Source | Lowering |
|---|---|---|
| Const | integer literal | `SizeExpr::make_const` |
| FieldLoad | `GlobalLoadStmt` of a scalar i32 / i64 SNode | `make_field_load` with the SNode pointer + constant index path |
| NdarrayShape | `ExternalTensorShapeAlongAxisStmt` | `make_ndarray_shape(arg_id, axis)` |
| NdarrayRead | scalar read from an integer ndarray | `make_ndarray_read(arg_id, index_path)` |
| LoopIndexMax | `LoopIndexStmt` of an enclosing bounded range-for / parallel-for | `make_loop_index_max` (bounded by the loop's end or a max over every push site) |
| StackLoadTopMax | `AdStackLoadTopStmt` of a stash adstack | `make_stack_load_top_max` (bounded by a max over every push site) |

Inner operators: add, sub, mul (const * anything via Add-replication), max. expr_sub's MaxOverRange fusion caps the fused iteration range at min(shape_a, shape_b) (expressed in-grammar as a + b - max(a, b)) so the fused body cannot read the shorter ndarray past its buffer end when a caller passes mismatched-length ndarrays. Anything else falls back to default_ad_stack_size and emits the QD_WARN noted above. SizeExpr objects are immutable after the pre-pass writes them and are serialised as post-order SerializedSizeExpr::nodes so offline-cache hits evaluate the same way as fresh compiles.
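The end cap relies on the identity min(a, b) = a + b - max(a, b), which keeps it inside the grammar's add / sub / max operator set. A quick exhaustive check over small integers:

```python
# min is expressible with add, sub and max only - the identity behind the
# in-grammar MaxOverRange end cap min(shape_a, shape_b).
def in_grammar_min(a, b):
    return a + b - max(a, b)

for a in range(-8, 9):
    for b in range(-8, 9):
        assert in_grammar_min(a, b) == min(a, b)
```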

2. Codegen reads stride / offset / max-size from runtime metadata

| File | Change |
|---|---|
| codegen_llvm.cpp | `AdStackAllocaStmt` base addr = heap + linear_tid * runtime->adstack_per_thread_stride + runtime->adstack_offsets[stack_id] (all three loaded at runtime) |
| codegen_llvm.cpp | `AdStackPushStmt` lowers to stack_push(runtime, stack, runtime->adstack_max_sizes[stack_id], element_size) (cap is a runtime load, not an IR immediate) |
| init_offloaded_task_function | pre-scan assigns `AdStackAllocaStmt::stack_id` in declaration order; `ad_stack_offsets_` became `std::vector<std::size_t>` indexed by stack_id |

The heap base address is unchanged; only the per-thread stride and per-alloca offset math moved from codegen-time immediates to runtime loads, so kernels compiled with the old flow must recompile but their semantics are unchanged.

Note: default_ad_stack_size and ad_stack_size are ignored on LLVM for adstacks whose bound is inside the grammar. They remain in qd.init() for SPIR-V backwards compatibility and for the warn-and-fallback path where the pre-pass cannot resolve the loop shape symbolically.

3. Host evaluator

adstack_size_expr_eval.{h,cpp} walks the flat post-order tree:

  • Field loads go through the existing SNodeRwAccessorsBank to read the current scalar value on the host side.
  • Ndarray shape / scalar reads go through LaunchContextBuilder which captures both the pointer and the shape on kernel launch.
  • LoopIndexMax and StackLoadTopMax fold to whichever launch-time bound they were captured against.
  • Arithmetic is saturating int64_t.

The evaluator runs strictly on the host, before the kernel dispatches, so there is no device-side evaluator to keep in sync.
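A minimal Python model of that walk (node encoding and names are invented for illustration; the real evaluator is the C++ in adstack_size_expr_eval.cpp): post-order means a single pass with a value stack suffices, and arithmetic clamps at int64 bounds instead of wrapping.

```python
# Toy model of evaluating a flat post-order SizeExpr node list.
INT64_MAX, INT64_MIN = 2**63 - 1, -(2**63)

def sat(v):
    """Saturate to the int64 range rather than wrapping around."""
    return max(INT64_MIN, min(INT64_MAX, v))

def eval_postorder(nodes, read_leaf):
    """nodes: ("const", v) | ("leaf", key) | ("add"|"sub"|"mul"|"max",)."""
    stack = []
    for node in nodes:
        if node[0] == "const":
            stack.append(node[1])
        elif node[0] == "leaf":
            stack.append(read_leaf(node[1]))  # live field / ndarray read
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(sat({"add": a + b, "sub": a - b,
                              "mul": a * b, "max": max(a, b)}[node[0]]))
    return stack[-1]

# n[None] + 2, with n read from live state at launch time:
depth = eval_postorder([("leaf", "n"), ("const", 2), ("add",)],
                       read_leaf=lambda k: {"n": 512}[k])
```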

4. Per-kernel metadata bridge

runtime_get_adstack_metadata_field_ptrs(LLVMRuntime *) in runtime.cpp returns pointers to adstack_{per_thread_stride,offsets,max_sizes}. The host writes into these fields once per dispatch. On CPU the fields are read directly from LLVMRuntime; on CUDA / AMDGPU the bridge triggers a memcpy_host_to_device so the generated code's subsequent loads see the right values. The memcpy is gated on the task actually carrying an adstack: kernels with no reverse-mode adstacks skip publish_adstack_metadata entirely and pay no per-launch cost.

Per-backend coverage matrix

| Backend | Pre-pass | Metadata bridge | Launcher hook | Status |
|---|---|---|---|---|
| CPU (LLVM) | ok | direct field access | cpu/kernel_launcher.cpp | auto-sized inside the grammar; default_ad_stack_size fallback + QD_WARN outside |
| CUDA (LLVM) | ok | memcpy_host_to_device | cuda/kernel_launcher.cpp | same |
| AMDGPU (LLVM) | ok | memcpy_host_to_device (host pointer routed through the launcher's device-side RuntimeContext staging copy) | amdgpu/kernel_launcher.cpp | same |
| Metal (SPIR-V) | N/A here | N/A here | N/A here | untouched - still compile-time, handled by the stacked SPIR-V sizer PR |
| Vulkan (SPIR-V) | N/A here | N/A here | N/A here | untouched - still compile-time, handled by the stacked SPIR-V sizer PR |

Tests

tests/python/test_adstack.py

  • test_adstack_field_load_bounded_loop_evaluated_per_launch - pins the core guarantee: a kernel with for _ in range(n[None]) bound by an i32 scalar field. Launches at n_iter in (1, 20, 50) and asserts the heap resized correctly (no overflow at 50, no over-allocation at 1), with the decorator set to default_ad_stack_size=2 to prove the knob is bypassed when the grammar resolves the bound symbolically.
  • test_adstack_near_capacity - rewritten to exercise both sides of the old default_ad_stack_size=32 boundary at n_iter in (30, 100) without any knob override. Arch-restricted to qd.cpu until the SPIR-V sizer PR lands.
  • test_adstack_structural_pre_pass_fuses_sub_of_max_over_range_with_matching_shape_ends / ..._with_mismatched_shape_ends - pin the expr_sub MaxOverRange fusion for both the strict-equality and the cross-ndarray same-axis paths.
  • test_adstack_fuses_sub_of_max_over_range_with_mismatched_lengths_is_safe - pins the min(shape_a, shape_b) end cap so the cross-ndarray fusion stays in-bounds when the two ndarrays have different lengths at launch time (pre-fix the fused body read the shorter ndarray past its buffer end).
  • test_adstack_inner_range_bounded_by_ndarray_read_at_outer_index / test_adstack_inner_range_bounded_by_multidim_ndarray_read / test_adstack_ext_tensor_read_indexed_by_stashed_outer_loop_var - cover the ExternalTensorRead grammar leaves, including multi-axis stride handling and the stash-then-reload pattern.

tests/cpp/transforms/determine_ad_stack_size_test.cpp

  • SizeExprBuildFromRangeFor - const bounds fold to a Const leaf.
  • SizeExprBuildFromFieldLoad - scalar i32 field load builds a FieldLoad leaf with the right SNode + index path.
  • SizeExprBuildFromLoopIndex - inner for j in range(i) under an enclosing bounded loop folds to LoopIndexMax.
  • SizeExprRejectsUnsupportedOp - negative control that an unsupported operator routes through the warn-and-fallback path.

Deliberate coverage gap: no SPIR-V cross-check here - the SPIR-V runtime metadata path lives in the stacked SPIR-V sizer PR and its tests cover the shader-side reading.

Side-effect audit

| Concern | Where checked | Verdict |
|---|---|---|
| Offline cache key | quadrants/analysis/offline_cache_util.cpp serialises `SerializedSizeExpr::nodes` alongside the existing size_in_bytes per AdStackAllocaStmt | ok - cache-hit reloads the same tree; evaluator runs the same on fresh + cache-hit. Empty-tree caches (pre-PR offline caches) fall back to default_ad_stack_size. |
| Stmt clone / serialization | statements.h: `AdStackAllocaStmt::size_expr` is a `std::shared_ptr<SizeExpr>` cloned via the default QD_DEFINE_ACCEPT_AND_CLONE | ok - SizeExpr is immutable post pre-pass; shared pointer is safe to share across clones |
| IR printer | ir/ir.cpp `AdStackAllocaStmt::repr` prints stack_id and a short SizeExpr summary | ok - IR dump readable |
| Codegen non-feature path | `AdStackAllocaStmt::stack_id == -1` falls through to the old default_ad_stack_size-backed path | ok - kernels compiled without the pre-pass writing a SizeExpr still behave as before |
| Whole-kernel CSE | SizeExpr is stored on AdStackAllocaStmt, not emitted as IR operands, so it does not participate in CSE | ok |
| LaunchContextBuilder | extended to capture scalar + shape for every ndarray arg; no change to the existing argument-pack layout - new entries appended | ok - old kernels still build their context the same way |
| Runtime initialization | adstack_{per_thread_stride,offsets,max_sizes} initialised to zero / null in runtime_initialize; kernels with no adstacks read nothing from them and skip publish_adstack_metadata entirely | ok - zero-adstack kernels pay nothing |
| Multi-threaded CPU launch | metadata is written once per dispatch before the thread pool starts; no concurrent writers | ok |
| CUDA / AMDGPU memcpy_host_to_device | one memcpy per adstack-bearing task per launch; measured overhead ~microseconds. Forward-only kernels and reverse kernels whose pre-pass resolves all bounds at compile time are exempt. | acceptable; would be visible only on workloads that launch thousands of tiny autodiff kernels per second |
| Mismatched-length ndarray fusion | expr_sub's cross-ndarray fusion caps the fused end at min(shape_a, shape_b) | ok - pre-fix UB on mismatched shapes; post-fix both reads stay in-bounds |

@duburcqa duburcqa changed the base branch from duburcqa/runtime_adstack_capacity to duburcqa/moltenvk_sdk_source April 23, 2026 05:38
@duburcqa duburcqa changed the title [AutoDiff] Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr on LLVM backends (CPU/CUDA/AMDGPU) [AutoDiff] Autodiff 16: Resolve reverse-mode adstack depths per-launch via runtime-evaluated SizeExpr on LLVM backends (CPU/CUDA/AMDGPU) Apr 23, 2026
@duburcqa duburcqa marked this pull request as draft April 23, 2026 05:40

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc3bf3f5bb


Comment thread quadrants/program/adstack_size_expr_eval.cpp Outdated
Comment thread quadrants/runtime/llvm/runtime_module/runtime.cpp Outdated
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 453d0a8 to 416d869 Compare April 23, 2026 05:43
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from bc3bf3f to b757b5a Compare April 23, 2026 05:43
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 416d869 to f213cd6 Compare April 23, 2026 05:54
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from b757b5a to 4b1b402 Compare April 23, 2026 05:54
Comment thread quadrants/runtime/llvm/llvm_runtime_executor.cpp Outdated
Comment thread quadrants/program/adstack_size_expr_eval.cpp Outdated
Comment thread quadrants/runtime/llvm/runtime_module/runtime.cpp Outdated
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch 2 times, most recently from babc027 to 5182192 Compare April 23, 2026 06:12
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 6c6d25d to 14ae74f Compare April 23, 2026 07:57
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch 3 times, most recently from b472593 to 8905b76 Compare April 23, 2026 09:33
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 67c2eb7 to 44679dd Compare April 23, 2026 10:53
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from 8905b76 to d8773c9 Compare April 23, 2026 10:56
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 44679dd to 8123ff1 Compare April 23, 2026 11:45
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from d8773c9 to a4e514f Compare April 23, 2026 11:46
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 8123ff1 to 526292d Compare April 23, 2026 12:56
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from a4e514f to 1a62fd4 Compare April 23, 2026 12:56
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 526292d to 860a914 Compare April 23, 2026 13:15
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from 1a62fd4 to 595138f Compare April 23, 2026 13:15
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 860a914 to e9091bc Compare April 23, 2026 13:17
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from 595138f to 8a95d10 Compare April 23, 2026 13:17
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from e9091bc to 63ebc09 Compare April 23, 2026 13:19
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from 8a95d10 to 04afb7b Compare April 23, 2026 13:19
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 63ebc09 to 8fdb352 Compare April 23, 2026 15:01
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from fcb659e to 8a4f62e Compare April 23, 2026 15:01
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 8fdb352 to 69fffa2 Compare April 23, 2026 15:35
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 5b6aec9 to 59101da Compare April 24, 2026 14:00
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from bad2fb0 to 2e2754c Compare April 24, 2026 14:59
duburcqa added a commit that referenced this pull request Apr 24, 2026
… drop exhaustive arch lists

With the AMDGPU device-context fix from PR #550 in the stack, the
runtime-eval sizer path now runs cleanly on every backend. Two tests
introduced in this PR (`bounded_inner_loop_sized_by_structural_prepass`
and `spirv_metadata_per_task_buffer`) get their arch lists extended to
include AMDGPU, and the few `[qd.cpu, qd.cuda, qd.amdgpu, qd.metal,
qd.vulkan]` lists already inherited from the rebase (after AMDGPU landed
in PR #550 and Metal/Vulkan landed in `2d9f6328d`) collapse to the
default (no `arch=` arg) since that enumerates every supported backend
anyway.

Also drop the redundant `ad_stack_experimental_enabled=True` on
`spirv_metadata_per_task_buffer`: `test_utils.test` sets that option
unconditionally whenever `qd.extension.adstack` is in `require`, and
carrying it on the decorator confuses readers into thinking the test
needs it specifically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@duburcqa duburcqa force-pushed the duburcqa/moltenvk_sdk_source branch from 59101da to 2859a1d Compare April 24, 2026 15:17
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from 2e2754c to 0be734d Compare April 24, 2026 15:17
Comment thread quadrants/program/adstack_size_expr_eval.cpp
Comment thread quadrants/runtime/llvm/runtime_module/runtime.cpp
Comment thread quadrants/transforms/determine_ad_stack_size.cpp
Base automatically changed from duburcqa/moltenvk_sdk_source to main April 24, 2026 17:24
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from 0be734d to ebce54e Compare April 24, 2026 18:07
Comment on lines +313 to +318
- Adstack overflow does not exist on LLVM backends for kernels the pre-pass can bound (every shape in the grammar listed under "Under the hood: adstack capacity and memory"). On SPIR-V backends the compile-time-sized capacity can still be exceeded at runtime; overflow is reported as a Python-level exception asynchronously - the offending kernel writes to a host-polled SSBO flag during execution, and the next `qd.sync()` (explicit, or implicit via a host read like `to_numpy()` / `to_torch()`) reads the flag and raises. This follows the same pattern as CUDA async errors so every launch does not pay a per-launch sync. If you want the exception to land exactly at the offending kernel rather than at the next sync, call `qd.sync()` right after the kernel.
- Adstack trades compile time for generality. Kernels with many loop-carried variables, nested dynamic loops, or large inner-loop bodies produce visibly slow compile times - seconds stretching into minutes, and on SPIR-V backends sometimes into the territory where the driver's shader compiler gives up. Budget compile-time accordingly when migrating existing reverse-mode AD workloads.
- Reverse-mode AD does not propagate gradients through integer casts or non-real operations. No error is raised; the gradient simply stops at the cast and silently reads as zero upstream. Cast to `qd.f32` / `qd.f64` before the differentiable section.
- Backward passes on non-trivial kernels run noticeably slower than the corresponding forward pass, sometimes by an order of magnitude on SPIR-V.
- **Loop bounds read from a writable ndarray are unsafe.** If a reverse-mode kernel has a loop whose iteration count comes from `n[j]`, the sizer snapshots `n[j]` at backward-dispatch entry. Any scenario that makes that snapshot diverge from the value the kernel body ends up executing against drives the adstack undersized and the gradient silently comes out zero or wrong. Two patterns to avoid: (a) the same kernel writes `n[j]` before the loop reads it (`for i in range(n[j])` after `n[j] = something`); (b) across kernels, where `kernel_A` writes `n` and `kernel_B.grad()` reads it as a loop bound - the per-launch numpy-ndarray upload can overwrite the device copy with host-side state that differs from what `kernel_A` left behind, and for `qd.ndarray` a missing host-side barrier between the two launches has the same effect. Quadrants does not detect either pattern. Workaround: treat ndarrays the reverse-mode kernel reads as loop bounds as read-only within and across the relevant kernel chain - compute iteration counts once, store them in a dedicated scalar field or ndarray that is never written during the backward path, and iterate over that.
- **Loop bounds that read an ndarray the same kernel writes are unsafe.** If a reverse-mode kernel has a loop whose iteration count comes from `n[j]` and the same kernel also writes to `n[j]` before that loop runs (`for i in range(n[j])` after `n[j] = something`), the computed gradient may silently come out zero or wrong. Quadrants does not currently detect this. Workaround: put iteration counts in a separate ndarray (or scalar field) the kernel only reads, or split the write and the differentiable loop into two `@qd.kernel` calls. The same caveat applies across kernels - a kernel that writes an ndarray read as a loop bound by a later `.grad()` call needs the host-side contents of that ndarray to already hold the value the backward kernel will execute against.
- **Deeply nested dynamic-range loops can still trip an adstack overflow at runtime even when every loop bound is inside the pre-pass grammar.** The compiler builds a symbolic upper-bound expression for each adstack and evaluates it at dispatch time against the live ndarray contents; on kernels with many levels of nested `for i in range(arr[...])` feeding cumulative-index patterns the expression can be mathematically tighter than the actual push count. Symptom is `Adstack overflow (offending stack_id=N)` on the next `qd.sync()`. Workaround until the pre-pass grammar is extended: shorten the reverse kernel's loop nest, precompute the worst-case inner trip into a scalar field the kernel only reads, or split the inner section into its own `@qd.kernel`.


🟡 The Known Limitations section at docs/source/user_guide/autodiff.md:313 ("Adstack overflow does not exist on LLVM backends for kernels the pre-pass can bound (every shape in the grammar listed under "Under the hood: adstack capacity and memory")") directly contradicts line 318 in the same bullet list ("Deeply nested dynamic-range loops can still trip an adstack overflow at runtime even when every loop bound is inside the pre-pass grammar") — both explicitly reference shapes inside the grammar so they cannot both hold. Two supporting claims earlier in the same doc are also false relative to quadrants/transforms/determine_ad_stack_size.cpp:1370-1376, which only emits QD_WARN and falls back to alloca->max_size = config.default_ad_stack_size for unresolved grammar shapes: line 256 ("Loop shapes outside this grammar raise a compile error ... silent over-allocation is not possible") and line 258 ("default_ad_stack_size and ad_stack_size are ignored on LLVM"). Doc-only, low blast radius but user-misleading; the author already committed to a docs fix in resolved comment #3137465122 and the active hughperkins threads (#3138031879 / #3138006046 / #3138009751) are going to touch this section anyway, so it's worth rewriting the four lines to describe the actual warn-and-fallback semantics rather than shipping an internal contradiction.

Extended reasoning...

What the bug is

docs/source/user_guide/autodiff.md contains an internal contradiction inside the Known Limitations bullet list that ships with this PR:

  • Line 313: "Adstack overflow does not exist on LLVM backends for kernels the pre-pass can bound (every shape in the grammar listed under "Under the hood: adstack capacity and memory")."
  • Line 318: "Deeply nested dynamic-range loops can still trip an adstack overflow at runtime even when every loop bound is inside the pre-pass grammar."

Both bullets explicitly reference shapes inside the grammar. They cannot both be true.

Two other load-bearing claims earlier in the document are also false relative to the code this PR ships:

  • Line 256: "Loop shapes outside this grammar raise a compile error naming the source location of the offending adstack alloca; silent over-allocation is not possible."
  • Line 258: "default_ad_stack_size and ad_stack_size are ignored on LLVM."

The code path

quadrants/transforms/determine_ad_stack_size.cpp runs Bellman-Ford + structural SizeExpr pre-pass; any alloca still unresolved afterwards falls into the branch at lines 1362-1376:

```cpp
const int fallback = std::max<int>(config.default_ad_stack_size, 1);
QD_WARN(
    "adstack bound at {} is unresolved after Bellman-Ford + structural pre-pass{}; falling back to"
    " default_ad_stack_size={}.{} ...",
    alloca->get_last_tb(), cycle_note, fallback, dump_hint);
alloca->max_size = static_cast<std::size_t>(fallback);
alloca->size_expr = SizeExpr::make_const(static_cast<int64_t>(fallback));
```

That is QD_WARN (not QD_ERROR) and max_size = default_ad_stack_size. The in-code comment at lines 1346-1354 explicitly says hard-erroring was rolled back because it broke test_double_for_loops / test_double_for_loops_more_nests, and that the runtime overflow check still fires at qd.sync() if the fallback is exceeded. So:

  1. Line 256 is wrong: there is no compile error and silent over-allocation is exactly what the fallback does.
  2. Line 258 is wrong: default_ad_stack_size is the load-bearing knob in that fallback path on LLVM.
  3. Line 313 is wrong (or, equivalently, line 318 is wrong): the grammar-covered-but-pre-pass-gave-up case exists and is a runtime overflow path.

Step-by-step proof

  1. User writes a reverse-mode kernel with a shape the grammar lists as supported in the "Under the hood" section (e.g. a FieldLoad-bounded loop like for j in range(field[i]) where i is a loop-carried variable — the comment at determine_ad_stack_size.cpp:1353 calls out exactly this pattern as "the grammar does not (yet) cover those shapes").
  2. Bellman-Ford leaves alloca->max_size == 0 (positive loop).
  3. Structural build_value_expr returns nullptr (grammar gap inside a shape the user-facing docs called supported).
  4. The fallback loop at lines 1362-1376 fires: QD_WARN is emitted (trivial to miss), alloca->max_size = 256 (the default), compile succeeds.
  5. At launch, publish_adstack_metadata writes 256 into adstack_max_sizes[i]. The actual kernel depth exceeds 256, trips runtime->adstack_overflow_flag, qd.sync() raises QuadrantsAssertionError.
  6. This manifests the behavior line 318 admits. Line 313 promised it could not.

Impact

Docs-only, low blast radius for existing users (no code break), but actively misleading for any reader trying to reason about whether they need default_ad_stack_size on LLVM. A user reading line 258 will not set the knob; a user reading line 313 will not account for overflow paths that actually exist in the same kernels. The internal 313-vs-318 contradiction is also a self-refuting artifact in the same bullet list — reading top-to-bottom the user cannot tell which bullet to trust.

Addressing the duplicate refutation

One verifier flagged this as a duplicate of the earlier inline-comment 3137066389 ("previous_5"). That earlier comment was marked resolved at #3137465122 with the author saying the docs were updated to match the warn-and-fallback semantics. Reading the updated text at lines 246/256/258/313/318 in this PR's current diff, the update was incomplete: lines 256 / 258 still carry the old "compile error" / "ignored on LLVM" claims, and the 313 / 318 pair is a new internal contradiction introduced by the same rewrite. So this is not a no-op re-submission — the three docs-vs-code mismatches are still live and the self-contradiction is specific to this PR's text. Re-reporting as nit rather than normal because the direction of the fix (doc rewrite) is already accepted per the author's own resolution comment, and because hughperkins' active review threads (#3138031879 flags exactly line 313, #3138009751 asks to move SizeExpr/grammar details out of user-facing docs, #3138006046 flags the whole advanced section) are going to touch these lines anyway — the report is a specific concrete contradiction to resolve when that rewrite lands, not a fresh direction to take.

How to fix

Rewrite lines 246 / 256 / 258 / 313 to describe the actual warn-and-fallback semantics, matching the intent of line 318:

  • Line 246: drop the "no runtime overflow path to hit" claim — the runtime overflow path exists for grammar gaps.
  • Line 256: replace "raise a compile error ... silent over-allocation is not possible" with "emit a QD_WARN at compile time naming the source location and fall back to default_ad_stack_size; runtime overflow still surfaces at the next qd.sync() if the fallback is exceeded".
  • Line 258: change "ignored on LLVM" to "still used on LLVM for adstacks whose bound is outside the SizeExpr grammar (the warn-and-fallback path)".
  • Line 313: either remove it entirely or rephrase to match line 318 (runtime overflow exists for the grammar-gap fallback path; it does not exist for shapes the pre-pass can symbolically bound).

Alternatively, per hughperkins' review comments, move the SizeExpr grammar / sizing detail out of the user-facing guide and into an internal README, which makes the precision question moot at the user-facing surface.

🔬 also observed by verify-runtime

Comment on lines +593 to +597
// Two `GlobalLoadStmt` source shapes matter: before `lower_access` the source is a `GlobalPtrStmt` that carries
// the snode and the raw index stmts; after `lower_access` it becomes a `GetChStmt` rooted in a chain of
// `SNodeLookupStmt` / `GetChStmt` / `GetRootStmt`. `determine_ad_stack_size` runs after `lower_access` in the
// normal pipeline (the structural pre-pass used to accept only the pre-lowering shape and therefore gave up on
// every scalar field load), so we walk both forms here.

🟡 Two stale comments and a now-unreachable branch from this PR's earlier reorder of the adstack pre-pass. (1) In quadrants/transforms/determine_ad_stack_size.cpp:593-597, the GlobalLoadStmt comment says the pre-pass "runs after lower_access in the normal pipeline" — the opposite is true now that it runs at compile_to_offloads.cpp:161 (before offload_to_executable's lower_access at line 268), and the GetChStmt sub-branch at lines 746-858 is dead code in the normal pipeline since GetChStmt is only produced by lower_access. (2) In tests/python/test_adstack.py:1651-1655, the "CPU leg is load-bearing for the chunk-loop guard" rationale is stale for the same reason — make_cpu_multithreaded_range_for runs at compile_to_offloads.cpp:210 (inside offload_to_executable), strictly after the pre-pass. Nit-level: no functional impact today, but the comment is actively misleading and the dead branch will bit-rot; fix by deleting the GetChStmt branch, updating the GlobalLoadStmt comment to state the pre-pass operates only on pre-lower_access IR, and dropping the chunk-loop rationale from the test comment (the alloca_scope_chain guard in compute_bounded_adstack_size is still justified by the stack_init per-entry reset invariant).

Extended reasoning...

What the stale comments and dead code claim vs. what the pipeline actually does

This PR (commit adding determine_ad_stack_size to compile_to_offloads) moved the adstack pre-pass from offload_to_executable to the end of compile_to_offloads. Three artifacts of that move were left behind:

1. Stale in-code comment at determine_ad_stack_size.cpp:593-597:

// Two `GlobalLoadStmt` source shapes matter: before `lower_access` the source is a `GlobalPtrStmt` ...
// after `lower_access` it becomes a `GetChStmt` ... `determine_ad_stack_size` runs after `lower_access`
// in the normal pipeline ... so we walk both forms here.

The factual claim "determine_ad_stack_size runs after lower_access in the normal pipeline" is now false. Evidence:

  • compile_to_offloads.cpp:161 (inside compile_to_offloads) calls irpass::determine_ad_stack_size(ir, config).
  • compile_to_offloads.cpp:268 (inside offload_to_executable) calls irpass::lower_access(...).
  • compile_to_executable runs compile_to_offloads first and offload_to_executable second (via KernelCodeGen::compile_kernel_to_module's sequencing).

So on the normal pipeline, the pre-pass sees pre-lower_access IR. The file's own comment at compile_to_offloads.cpp:288-292 acknowledges the move explicitly, keeping a `(void)determine_ad_stack_size;` reference because the old call site inside offload_to_executable was removed.

2. Dead GetChStmt sub-branch at determine_ad_stack_size.cpp:746-858:

GetChStmt nodes are only produced by lower_access (e.g. lower_access.cpp push_back<GetChStmt> sites). Since the pre-pass now runs strictly before lower_access in every production code path, load->src->cast<GetChStmt>() at line 746 always returns nullptr and the entire ~112-line branch is unreachable. The C++ unit tests in tests/cpp/transforms/determine_ad_stack_size_test.cpp construct IR via IRBuilder directly and never invoke lower_access, so this branch has zero test coverage.

3. Stale test comment at tests/python/test_adstack.py:1651-1655:

# The CPU leg is load-bearing for the chunk-loop guard: the grad-kernel's CPU-multithreaded outer
# `range_for` wraps the user body with `min(N, (thread_idx << k) + chunk_size)`-style bounds,
# and the structural pre-pass must stop walking at the `AdStackAllocaStmt`'s own scope ...
# or the outer chunk loop's unparseable bounds would leak into the multiplier and flip the size to unresolved.

make_cpu_multithreaded_range_for runs at compile_to_offloads.cpp:210 (inside offload_to_executable), which executes strictly after compile_to_offloads's pre-pass call. So at the point determine_ad_stack_size runs, the min(N, (thread_idx << k) + chunk_size) chunk wrapper the comment describes does not yet exist in the IR. The file itself spells this out in the same pass at determine_ad_stack_size.cpp:427-429 — "the user's original RangeForStmt bounds are still visible here and the chunk wrapper's min(end_stmt, bit_shl-add) shape never appears in this pre-pass's input IR."
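The pipeline-order claim underpinning all three items can be modeled in a few lines (a toy sketch with invented names, not project code): a pre-pass that runs before lower_access can never observe a node kind that only lower_access produces.

```python
# Toy sketch (all names invented): model the pass ordering to show that a
# pre-pass running before lower_access can never observe a node kind that
# only lower_access produces.

def lower_access(ir):
    # In this model, lower_access is the only pass that introduces GetChStmt.
    return ir + ["GetChStmt"]

def determine_ad_stack_size(ir, seen):
    # The pre-pass records every load-source shape it actually observes.
    seen.update(ir)
    return ir

def compile_to_executable(ir):
    seen = set()
    ir = determine_ad_stack_size(ir, seen)  # compile_to_offloads: runs first
    ir = lower_access(ir)                   # offload_to_executable: runs later
    return seen

observed = compile_to_executable(["GlobalPtrStmt"])
assert "GlobalPtrStmt" in observed
assert "GetChStmt" not in observed  # so a GetChStmt branch in the pre-pass is dead
```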

Why this is worth filing despite no functional impact

Severity is clearly nit — all six verifiers agreed on that; none of the three items affects correctness today. But:

  • The in-code comment at 595-596 is load-bearing for a future reader's mental model of when the pass runs relative to lower_access. A maintainer debugging the pass will consult this comment before any external docs, and the current wording sends them looking in the wrong direction.
  • The GetChStmt branch is ~112 lines of unreachable code that (a) adds maintenance burden to any future refactor of the GlobalLoadStmt handler, (b) has no test coverage, and (c) will bit-rot silently because no CI run exercises it. If the pass is ever re-added at a post-lower_access position, the branch should be re-derived against the then-current IR shape rather than restored verbatim from a drifting copy.
  • The test comment compounds the misleading documentation: a reader seeing the CPU gate thinks the test is specifically covering the chunk-loop interaction, when in fact it is covering the field-load sizing and the CPU gate is just a follow-up-PR scoping restriction.

Step-by-step proof

  1. Open compile_to_offloads.cpp. Line 161 (inside compile_to_offloads): irpass::determine_ad_stack_size(ir, config);. Line 210 (inside offload_to_executable): irpass::make_cpu_multithreaded_range_for(ir, config);. Line 268 (inside offload_to_executable): irpass::lower_access(...);.
  2. Search KernelCodeGen::compile_kernel_to_module / compile_to_executable: compile_to_offloads is invoked before offload_to_executable.
  3. Therefore when determine_ad_stack_size runs, lower_access has not yet executed and make_cpu_multithreaded_range_for has not yet executed.
  4. GetChStmt is only produced by lower_access (grep push_back<GetChStmt> → hits are in lower_access.cpp).
  5. Therefore GlobalLoadStmt::src is never GetChStmt during this pre-pass — the branch at determine_ad_stack_size.cpp:746-858 is unreachable.
  6. Therefore the outer chunk-loop wrapper the test comment describes does not exist at pre-pass time — that specific rationale for the alloca_scope_chain guard no longer applies (though the guard itself is still sound for the stack_init reset reason).
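The surviving rationale for the scope guard (step 6) can be sketched directly. This is a hypothetical model: `bounded_size` and its arguments are invented for illustration and are not the real compute_bounded_adstack_size signature.

```python
# Hypothetical model: the adstack is re-initialized (stack_init) on every entry
# of its enclosing scope, so only loops nested at or inside that scope compound
# the per-entry depth; outer loops re-enter the scope and must not multiply in.

def bounded_size(loop_trip_counts, alloca_scope_depth):
    """Loops at depth >= alloca_scope_depth sit inside the alloca's own scope
    and compound its per-entry depth; shallower loops re-enter the scope and
    therefore reset the stack instead of deepening it."""
    size = 1
    for depth, trips in enumerate(loop_trip_counts):
        if depth >= alloca_scope_depth:
            size *= trips
    return size

# Outer 10-trip loop re-enters the alloca's scope (per-entry reset), so only
# the inner 512-trip loop contributes:
assert bounded_size([10, 512], alloca_scope_depth=1) == 512
# Without the scope guard, the outer loop would leak into the multiplier:
assert bounded_size([10, 512], alloca_scope_depth=0) == 5120
```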

How to fix

  • quadrants/transforms/determine_ad_stack_size.cpp:593-597: rewrite the comment to state the pre-pass operates only on pre-lower_access IR (so GlobalLoadStmt::src is always GlobalPtrStmt or ExternalPtrStmt / GlobalTemporaryStmt).
  • quadrants/transforms/determine_ad_stack_size.cpp:746-858: delete the GetChStmt branch entirely; if the pass is ever re-added post-lower_access, re-derive against the then-current IR shape.
  • tests/python/test_adstack.py:1651-1655: drop the chunk-loop rationale. The alloca_scope_chain guard in compute_bounded_adstack_size is still justified by the AdStackAllocaStmt::stack_init reset-per-entry invariant (the walker must stop at the alloca's own scope so outer loops do not compound per-iteration depth).

Comment on lines +564 to +568
// Publish the per-task adstack metadata into the LLVMRuntime struct and size the heap. The codegen path loads
// stride / offset / max_size from these fields at every `AdStack*` site (see `ensure_ad_stack_metadata_llvm` in
// codegen_llvm.cpp), so we must write them before every launch even for tasks where the compile-time and
// launch-time bounds agree. `evaluate_adstack_size_expr` is called only when the symbolic tree is available; the
// offline cache does not currently serialize `SizeExpr`, so cache hits fall back to `max_size_compile_time`.

🟡 The comment at llvm_runtime_executor.cpp:564-568 claims the offline cache does not serialize SizeExpr and cache-hit kernels therefore fall back to max_size_compile_time, but the cache actually round-trips the symbolic tree end-to-end (AdStackSizingInfo::QD_IO_DEF at llvm_compiled_data.h:54-62 includes size_exprs, and both SerializedSizeExpr and SerializedSizeExprNode carry full QD_IO_DEFs). A matching stale comment at llvm_compiled_data.h:28-32 ("after offline-cache load where the symbolic tree is not serialized") directly contradicts the accurate sibling comment at lines 49-52 in the same file. Nit: update both comments to say that SizeExpr round-trips through the cache and that max_size_compile_time is only the fallback for an empty tree (Bellman-Ford-resolved or unresolved kernels).

Extended reasoning...

What the bug is

Two stale in-code comments claim that the offline cache does not serialize SizeExpr, and that a cache hit therefore routes every adstack's sizing back through max_size_compile_time. Neither is true under this PR.

  • quadrants/runtime/llvm/llvm_runtime_executor.cpp:564-568:
    // evaluate_adstack_size_expr is called only when the symbolic tree is available; the
    // offline cache does not currently serialize SizeExpr, so cache hits fall back to
    // max_size_compile_time.
  • quadrants/codegen/llvm/llvm_compiled_data.h:28-32:

    "max_size_compile_time is the compile-time bound (used when size_expr is absent, e.g. after offline-cache load where the symbolic tree is not serialized)"

Both sites carry the same stale claim. The sibling comment on AdStackSizingInfo::size_exprs at llvm_compiled_data.h:49-52 already states the correct behaviour ("The flat post-order form survives the offline cache (an empty nodes vector means 'no symbolic bound captured'..."), so the file is internally self-contradictory.

Verification that SizeExpr does survive the cache

The cache path serializes every field via QD_IO_DEF:

  1. AdStackSizingInfo::QD_IO_DEF at llvm_compiled_data.h:54-62 explicitly lists size_exprs alongside allocas and per_thread_stride.
  2. SerializedSizeExpr at quadrants/ir/adstack_size_expr.h:56 is QD_IO_DEF(nodes).
  3. SerializedSizeExprNode at adstack_size_expr.h:39 is QD_IO_DEF(kind, const_value, snode_id, indices, operand_a, operand_b, var_id, body_node_idx, arg_id_path, arg_shape_axis) — every field of the flat node.
  4. OffloadedTask::ad_stack participates in QD_IO_DEF (via the enclosing task serialization), so the AdStackSizingInfo on each task rehydrates fully from cache.

A cache-hit kernel therefore reloads its size_exprs vector and publish_adstack_metadata runs evaluate_adstack_size_expr on it exactly like a fresh compile. max_size_compile_time is only read when the tree is empty (expr->nodes.empty()), which happens for kernels whose capacity the Bellman-Ford phase resolved structurally without emitting a symbolic tree, and for the warn-and-fallback path. That is a tree-emptiness gate, not a cache-hit gate.
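The round-trip-plus-gate behaviour can be sketched in a few lines (field names modeled on the SerializedSizeExprNode fields quoted above; `serialize`/`deserialize` stand in for the QD_IO_DEF machinery): a field-list serializer rehydrates the flat tree unchanged, and the sizing decision gates on tree-emptiness, not on whether the kernel came from cache.

```python
# Hypothetical sketch: a field-list serializer round-trips the flat SizeExpr
# tree unchanged; the compile-time seed is only used when the tree is empty.
import json

def serialize(nodes):
    return json.dumps(nodes)

def deserialize(blob):
    return json.loads(blob)

def resolve_size(nodes, max_size_compile_time, evaluate):
    # Mirrors the gate described above: an empty nodes vector means
    # "no symbolic bound captured", so fall back to the compile-time seed.
    return evaluate(nodes) if nodes else max_size_compile_time

nodes = [{"kind": "const", "const_value": 512, "operand_a": -1, "operand_b": -1}]
restored = deserialize(serialize(nodes))   # cache store + cache hit
assert restored == nodes                   # the symbolic tree survives the cache
assert resolve_size(restored, 256, lambda n: n[0]["const_value"]) == 512
assert resolve_size([], 256, None) == 256  # empty tree: compile-time fallback
```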

Impact

No runtime behaviour is affected — SizeExpr really does flow through the cache and really is evaluated on cache-hit launches. The defect is purely documentation: a maintainer auditing the cache-hit path (e.g. wondering why a cached kernel still works when bumping default_ad_stack_size, or debugging a per-launch sizer failure they suspect is cache-triggered) will read either stale comment and form an incorrect mental model of what runs when. The two stale sites share copy-pasted language, so the next reader hitting both sees a consistent but wrong picture.

Step-by-step proof

  1. CompiledKernelData serialises via QD_IO_DEF(compiled_data) → LLVMCompiledKernel::QD_IO_DEF(tasks) → OffloadedTask::QD_IO_DEF(..., ad_stack) → AdStackSizingInfo::QD_IO_DEF(..., size_exprs).
  2. On cache load, size_exprs[i] comes back non-empty whenever the pre-pass captured a symbolic tree.
  3. publish_adstack_metadata picks the SerializedSizeExpr *expr out of ad_stack.size_exprs[i], checks !expr->nodes.empty(), and calls evaluate_adstack_size_expr (CPU) or encode_adstack_size_expr_device_bytecode + runtime_eval_adstack_size_expr (GPU).
  4. max_size_compile_time is read only via the if (v < 0) branch (CPU) or root_node_idx < 0 (device), both of which correspond to empty-tree fallback — independent of cache state.
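For completeness, a sketch of what launch-time evaluation of the rehydrated flat tree looks like (node kinds and field names are invented for illustration; the real grammar also carries MaxOverRange / ExternalTensorRead leaves resolved against live launch state):

```python
def evaluate_size_expr(nodes, field_reader):
    """Evaluate a flat post-order expression tree: operands reference earlier
    node indices, and the root is the last node. Returns -1 for any node kind
    outside this toy grammar so the caller can fall back."""
    vals = []
    for node in nodes:
        kind = node["kind"]
        if kind == "const":
            vals.append(node["const_value"])
        elif kind == "field_read":
            # Resolved against live launch state, e.g. n[None] in the PR example.
            vals.append(field_reader(node["snode_id"], tuple(node["indices"])))
        elif kind == "add":
            vals.append(vals[node["operand_a"]] + vals[node["operand_b"]])
        elif kind == "mul":
            vals.append(vals[node["operand_a"]] * vals[node["operand_b"]])
        else:
            return -1  # outside the grammar: caller uses the fallback size
    return vals[-1]

# n[None] + 2, matching the PR description: a scalar field read plus fixed slack.
nodes = [
    {"kind": "field_read", "snode_id": 0, "indices": []},
    {"kind": "const", "const_value": 2},
    {"kind": "add", "operand_a": 0, "operand_b": 1},
]
assert evaluate_size_expr(nodes, lambda snode, idx: 512) == 514
```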

How to fix

Rewrite the llvm_runtime_executor.cpp:564-568 comment to describe the actual gate (tree-emptiness, not cache-hit), and rewrite the llvm_compiled_data.h:28-32 comment to match the accurate sibling at lines 49-52. Two small one-line rewrites; no code change needed.

duburcqa and others added 15 commits April 24, 2026 23:36
…er kernel on AMDGPU

The runtime-eval sizer kernel dereferences `ctx->arg_buffer` on device to
resolve `ExternalTensorRead` leaves against ndarray pointers. CUDA survived
via UVA (the host `&ctx->get_context()` resolves transparently on the
device); AMDGPU/HIP has no UVA for non-pinned host allocations, so the
host pointer faulted inside the kernel and the next DtoH sync raised
`hipErrorIllegalAddress ... while calling memcpy_device_to_host` on every
reverse-mode kernel whose adstack has a non-empty `size_exprs` tree.

The AMDGPU launcher already stages a device-side copy of the
`RuntimeContext` at `kernel_launcher.cpp:185-186`; thread that
`context_pointer` through `publish_adstack_metadata` and use it as the
sizer kernel's `ctx` argument. CUDA / CPU pass `nullptr` and the
function falls back to `&ctx->get_context()`, preserving the existing
UVA / in-process paths there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… tests to cover AMDGPU

Follow-up to the AMDGPU device-context fix: extend the five CUDA-covered
runtime-eval tests (`field_load_bounded_loop_evaluated_per_launch`,
`inner_range_bounded_by_ndarray_read_at_outer_index`,
`ext_tensor_read_indexed_by_stashed_outer_loop_var`,
`structural_pre_pass_fuses_sub_of_max_over_range_with_{matching,mismatched}_shape_ends`)
to AMDGPU now that the sizer kernel receives a device-side
`RuntimeContext` pointer and runs cleanly there.

Also drop the redundant `ad_stack_experimental_enabled=True` on
`test_adstack_triangular_ndrange_self_referential_push_idempotency`:
`test_utils.test` already sets that option unconditionally whenever
`qd.extension.adstack` is in `require`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e_a, shape_b) so the fused body cannot read the shorter ndarray past its buffer end
…tmp hardcoded dump path from adstack SizeExpr comments
…_struct_arg_host so CUDA/AMDGPU launchers no longer deref a swapped device arg_buffer from host
…M-iter ranges and hoist os import in test_adstack
…prDeviceMaxBoundVars) at encode time so the device interpreter's fixed-size scope array stays in bounds regardless of the host-side monotonic var_id counter
…ainst max_size_compile_time (it is only a conservative seed, not a proven bound) and precompute per-axis strides in host evaluate_external_tensor_read so CPU multi-dim ndarray reads resolve to the correct cell
…n: parametrize the mismatched-lengths test with a peak-past-shape-b case that silently under-sizes the adstack, and add a fld[i]-arr[i] test that the fusion must not synthesize a mixed FieldLoad+ExtRead body the LLVM-GPU encoder cannot lower
…h_shape_same_axis widening (cross-ndarray independent loops no longer fabricate an index pairing that drops arr_a peaks past shape_b); decline fusion when the synthesized body would mix a FieldLoad indexed by a free bound variable with an ExternalTensorRead (the LLVM device interpreter has no on-device SNode access); fall back to MaxOverRange_a alone as a sound upper bound on Sub(MaxOverRange, MaxOverRange) whenever fusion cannot fire
… SizeExpr sizer kernel when the CUDA driver + kernel do not expose HMM / system-allocated memory (queried via CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS), mirroring the AMDGPU launcher's existing pattern so the sizer's ctx->arg_buffer PSB read no longer faults with CUDA_ERROR_ILLEGAL_ADDRESS on Turing and older (T4, sm_75) or on any Windows / pre-535 Linux host; HMM-equipped setups keep the zero-staging host-pointer fast path and forward-only launches short-circuit before allocating
@duburcqa duburcqa force-pushed the duburcqa/adstack_llvm_runtime_size branch from ebce54e to afae697 on April 24, 2026 at 21:39
duburcqa added a commit that referenced this pull request Apr 24, 2026
… drop exhaustive arch lists

With the AMDGPU device-context fix from PR #550 in the stack, the
runtime-eval sizer path now runs cleanly on every backend. Two tests
introduced in this PR (`bounded_inner_loop_sized_by_structural_prepass`
and `spirv_metadata_per_task_buffer`) get their arch lists extended to
include AMDGPU, and the few `[qd.cpu, qd.cuda, qd.amdgpu, qd.metal,
qd.vulkan]` lists already inherited from the rebase (after AMDGPU landed
in PR #550 and Metal/Vulkan landed in `2d9f6328d`) collapse to the
default (no `arch=` arg) since that enumerates every supported backend
anyway.

Also drop the redundant `ad_stack_experimental_enabled=True` on
`spirv_metadata_per_task_buffer`: `test_utils.test` sets that option
unconditionally whenever `qd.extension.adstack` is in `require`, and
carrying it on the decorator confuses readers into thinking the test
needs it specifically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@duburcqa duburcqa closed this Apr 25, 2026