
[AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan)#493

Merged
duburcqa merged 2 commits into main from duburcqa/heap_backed_adstack
Apr 24, 2026

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented Apr 17, 2026

Heap-backed adstack on SPIR-V backends (Metal, Vulkan)

Moves SPIR-V adstack storage from per-thread Function-scope arrays to per-dispatch heap StorageBuffers, lifting the Metal / MoltenVK private-memory cap that made Autodiff 10's Function-scope path unusable for real reverse-mode kernels. Also: Metal allocate_memory now surfaces a nil buffer as out_of_memory instead of silently wrapping it, and advisory_total_num_threads is tightened at launch time against the actual ndrange shape.

TL;DR

// quadrants/codegen/spirv/kernel_utils.h
enum class BufferType : int {
  // ...
  AdStackHeapFloat,  // f32 adstacks (primal + adjoint interleaved)
  AdStackHeapInt,    // i32 adstacks (u1 reinterpreted as i32; primal-only, no adjoint slice)
};

Two new per-dispatch StorageBuffers. Each invocation owns an invoc_id * stride slice, sized by the pre-scanned per-thread stride × actual dispatched thread count. Shader indexing uses invoc_id * stride + offset + count, widened to u64 when spirv_has_int64 is available (and the runtime asserts at launch time that the product fits in u32 when it isn't). Other primitive types (f64, i64, …) are hard-errored: the heap packs only {f32, i32, u1}, so the old Function-scope fallback for exotic types is removed because it was never usable on Metal anyway.

Why

Autodiff 10's Function-scope SPIR-V adstack (per-thread Array<T, max_size>) kept working on small kernels but hit two walls on real workloads:

  1. Apple's MSL translator rejects pipelines whose per-thread private-memory footprint exceeds a few hundred kilobytes. A reverse-mode articulated-body dynamics step in Genesis has ~100 i32 and u1 adstacks at max_size=256, totalling ~130 KB per thread — well past the MSL compiler's budget. Pipeline creation fails with XPC_ERROR_CONNECTION_INTERRUPTED, which is not even recoverable on retry.
  2. Even when the shader compiles, the per-thread Function-scope slice is a permanent reservation in the driver's on-chip register allocation, which starves other kernels and drops occupancy.

Both constraints vanish once the storage lives in a shared StorageBuffer sliced by invoc_id. Per-thread shader footprint is O(1) regardless of max_size; the only real limit is MTLDevice.maxBufferLength and the driver's memory pool.

Mechanism

New buffer types (quadrants/codegen/spirv/kernel_utils.{h,cpp})

Two new BufferType enum values — AdStackHeapFloat and AdStackHeapInt — plus buffers_name() cases for them and for the pre-existing ListGen / ExtArr / AdStackOverflow types that were missing (debug buffers_name() calls were hitting QD_ERROR("unrecognized buffer type") on any binding involving those).

The int heap deliberately stores u1 as i32 (matched to the historical Function-scope bool→int remap in IRBuilder::get_array_type). It carries only the primal slice, not the adjoint: auto_diff.cpp's is_real guard only emits AccAdjoint / LoadTopAdj on real-typed stacks, so the int heap never needs an adjoint half.

Pre-scan + eager base emission (spirv_codegen.cpp)

TaskCodegen::run pre-scans the task body before any visitor runs and accumulates ad_stack_heap_per_thread_stride_float_ / ad_stack_heap_per_thread_stride_int_, and maps each AdStackAllocaStmt to its byte offset within the per-thread slice.
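The pre-scan bookkeeping reduces to a running stride per heap plus an offset map. The sketch below is a compact stand-in (struct and member names are illustrative, not the real TaskCodegen fields): f32 slots take 8 bytes because primal and adjoint are interleaved, while i32/u1 slots take 4 bytes because the int heap is primal-only.

```cpp
#include <cstddef>
#include <map>

// Illustrative mirror of the pre-scan: every AdStackAllocaStmt (identified
// here by an integer id) contributes max_size slots to its heap's per-thread
// stride and records the byte offset of its slice within that stride.
struct AdStackPrescan {
  std::size_t stride_float_bytes = 0;           // per-thread f32 heap stride
  std::size_t stride_int_bytes = 0;             // per-thread i32 heap stride
  std::map<int, std::size_t> offset_of_stmt;    // stmt id -> byte offset

  void visit_alloca(int stmt_id, std::size_t max_size, bool is_real) {
    std::size_t &stride = is_real ? stride_float_bytes : stride_int_bytes;
    std::size_t slot_bytes = is_real ? 8 : 4;   // primal+adjoint vs primal-only
    offset_of_stmt[stmt_id] = stride;           // slice starts at current end
    stride += max_size * slot_bytes;
  }
};
```

Running the visitor over three stacks shows how sibling stacks pack back-to-back within one per-thread slice.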

The per-thread heap base invoc_id * stride is emitted eagerly from visit(AdStackAllocaStmt), not lazily at the first Push / LoadTop. A comment explains why: two sibling inner loops would otherwise reuse an SSA id defined in the first loop's body, which doesn't dominate the second — a SPIR-V spec §2.16 dominance violation. Emitting the OpIMul at the outer dispatch body's insertion point guarantees it dominates every sibling loop body that later references it.

u32 vs u64 index arithmetic

When spirv_has_int64 is available the codegen widens invoc_id * stride + offset + count to u64 via OpUConvert. Without Int64 the codegen emits u32 OpIMul and the runtime asserts at launch time that stride * dispatched_threads <= u32_max to catch silent wrap-around aliasing into another thread's slice.
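The two regimes can be sketched as plain C++ (function names hypothetical): with Int64 the slot address is computed exactly in u64; without it, the launch-time guard must first prove the whole heap indexes within u32, otherwise the in-shader u32 multiply would silently wrap into another thread's slice.

```cpp
#include <cstdint>

// Exact slot address when 64-bit integer arithmetic is available in-shader.
inline uint64_t slot_index_u64(uint64_t invoc_id, uint64_t stride,
                               uint64_t offset, uint64_t count) {
  return invoc_id * stride + offset + count;
}

// Launch-time check used when spirv_has_int64 is off: stride * threads must
// fit in u32 so that every in-shader u32 index is exact; false means abort.
inline bool u32_index_safe(uint64_t stride, uint64_t dispatched_threads) {
  return stride * dispatched_threads <= UINT32_MAX;
}
```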

Hard-error non-{f32, i32, u1} types

visit(AdStackAllocaStmt) hard-errors exotic primitive types (f64, i64, f16, …). The dead Function-scope fallback is removed (the AdStackHeapKind::function_scope enum branch, primal_arr / adjoint_arr fields in AdStackSpirv, and the else branch in ad_stack_slot_ptr).

The decision is deliberate: the Function-scope path was demonstrably unusable on Metal for real workloads, and silently falling back to it would paper over a correctness/perf cliff. Hard-erroring surfaces the unsupported combination at compile time with a precise message instead of a silent "your gradient is now backed by ~40 GB of Function-scope memory and Metal returned nil".

Runtime heap growth (runtime/gfx/runtime.{h,cpp})

GfxRuntime gains four new fields: a DeviceAllocationGuard + size for each of the float and int heaps. launch_kernel computes required = stride * dispatched_threads * sizeof(element) per binding and grows the heap via amortised doubling when required > current_size. On grow:

size_t new_size = std::max(required, 2 * adstack_heap_buffer_float_size_);
auto [buf, res] = device_->allocate_memory_unique({new_size, ...});
// Fallback when doubling overshoots a device limit (e.g. Metal's maxBufferLength):
// retry at exactly `required` bytes before aborting.
if (res != RhiResult::success && new_size > required) {
  new_size = required;
  std::tie(buf, res) = device_->allocate_memory_unique({new_size, ...});
}
QD_ASSERT_INFO(res == RhiResult::success, ...);

The retry-at-required fallback covers Metal's maxBufferLength cap: at old_size=150 MB, required=165 MB the doubled 300 MB request fails, but the 165 MB retry succeeds. Without the fallback the process would abort with a spurious out-of-memory (claude bot flagged this; fix is applied symmetrically to both float and int grow paths).
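The grow policy reduces to a small pure function. The sketch below models it under illustrative names, with a device cap standing in for Metal's maxBufferLength: doubling first, retry at exactly `required` if doubling overshoots the cap, genuine OOM otherwise.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>

// Toy model of the grow policy: returns the size actually allocated, or
// nullopt when even the exact requirement exceeds the device limit.
inline std::optional<std::size_t> grow_heap(std::size_t old_size,
                                            std::size_t required,
                                            std::size_t device_cap) {
  std::size_t new_size = std::max(required, 2 * old_size);
  if (new_size <= device_cap) return new_size;  // amortised doubling succeeds
  if (required <= device_cap) return required;  // retry-at-required fallback
  return std::nullopt;                          // genuine OOM -> assert fires
}
```

In the PR's example (sizes in MB): growing from 150 to a 165 requirement under a 256 cap fails the doubled 300 request but succeeds at 165.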

Old buffers on grow are moved into ctx_buffers_ (deferred-free) rather than freed synchronously — any in-flight cmdlist referencing them stays valid. Autodiff 11's flush() fix is load-bearing here: clearing ctx_buffers_ on submit would GPU-side use-after-free the displaced buffers.

Empty-dispatch guard: when required == 0 (empty field) the binding uses kDeviceNullAllocation instead of asking the RHI for a zero-sized buffer, which trips RHI_ASSERT(params.size > 0) on Vulkan.

advisory_total_num_threads tightening

For SPIR-V dynamic range_for kernels, codegen previously set advisory_total_num_threads = kMaxNumThreadsGridStrideLoop = 131072 as the fallback because the range bound wasn't known at codegen time. The runtime then sized the per-dispatch adstack heap at 131072 * per_thread_stride * sizeof(element), which for a deep reverse kernel crossed Metal's maxBufferLength even when the actual iteration count was tiny.

This PR records the shape-lookup product backing a runtime-resolved end_stmt into a new RangeForAttributes::end_shape_product vector at codegen time. At launch, GfxRuntime::launch_kernel reads each referenced arr.shape[axis] from the LaunchContextBuilder args buffer and tightens advisory_total_num_threads to the actual launch-time iteration count (6 for a 2×3 ndarray, not 131072). The in-shader grid-stride loop already handles any dispatched thread count correctly; the tight cap just means each dispatched thread processes fewer idle strides.
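A minimal sketch of the tightening, assuming the recorded product is just the list of referenced axis extents resolved at launch time (names hypothetical — the real path reads arr.shape[axis] out of the args buffer):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// The codegen-time fallback cap is reduced to the actual launch-time
// iteration count, i.e. the product of the resolved axis extents.
inline std::size_t tighten_advisory_threads(
    std::size_t fallback_cap, const std::vector<std::size_t> &axis_extents) {
  std::size_t iters = 1;
  for (std::size_t e : axis_extents) iters *= e;
  return std::min(fallback_cap, iters);
}
```

A 2×3 ndarray tightens 131072 down to 6; a genuinely large range keeps the grid-stride cap.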

Metal allocate_memory returns out_of_memory on nil

MetalDevice::allocate_memory now checks newBufferWithLength: == nil and returns RhiResult::out_of_memory (with an error log naming params.size and the device's maxBufferLength). Previously it wrapped nil in MetalMemory and returned RhiResult::success, so every subsequent setBuffer:atIndex:... bound nil: writes were dropped silently, reads came back as zero, and reverse-mode kernels that hit this path produced NaN gradients without any error (a divide-by-zero in a .normalized() sqrt adjoint that reloaded a never-actually-written primal).
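The shape of the fix, reduced to a hedged C++ stand-in (the real code is Objective-C++ against the Metal API; the types below are illustrative, not the real RHI declarations): a null result from the underlying allocator is mapped to an explicit error instead of being wrapped and returned as success.

```cpp
#include <cstddef>

// Illustrative result / allocation types standing in for the RHI's.
enum class RhiResult { success, out_of_memory };
struct Alloc { void *ptr = nullptr; };

// Map a null allocation to an explicit error instead of wrapping it.
inline RhiResult allocate_checked(void *raw, Alloc *out) {
  if (raw == nullptr) return RhiResult::out_of_memory;  // surface, don't wrap
  out->ptr = raw;
  return RhiResult::success;
}
```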

Also, Metal pipeline-creation failures that previously returned RhiResult::success with *out_pipeline == nullptr now return RhiResult::error, so launches on a null pipeline become catchable exceptions.

Docs (docs/source/user_guide/autodiff.md)

  • Drops the "SPIR-V on-chip cap" limitation from the known-limitations list.
  • Adds a "Memory cost" section with the formula num_threads * stack_size * bytes_per_element * num_loop_carried_variables and a per-backend element-size table (LLVM = 8 B for f32 / i32 because primal+adjoint; SPIR-V = 8 for f32, 4 for i32 because primal-only, 4 for bool widened to i32). Includes a worked example: ndrange(1024, 1024) × default_ad_stack_size=256 × 4 f32 vars ≈ 8 GB.
  • Order-of-remedies for OOM errors (drop default_ad_stack_size, reduce loop-carried vars, raise device_memory_fraction).
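The memory-cost formula is easy to sanity-check in code; the snippet below reproduces the docs' worked example (helper name hypothetical): ndrange(1024, 1024) threads × stack size 256 × 8 bytes per f32 element (primal + adjoint) × 4 loop-carried variables is exactly 8 GiB.

```cpp
#include <cstdint>

// heap bytes = num_threads * stack_size * bytes_per_element * num_vars
inline uint64_t adstack_heap_bytes(uint64_t num_threads, uint64_t stack_size,
                                   uint64_t bytes_per_element,
                                   uint64_t num_vars) {
  return num_threads * stack_size * bytes_per_element * num_vars;
}
```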

Tests

Concentrated in this PR because the heap-backed behaviour only exists after 493c lands:

  • test_adstack_rejects_unsupported_type — SPIR-V hard-errors f64 / i64 adstacks at compile time. Skip-gated on spirv_has_int8 (Vulkan drivers without it reject i8 at the SPIR-V type gate before the adstack guard fires). Uses i8 as the probe because Metal / MoltenVK rejects f64 at the field-writer stage before codegen.
  • test_adstack_mixed_f32_and_non_f32 — f32 + i32 adstacks in one kernel. Exercises both the AdStackHeapFloat and AdStackHeapInt paths simultaneously; finite-difference cross-check.
  • test_adstack_many_non_f32_stacks_heap_backed — six sibling dynamic loops × six data-dependent ifs = ~12 i32 + u1 adstacks per kernel on Metal. Function-scope storage would reject the pipeline; heap-backed keeps Function-scope memory bounded.
  • test_adstack_large_capacity_heap_backed — ad_stack_size=4096 on Metal with a single loop-carried variable. The old Function-scope path would fail shader compile; heap-backed runs to completion.
  • test_adstack_ndrange_over_ndarray_shape_does_not_oversize_heap — grad kernel over qd.ndrange(arr.shape[0], arr.shape[1]). Pre-fix, allocated ~40 GB of adstack heap (131072 fallback × 10 loop-carried × 4096 × 4). Post-fix, tightens to the actual 6-iteration count. Finite-difference cross-check guards against the nil-binding NaN mode.
  • test_adstack_near_capacity[overflow=True,False] — re-parametrized to pin default_ad_stack_size=32 on both sides of the K+2=size bound (previously only pinned the no-overflow side).

Side-effect audit

| Concern | Verdict |
| --- | --- |
| Exotic primitive types on SPIR-V | Hard-errored with a clear message. The Function-scope fallback was unusable on Metal for real workloads; keeping it would have been a correctness / perf cliff. |
| Metal maxBufferLength cap on grow | Retry-at-required fallback (claude bot fix); symmetric on both float and int heaps. |
| Empty-field dispatches | kDeviceNullAllocation path skips the zero-size allocation that trips RHI asserts. |
| AMDGPU dealloc_memory pooling | Relies on the synchronous hipFree(context_pointer) tail added in Autodiff 11. The cross-launch safety invariant is spelled out in the LLVM-side .h and .cpp adjacent to adstack_heap_alloc_. |
| Metal nil-buffer silent corruption | Surfaced as RhiResult::out_of_memory and a Python exception. |
| SPIR-V dominance violation | Eager invoc_id * stride emission from visit(AdStackAllocaStmt) (claude bot fix); the header's stale "lazily" comment rewritten to match. |
| u32 overflow on large dispatches | Runtime-asserted when spirv_has_int64 is off; u64 widening via OpUConvert when it's on (a redundant OpBitcast(u64, u64) is avoided). |

Stack

Autodiff 13 of 13. Top-most of the "heap-backed adstack" triplet split. Based on #537 (LLVM heap). End of the chain.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e59c3d40a7


@duburcqa duburcqa changed the title [SPIRV] Heap-backed adstack: lift the Metal/Vulkan per-thread capacity cap [SPIRV] Autodiff 9: Heap-backed adstack: lift the Metal/Vulkan per-thread capacity cap Apr 17, 2026
@duburcqa duburcqa changed the title [SPIRV] Autodiff 9: Heap-backed adstack: lift the Metal/Vulkan per-thread capacity cap [AutoDiff] Autodiff 9: Heap-backed adstack: lift the Metal/Vulkan per-thread capacity cap Apr 17, 2026
@duburcqa
Contributor Author

@claude review

@duburcqa duburcqa changed the title [AutoDiff] Autodiff 9: Heap-backed adstack: lift the Metal/Vulkan per-thread capacity cap [AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan) Apr 21, 2026

@claude claude Bot left a comment


Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 quadrants/codegen/cuda/codegen_cuda.cpp:762-772 — Two new subgroup shuffle operations introduced by this PR — shuffle_down on CUDA/AMDGPU and shuffle_up on CUDA/AMDGPU — will always fail at JIT link time because the required runtime symbols and codegen dispatch cases are absent. Any kernel calling qd.simt.shuffle_down() or qd.simt.shuffle_up() on a CUDA or AMDGPU device will crash with an unresolved-symbol linker error; SPIR-V (Metal/Vulkan) is unaffected. The fix requires (1) adding cuda_shuffle_down_*/amdgpu_shuffle_down_* definitions to runtime.cpp and (2) adding a subgroupShuffleUp dispatch branch to both CUDA and AMDGPU codegen visitors.

    Extended reasoning...

    Bug 1 – Missing runtime symbols for shuffle_down

    The new emit_cuda_shuffle_down helper added in codegen_cuda.cpp (lines 762–772) emits LLVM call instructions for four symbols: cuda_shuffle_down_i32, cuda_shuffle_down_f32, cuda_shuffle_down_f64, and cuda_shuffle_down_i64. The parallel helper emit_amdgpu_shuffle_down in codegen_amdgpu.cpp (lines 455–472) emits corresponding amdgpu_shuffle_down_* symbols. None of these eight symbols are defined anywhere in the runtime module. runtime.cpp only defines the non-directional variants (cuda_shuffle_i32, etc.), and a search of the entire runtime tree confirms that no *_shuffle_down_* function body exists. At JIT link time the LLVM linker will fail with an unresolved external symbol error for every CUDA or AMDGPU kernel that calls qd.simt.shuffle_down().

    Bug 2 – subgroupShuffleUp registered but missing codegen on CUDA/AMDGPU

    internal_ops.inc.h now includes PER_INTERNAL_OP(subgroupShuffleUp) and type_system.cpp registers POLY_OP(subgroupShuffleUp, ...), making qd.simt.shuffle_up() a first-class callable Python API. The SPIR-V codegen correctly emits spv::OpGroupNonUniformShuffleUp. However, the visit(InternalFuncStmt*) override in TaskCodeGenCUDA (lines 730–745) handles subgroupShuffle, subgroupBroadcast, subgroupShuffleDown, and subgroupInvocationId but has no branch for subgroupShuffleUp. The same gap exists in TaskCodeGenAMDGPU. When neither override matches, control falls through to the base-class TaskCodeGenLLVM::visit(InternalFuncStmt*), which emits call(subgroupShuffleUp, args) — another undefined symbol — producing the same JIT linker failure.

    Concrete proof of failure

    Step-by-step for CUDA, shuffle_down with an i32 argument:

    1. User calls qd.simt.shuffle_down(x, 1) in a CUDA kernel.
    2. Frontend lowers this to an InternalFuncStmt with func_name = subgroupShuffleDown.
    3. TaskCodeGenCUDA::visit(InternalFuncStmt*) matches the subgroupShuffleDown branch and calls emit_cuda_shuffle_down(value, dt, offset).
    4. emit_cuda_shuffle_down for an i32 operand emits call(cuda_shuffle_down_i32, offset, value).
    5. The LLVM module for the runtime does not contain a definition of cuda_shuffle_down_i32; the JIT linker reports: undefined symbol: cuda_shuffle_down_i32.
    6. The kernel fails to launch; no user-visible error is reported other than the crash.

    For shuffle_up (CUDA, f32):

    1. User calls qd.simt.shuffle_up(x, 2).
    2. Lowers to InternalFuncStmt with func_name = subgroupShuffleUp.
    3. TaskCodeGenCUDA::visit(InternalFuncStmt*) has no branch for subgroupShuffleUp; falls through to base class.
    4. Base class emits call(subgroupShuffleUp, args) — undefined.
    5. JIT link fails: undefined symbol: subgroupShuffleUp.

    Why existing code doesn't prevent this

    There is no compile-time or type-system guard that rejects subgroupShuffleUp/subgroupShuffleDown for non-SPIR-V targets. The ops are unconditionally registered in type_system.cpp, so they pass type-checking on all backends. The codegen path only fails silently at the very last stage (JIT linking), making the bug hard to detect without actually running a CUDA or AMDGPU kernel.

    Suggested fixes

    For Bug 1: add the eight missing runtime definitions to runtime.cpp (or a new runtime_shuffle.cpp). For CUDA, each function wraps __shfl_down_sync(0xffffffff, v, offset). For AMDGPU, shuffle_down can be emulated with ds_bpermute addressing using (lane + offset) * 4, mirroring the existing non-directional shuffle helpers. For Bug 2: add a subgroupShuffleUp branch to TaskCodeGenCUDA::visit(InternalFuncStmt*) and the equivalent AMDGPU visitor, using a new emit_cuda_shuffle_up / emit_amdgpu_shuffle_up helper backed by __shfl_up_sync / the corresponding AMDGPU permute.

@hughperkins
Collaborator

  • generalizing existing functionality to new backends
    => no new doc or api to review

=> ok to merge


@claude claude Bot left a comment


LGTM — all prior critical issues resolved, human reviewer approved.

Extended reasoning...

Overview

Modifies 9 files across SPIR-V codegen, GFX runtime, Metal RHI, and test suite to move adstack storage from per-thread Function-scope arrays to per-dispatch heap StorageBuffers (AdStackHeapFloat + AdStackHeapInt).

Security risks

No security-sensitive code paths (auth, crypto, permissions) are touched. The Metal nil-buffer fix improves memory safety by surfacing allocation failures as errors rather than silently producing NaN gradients.

Level of scrutiny

This PR received extensive scrutiny across multiple review rounds. All red-severity bugs (use-after-free on heap growth, SPIR-V dominance violation, u32 overflow, empty-dispatch crash, CUDA graph null-pointer) were found, reported, and fixed. Doc formula errors were corrected. The remaining inline comment (nit: test comment overstates the post-fix heap savings by 8×) is documentation-only with no behavioral impact.

Other factors

The single outstanding pre-existing issue (LLVM pre-scan missing StructForStmt/MeshForStmt branches) was introduced in companion PR #492, not this PR. A human reviewer reviewed all feedback and approved. The inline nit about the test comment arithmetic is posted separately.
