[AutoDiff] Autodiff 13: Heap-backed adstack on SPIR-V backends (Metal, Vulkan)#493
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e59c3d40a7
@claude review
Additional findings (outside current diff — PR may have been updated during review):
🔴 `quadrants/codegen/cuda/codegen_cuda.cpp:762-772` — Two new subgroup shuffle operations introduced by this PR — `shuffle_down` and `shuffle_up`, both on CUDA/AMDGPU — will always fail at JIT link time because the required runtime symbols and codegen dispatch cases are absent. Any kernel calling `qd.simt.shuffle_down()` or `qd.simt.shuffle_up()` on a CUDA or AMDGPU device will crash with an unresolved-symbol linker error; SPIR-V (Metal/Vulkan) is unaffected. The fix requires (1) adding `cuda_shuffle_down_*` / `amdgpu_shuffle_down_*` definitions to `runtime.cpp` and (2) adding a `subgroupShuffleUp` dispatch branch to both the CUDA and AMDGPU codegen visitors.
Extended reasoning...
Bug 1 – Missing runtime symbols for `shuffle_down`
The new `emit_cuda_shuffle_down` helper added in `codegen_cuda.cpp` (lines 762–772) emits LLVM `call` instructions for four symbols: `cuda_shuffle_down_i32`, `cuda_shuffle_down_f32`, `cuda_shuffle_down_f64`, and `cuda_shuffle_down_i64`. The parallel helper `emit_amdgpu_shuffle_down` in `codegen_amdgpu.cpp` (lines 455–472) emits the corresponding `amdgpu_shuffle_down_*` symbols. None of these eight symbols is defined anywhere in the runtime module: `runtime.cpp` only defines the non-directional variants (`cuda_shuffle_i32`, etc.), and a search of the entire runtime tree confirms that no `*_shuffle_down_*` function body exists. At JIT link time the LLVM linker will fail with an unresolved-external-symbol error for every CUDA or AMDGPU kernel that calls `qd.simt.shuffle_down()`.
Bug 2 – `subgroupShuffleUp` registered but missing codegen on CUDA/AMDGPU
`internal_ops.inc.h` now includes `PER_INTERNAL_OP(subgroupShuffleUp)` and `type_system.cpp` registers `POLY_OP(subgroupShuffleUp, ...)`, making `qd.simt.shuffle_up()` a first-class callable Python API. The SPIR-V codegen correctly emits `spv::OpGroupNonUniformShuffleUp`. However, the `visit(InternalFuncStmt*)` override in `TaskCodeGenCUDA` (lines 730–745) handles `subgroupShuffle`, `subgroupBroadcast`, `subgroupShuffleDown`, and `subgroupInvocationId` but has no branch for `subgroupShuffleUp`. The same gap exists in `TaskCodeGenAMDGPU`. When neither override matches, control falls through to the base-class `TaskCodeGenLLVM::visit(InternalFuncStmt*)`, which emits `call(subgroupShuffleUp, args)` — another undefined symbol — producing the same JIT linker failure.
Concrete proof of failure
Step-by-step for CUDA, `shuffle_down` with an `i32` argument:
- User calls `qd.simt.shuffle_down(x, 1)` in a CUDA kernel.
- The frontend lowers this to an `InternalFuncStmt` with `func_name = subgroupShuffleDown`.
- `TaskCodeGenCUDA::visit(InternalFuncStmt*)` matches the `subgroupShuffleDown` branch and calls `emit_cuda_shuffle_down(value, dt, offset)`.
- `emit_cuda_shuffle_down` for an `i32` operand emits `call(cuda_shuffle_down_i32, offset, value)`.
- The LLVM module for the runtime does not contain a definition of `cuda_shuffle_down_i32`; the JIT linker reports: undefined symbol: cuda_shuffle_down_i32.
- The kernel fails to launch; no user-visible error is reported other than the crash.
For `shuffle_up` (CUDA, `f32`):
- User calls `qd.simt.shuffle_up(x, 2)`.
- This lowers to an `InternalFuncStmt` with `func_name = subgroupShuffleUp`.
- `TaskCodeGenCUDA::visit(InternalFuncStmt*)` has no branch for `subgroupShuffleUp`, so control falls through to the base class.
- The base class emits `call(subgroupShuffleUp, args)` — undefined.
- The JIT link fails: undefined symbol: subgroupShuffleUp.
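The two failure modes can be modeled in a few lines of Python. This is an illustrative sketch only: the dispatch table, and every symbol name other than the `cuda_shuffle_down_*` family, is invented for the model; the real dispatch lives in the C++ visitors.

```python
# Branches present in the (modeled) CUDA visitor, and the symbols the
# (modeled) runtime actually defines.
VISITOR_BRANCHES = {
    "subgroupShuffle": "cuda_shuffle_i32",
    "subgroupShuffleDown": "cuda_shuffle_down_i32",  # branch exists; Bug 1: symbol doesn't
}
RUNTIME_SYMBOLS = {"cuda_shuffle_i32"}  # non-directional variants only


def codegen(op):
    """No subgroupShuffleUp branch (Bug 2): the base-class fallthrough
    emits a call to a symbol named after the op itself."""
    return VISITOR_BRANCHES.get(op, op)


def jit_link(symbol):
    """Models the JIT link step: unknown symbols abort the launch."""
    if symbol not in RUNTIME_SYMBOLS:
        raise RuntimeError(f"undefined symbol: {symbol}")
    return "linked"


jit_link(codegen("subgroupShuffle"))  # links fine
for op in ("subgroupShuffleDown", "subgroupShuffleUp"):
    try:
        jit_link(codegen(op))
    except RuntimeError as e:
        print(e)  # both ops die at link time, via the two different paths
```

The point of the model: Bug 1 fails even though a dedicated branch exists (the emitted symbol has no runtime body), while Bug 2 fails because the op name itself leaks out as the call target.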
Why existing code doesn't prevent this
There is no compile-time or type-system guard that rejects `subgroupShuffleUp` / `subgroupShuffleDown` for non-SPIR-V targets. The ops are unconditionally registered in `type_system.cpp`, so they pass type-checking on all backends. The codegen path only fails silently at the very last stage (JIT linking), making the bug hard to detect without actually running a CUDA or AMDGPU kernel.
Suggested fixes
For Bug 1: add the eight missing runtime definitions to `runtime.cpp` (or a new `runtime_shuffle.cpp`). For CUDA, each function wraps `__shfl_down_sync(0xffffffff, v, offset)`. For AMDGPU, `shuffle_down` can be emulated with `ds_bpermute` addressing using `(lane + offset) * 4`, mirroring the existing non-directional shuffle helpers. For Bug 2: add a `subgroupShuffleUp` branch to `TaskCodeGenCUDA::visit(InternalFuncStmt*)` and the equivalent AMDGPU visitor, using new `emit_cuda_shuffle_up` / `emit_amdgpu_shuffle_up` helpers backed by `__shfl_up_sync` / the corresponding AMDGPU permute.
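As a reference for what those wrappers must compute, here is a minimal Python model of directional shuffle semantics within one subgroup. It is a pure simulation; the behavior for out-of-range lanes (keeping their own value, as in-range `__shfl_down_sync` / `__shfl_up_sync` usage typically assumes) is an assumption of this sketch, not a claim about the hardware spec.

```python
def shuffle_down(lanes, offset):
    """Lane i receives the value held by lane i + offset; lanes past the
    end keep their own value in this model."""
    n = len(lanes)
    return [lanes[i + offset] if i + offset < n else lanes[i] for i in range(n)]


def shuffle_up(lanes, offset):
    """Lane i receives the value held by lane i - offset; the first
    `offset` lanes keep their own value in this model."""
    return [lanes[i - offset] if i - offset >= 0 else lanes[i]
            for i in range(len(lanes))]


warp = list(range(8))  # values held by 8 lanes: [0, 1, ..., 7]
assert shuffle_down(warp, 1) == [1, 2, 3, 4, 5, 6, 7, 7]
assert shuffle_up(warp, 2) == [0, 1, 0, 1, 2, 3, 4, 5]
```

`shuffle_down` is the classic building block for butterfly reductions (each step halves the active lanes), which is why the missing symbols block real workloads rather than an exotic corner.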
=> ok to merge
LGTM — all prior critical issues resolved, human reviewer approved.
Extended reasoning...
Overview
Modifies 9 files across SPIR-V codegen, GFX runtime, Metal RHI, and test suite to move adstack storage from per-thread Function-scope arrays to per-dispatch heap StorageBuffers (AdStackHeapFloat + AdStackHeapInt).
Security risks
No security-sensitive code paths (auth, crypto, permissions) are touched. The Metal nil-buffer fix improves memory safety by surfacing allocation failures as errors rather than silently producing NaN gradients.
Level of scrutiny
This PR received extensive scrutiny across multiple review rounds. All red-severity bugs (use-after-free on heap growth, SPIR-V dominance violation, u32 overflow, empty-dispatch crash, CUDA graph null-pointer) were found, reported, and fixed. Doc formula errors were corrected. The remaining inline comment (nit: test comment overstates the post-fix heap savings by 8×) is documentation-only with no behavioral impact.
Other factors
The single outstanding pre-existing issue (LLVM pre-scan missing StructForStmt/MeshForStmt branches) was introduced in companion PR #492, not this PR. A human reviewer reviewed all feedback and approved. The inline nit about the test comment arithmetic is posted separately.
Heap-backed adstack on SPIR-V backends (Metal, Vulkan)
TL;DR
Two new per-dispatch StorageBuffers. Each invocation owns an `invoc_id * stride` slice, sized by the pre-scanned per-thread stride × the actual dispatched thread count. Shader indexing uses `invoc_id * stride + offset + count`, widened to u64 when `spirv_has_int64` is available (and the runtime asserts at launch time that the product fits in u32 when it isn't). Other primitive types (f64, i64, …) are hard-errored: the heap packs only {f32, i32, u1}, so the old Function-scope fallback for exotic types is removed because it was never usable on Metal anyway.
Why
Autodiff 10's Function-scope SPIR-V adstack (per-thread `Array<T, max_size>`) kept working on small kernels but hit hard walls on real workloads: at `max_size=256` the per-thread arrays total ~130 KB, well past the MSL compiler's budget, and the pipeline create fails with `XPC_ERROR_CONNECTION_INTERRUPTED`, which is not even recoverable on retry.
Both constraints vanish once the storage lives in a shared StorageBuffer sliced by `invoc_id`. Per-thread shader footprint is O(1) regardless of `max_size`; the only real limit is `MTLDevice.maxBufferLength` and the driver's memory pool.
Mechanism
New buffer types (`quadrants/codegen/spirv/kernel_utils.{h,cpp}`)
Two new `BufferType` enum values — `AdStackHeapFloat` and `AdStackHeapInt` — plus `buffers_name()` cases for them and for the pre-existing `ListGen` / `ExtArr` / `AdStackOverflow` types that were missing (debug `buffers_name()` calls were hitting `QD_ERROR("unrecognized buffer type")` on any binding involving those).
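To make the heap layout concrete, here is a small Python model of how one adstack element is addressed inside the shared buffer, including the launch-time u32 assert mentioned in the TL;DR. Names and shapes are illustrative; the real arithmetic is emitted as SPIR-V.

```python
U32_MAX = 2**32 - 1


def slice_index(invoc_id, stride, offset, count, has_int64=True):
    """Byte index of one adstack element in the shared heap: each
    invocation owns the half-open slice
    [invoc_id * stride, (invoc_id + 1) * stride)."""
    idx = invoc_id * stride + offset + count
    if not has_int64:
        # Without u64 widening the index must fit in u32, otherwise the
        # multiply wraps and silently aliases another thread's slice.
        assert idx <= U32_MAX, "u32 index overflow"
    return idx


# Thread 2, 1 KiB per-thread stride, stack at byte offset 256 inside the
# slice, element 3 of that stack:
assert slice_index(2, 1024, 256, 3) == 2307
```

With `has_int64=False`, a dispatch whose `stride * dispatched_threads` exceeds `U32_MAX` trips the assert instead of wrapping, which mirrors the runtime-side launch check described below.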
IRBuilder::get_array_type). It carries only the primal slice, not the adjoint:auto_diff.cpp'sis_realguard only emitsAccAdjoint/LoadTopAdjon real-typed stacks, so the int heap never needs an adjoint half.Pre-scan + eager base emission (
spirv_codegen.cpp)TaskCodegen::runpre-scans the task body before any visitor runs and accumulatesad_stack_heap_per_thread_stride_float_/ad_stack_heap_per_thread_stride_int_, and maps eachAdStackAllocaStmtto its byte offset within the per-thread slice.The per-thread heap base
invoc_id * strideis emitted eagerly fromvisit(AdStackAllocaStmt)(not lazily at the first Push / LoadTop). Comment explicitly explains why: two sibling inner loops would reuse an SSA id defined in the first loop's body, which doesn't dominate the second — SPIR-V spec §2.16 dominance violation. Emitting the OpIMul at the outer dispatch body's insertion point guarantees it dominates every sibling loop body that later references it.u32 vs u64 index arithmetic
When
spirv_has_int64is available the codegen widensinvoc_id * stride + offset + countto u64 viaOpUConvert. Without Int64 the codegen emits u32OpIMuland the runtime asserts at launch time thatstride * dispatched_threads <= u32_maxto catch silent wrap-around aliasing into another thread's slice.Hard-error non-{f32, i32, u1} types
visit(AdStackAllocaStmt)hard-errors exotic primitive types (f64, i64, f16, …). The dead Function-scope fallback is removed (theAdStackHeapKind::function_scopeenum branch,primal_arr/adjoint_arrfields inAdStackSpirv, and the else branch inad_stack_slot_ptr).The decision is deliberate: the Function-scope path was demonstrably unusable on Metal for real workloads, and silently falling back to it would paper over a correctness/perf cliff. Hard-erroring surfaces the unsupported combination at compile time with a precise message instead of a silent "your gradient is now backed by ~40 GB of Function-scope memory and Metal returned nil".
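The pre-scan bookkeeping can be sketched in Python. The function and tuple shapes are invented for illustration (the real implementation accumulates the C++ fields named above, and keeps separate strides for the float and int heaps; this sketch folds them into one for brevity):

```python
def prescan_adstacks(allocas, elem_size=4):
    """Walk the task body's adstack allocas once, assign each stack a byte
    offset inside the per-thread slice, and return the total per-thread
    stride. `allocas` is a list of (name, max_size, elems_per_entry)
    tuples; elems_per_entry=2 models a primal+adjoint stack."""
    offsets = {}
    stride = 0
    for name, max_size, elems in allocas:
        offsets[name] = stride
        stride += max_size * elems * elem_size
    return offsets, stride


# One f32 stack (capacity 256, primal+adjoint) and one primal-only stack
# (capacity 64):
offsets, stride = prescan_adstacks([("a", 256, 2), ("b", 64, 1)])
assert offsets == {"a": 0, "b": 2048} and stride == 2304
```

The runtime then sizes the heap as `stride * dispatched_threads`, which is why the pre-scan must complete before any visitor emits an access.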
Runtime heap growth (
runtime/gfx/runtime.{h,cpp})GfxRuntimegains four new fields: aDeviceAllocationGuard+ size for each of the float and int heaps.launch_kernelcomputesrequired = stride * dispatched_threads * sizeof(element)per binding and grows the heap via amortised doubling whenrequired > current_size. On grow:The retry-at-required fallback covers Metal's
maxBufferLengthcap: atold_size=150 MB, required=165 MBthe doubled300 MBrequest fails, but the165 MBretry succeeds. Without the fallback the process would abort with a spurious out-of-memory (claude bot flagged this; fix is applied symmetrically to both float and int grow paths).Old buffers on grow are moved into
ctx_buffers_(deferred-free) rather than freed synchronously — any in-flight cmdlist referencing them stays valid. Autodiff 11'sflush()fix is load-bearing here: clearingctx_buffers_on submit would GPU-side use-after-free the displaced buffers.Empty-dispatch guard: when
required == 0(empty field) the binding useskDeviceNullAllocationinstead of asking the RHI for a zero-sized buffer, which tripsRHI_ASSERT(params.size > 0)on Vulkan.advisory_total_num_threadstighteningFor SPIR-V dynamic
range_forkernels, codegen previously setadvisory_total_num_threads = kMaxNumThreadsGridStrideLoop = 131072as the fallback because the range bound wasn't known at codegen time. The runtime then sized the per-dispatch adstack heap at131072 * per_thread_stride * sizeof(element), which for a deep reverse kernel crossed Metal'smaxBufferLengtheven when the actual iteration count was tiny.This PR records the shape-lookup product backing a runtime-resolved
end_stmtinto a newRangeForAttributes::end_shape_productvector at codegen time. At launch,GfxRuntime::launch_kernelreads each referencedarr.shape[axis]from theLaunchContextBuilderargs buffer and tightensadvisory_total_num_threadsto the actual launch-time iteration count (6 for a 2×3 ndarray, not 131072). The in-shader grid-stride loop already handles any dispatched thread count correctly; the tight cap just means each dispatched thread processes fewer idle strides.Metal
allocate_memoryreturnsout_of_memoryon nilMetalDevice::allocate_memorynow checksnewBufferWithLength: == niland returnsRhiResult::out_of_memory(with an error log namingparams.sizeand the device'smaxBufferLength). Previously it wrapped nil inMetalMemoryand returnedRhiResult::success, and every subsequentsetBuffer:atIndex:...bound nil — writes dropped silently, reads came back as zero, and reverse-mode kernels that hit this path produced NaN gradients without any error (divide-by-zero in a.normalized()sqrt adjoint that reloaded a never-actually-written primal).Also surfaces Metal pipeline-creation failures that currently return
*out_pipeline == nullptrasRhiResult::errorinstead ofRhiResult::success, so launches on a null pipeline become catchable exceptions.Docs (
docs/source/user_guide/autodiff.md)num_threads * stack_size * bytes_per_element * num_loop_carried_variablesand a per-backend element-size table (LLVM = 8 B for f32 / i32 because primal+adjoint; SPIR-V = 8 for f32, 4 for i32 because primal-only, 4 for bool widened to i32). Includes a worked example:ndrange(1024, 1024) × default_ad_stack_size=256 × 4 f32 vars ≈ 8 GB.default_ad_stack_size, reduce loop-carried vars, raisedevice_memory_fraction).Tests
Concentrated in this PR because the heap-backed behaviour only exists after 493c lands:
- `test_adstack_rejects_unsupported_type` — SPIR-V hard-errors f64 / i64 adstacks at compile time. Skip-gated on `spirv_has_int8` (Vulkan drivers without it reject i8 at the SPIR-V type gate before the adstack guard fires). Uses i8 as the probe because Metal / MoltenVK rejects f64 at the field-writer stage before codegen.
- `test_adstack_mixed_f32_and_non_f32` — f32 + i32 adstacks in one kernel. Exercises both the `AdStackHeapFloat` and `AdStackHeapInt` paths simultaneously; finite-difference cross-check.
- `test_adstack_many_non_f32_stacks_heap_backed` — six sibling dynamic loops × six data-dependent ifs = ~12 i32 + u1 adstacks per kernel on Metal. Function-scope storage would reject the pipeline; heap-backed keeps Function-scope memory bounded.
- `test_adstack_large_capacity_heap_backed` — `ad_stack_size=4096` on Metal with a single loop-carried variable. The old Function-scope path would fail shader compile; heap-backed runs to completion.
- `test_adstack_ndrange_over_ndarray_shape_does_not_oversize_heap` — grad kernel over `qd.ndrange(arr.shape[0], arr.shape[1])`. Pre-fix, allocated ~40 GB of adstack heap (131072 fallback × 10 loop-carried × 4096 × 4). Post-fix, tightens to the actual 6-iteration count. Finite-difference cross-check guards against the nil-binding NaN mode.
- `test_adstack_near_capacity[overflow=True,False]` — re-parametrized to pin `default_ad_stack_size=32` on both sides of the K+2=size bound (previously only pinned the no-overflow side).
Side-effect audit
- Retry-at-required on the `maxBufferLength` cap on grow.
- The `kDeviceNullAllocation` path skips the zero-size allocation that trips RHI asserts.
- `dealloc_memory` pooling: the `hipFree(context_pointer)` tail added in Autodiff 11. The cross-launch safety invariant is spelled out in the LLVM-side `.h` and `.cpp` adjacent to `adstack_heap_alloc_`.
- Metal nil-buffer allocations now surface as `RhiResult::out_of_memory` and a Python exception.
- Eager `invoc_id * stride` emission from `visit(AdStackAllocaStmt)` (claude bot fix); the header's "lazily" comment rewritten to match.
- u32 index arithmetic when `spirv_has_int64` is off; u64 widening via `OpUConvert` when it's on. `OpBitcast(u64, u64)` avoided.
Stack
Autodiff 13 of 13. Top-most of the "heap-backed adstack" triplet split. Based on #537 (LLVM heap). End of the chain.