
[AutoDiff] Autodiff 14: Bound GfxRuntime::ctx_buffers_ retirement queue across flush() calls#538

Closed
duburcqa wants to merge 1 commit into duburcqa/heap_backed_adstack from duburcqa/split_gfx_ctx_buffers_retirement

Conversation

@duburcqa
Contributor

Bound GfxRuntime::ctx_buffers_ retirement queue across flush() calls

Follow-up to #536. Caps the deferred-free memory that can accumulate across flush() calls without an intervening synchronize(). This is a bounded-wait approximation of the Ideal semaphore-keyed retirement pattern (non-blocking polling); the truly non-blocking variant is deferred to a per-backend RHI change.

TL;DR

#536 fixed a use-after-free on the SPIR-V side by leaving ctx_buffers_ alone in flush() (only clearing it in synchronize() after wait_idle() drains the stream). codex flagged the resulting trade-off: in async workloads that call flush() repeatedly without syncing — e.g. a pipeline of submits that forward-chain semaphores to the next stage — ctx_buffers_ can grow unbounded because nothing reclaims the deferred-free entries.

// quadrants/runtime/gfx/runtime.h
static constexpr std::size_t kPendingRetirementsDepth = 3;
std::deque<std::pair<StreamSemaphore, std::vector<std::unique_ptr<DeviceAllocationGuard>>>> pending_retirements_;

// quadrants/runtime/gfx/runtime.cpp — flush()
if (pending_retirements_.size() >= kPendingRetirementsDepth) {
  device_->get_compute_stream()->command_sync();
  pending_retirements_.clear();
}
if (!ctx_buffers_.empty()) {
  pending_retirements_.emplace_back(sema, std::move(ctx_buffers_));
  ctx_buffers_.clear();
}

Each flush() snapshots the current ctx_buffers_, pairs it with the submission semaphore, and pushes the pair onto a FIFO capped at 3 entries. On the 4th flush without a synchronize(), command_sync() drains the compute stream and the FIFO is cleared. The queue never reclaims an in-flight allocation and never grows beyond 3 * ctx_buffers_per_flush of live-but-deferred memory.
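The bounded-FIFO behaviour described above can be exercised in isolation. A minimal mock, with MockSemaphore, MockBuffer, and drain_count as illustrative stand-ins for the real StreamSemaphore and DeviceAllocationGuard types:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <utility>
#include <vector>

// Stand-in types; the real ones are StreamSemaphore and DeviceAllocationGuard.
struct MockSemaphore {};
struct MockBuffer {};

struct RetirementQueue {
  static constexpr std::size_t kDepth = 3;  // mirrors kPendingRetirementsDepth
  std::deque<std::pair<MockSemaphore,
                       std::vector<std::unique_ptr<MockBuffer>>>> pending;
  int drain_count = 0;  // counts how often the full stream drain fires

  // Mirrors the flush() logic quoted above: drain-and-clear when the FIFO
  // is full, then snapshot the current buffers under the new semaphore.
  void on_flush(MockSemaphore sema,
                std::vector<std::unique_ptr<MockBuffer>> ctx_buffers) {
    if (pending.size() >= kDepth) {
      ++drain_count;    // models device_->get_compute_stream()->command_sync()
      pending.clear();  // every queued batch is now safe to free
    }
    if (!ctx_buffers.empty()) {
      pending.emplace_back(std::move(sema), std::move(ctx_buffers));
    }
  }
};
```

Four consecutive flushes each carrying one buffer trigger the drain exactly once (on the 4th), leaving only the newest batch queued.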

Why the bounded-wait approximation and not the Ideal polling variant

Codex's suggestion was "per-submission retirement tied to the returned semaphore" — i.e. non-blocking poll of each semaphore's signaled status and retirement of exactly the batches that are done. That requires a bool is_signaled() const method on StreamSemaphoreObject plus per-backend implementations:

  • Vulkan: needs a VkFence associated with each submit and vkGetFenceStatus. Quadrants RHI's Vulkan backend (quadrants/rhi/vulkan/vulkan_device.{h,cpp}) today returns only a binary VkSemaphore from submit; adding fence support means either threading a parallel fence through the submit API or binding a fence to the VulkanStreamSemaphoreObject on construction. Binary-semaphore polling is not defined by the spec.
  • Metal: MTLSharedEvent.signaledValue >= targetValue polls cleanly. The current MetalStreamSemaphoreObject (quadrants/rhi/metal/metal_device.{h,mm}) wraps an id<MTLEvent>; swapping or extending to MTLSharedEvent is straightforward but still an RHI API change.
  • CPU: trivial — the submit is synchronous, so "signaled" is always true.

None of those changes are conceptually hard, but collectively they touch the RHI public surface (quadrants/rhi/public_device.h), all three backend implementations, and every call site that threads a semaphore through. That was too large to bundle with #536 or with the heap-backed adstack PRs that actually drive ctx_buffers_ growth. This PR ships the minimum correctness-safe, bounded-growth approximation that does not require any RHI API changes. A follow-up PR can add is_signaled() and swap the command_sync() retirement line for per-entry polling.
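For the promised follow-up, the per-entry polling loop itself is small once is_signaled() exists. A sketch under that assumption, with PollSemaphore mocking the hypothetical method via a shared flag (the real version would poll vkGetFenceStatus / MTLSharedEvent per the backend notes above):

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <utility>
#include <vector>

// Mock of the hypothetical StreamSemaphoreObject::is_signaled(); a shared
// flag stands in for the backend fence/event query.
struct PollSemaphore {
  std::shared_ptr<bool> signaled = std::make_shared<bool>(false);
  bool is_signaled() const { return *signaled; }
};
struct Buffer {};

using Entry = std::pair<PollSemaphore, std::vector<std::unique_ptr<Buffer>>>;

// Non-blocking retirement: pop completed batches from the front of the FIFO.
// Submissions complete in order, so stopping at the first unsignaled entry
// retires exactly the batches that are done.
void retire_completed(std::deque<Entry> &pending) {
  while (!pending.empty() && pending.front().first.is_signaled()) {
    pending.pop_front();  // destroying the entry frees its buffers
  }
}
```

With three queued batches of which the first two have signaled, one call retires exactly those two and leaves the in-flight batch untouched.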

Changes

quadrants/runtime/gfx/runtime.h

  • New std::deque<std::pair<StreamSemaphore, std::vector<std::unique_ptr<DeviceAllocationGuard>>>> pending_retirements_ member.
  • static constexpr std::size_t kPendingRetirementsDepth = 3.
  • Field-level comment block documenting the bounded-wait rationale and the Ideal polling path as follow-up.
  • Add #include <deque> and #include <utility> to the header.

quadrants/runtime/gfx/runtime.cpp

  • flush() — after the submit, if pending_retirements_.size() >= kPendingRetirementsDepth, call command_sync() and clear the FIFO. Then move the current ctx_buffers_ into a new pending_retirements_ entry paired with the just-returned semaphore.
  • synchronize(): wait_idle() already guarantees every cmdlist that ever referenced a buffer in either collection has completed, so both pending_retirements_ and ctx_buffers_ are cleared unconditionally.

Side-effect audit

  • Use-after-free: impossible. pending_retirements_ entries only release their buffers after command_sync() or wait_idle().
  • Unbounded growth in async workloads: bounded at kPendingRetirementsDepth * ctx_buffers_per_flush entries.
  • Latency: command_sync() fires every kPendingRetirementsDepth + 1 flushes without a user-sync. In steady state this is a no-op; in sustained-async workloads it introduces a stall once per N flushes. Tolerable as the correctness-safe default; the polling follow-up removes the stall entirely.
  • Non-SPIR-V backends: unaffected. GfxRuntime is SPIR-V only; LLVM backends use LlvmRuntimeExecutor, which has its own retirement story (synchronous hipFree on AMDGPU via Autodiff 11, cuMemFree_v2 auto-sync on CUDA).

Stack

Autodiff 14 of 14. Based on #493 (SPIR-V heap-backed adstack). End of the chain.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0feb34d6d2


Comment on lines +779 to +781
if (pending_retirements_.size() >= kPendingRetirementsDepth) {
device_->get_compute_stream()->command_sync();
pending_retirements_.clear();
P2: Clear retirement queue on all explicit stream drains

flush() only frees pending_retirements_ when the FIFO depth limit is reached, but HostDeviceContextBlitter::device_to_host() also drains the stream via device_->wait_idle() (in runtime.cpp) for readback kernels. After that wait, all queued retirement batches are already safe to release, yet they remain pinned until a future depth-triggered flush() or synchronize(). In workloads that transition from async flushes to repeated readback kernels without calling synchronize(), this keeps large deferred buffers alive indefinitely and regresses memory reclamation.
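Codex's suggestion amounts to funnelling every explicit stream drain through one place so the FIFO can never outlive a completed wait. A minimal sketch of that shape; Stream, Runtime, and drain_and_retire are illustrative names, not the actual Quadrants API:

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <vector>

struct Buffer {};
struct Stream {
  void wait_idle() {}  // models the device_->wait_idle() full drain
};

struct Runtime {
  Stream stream;
  std::deque<std::vector<std::unique_ptr<Buffer>>> pending_retirements;

  // Every code path that fully drains the stream (synchronize(), the
  // device_to_host() readback path, ...) would call this instead of
  // waiting directly, so retirement never lags behind a completed drain.
  void drain_and_retire() {
    stream.wait_idle();
    pending_retirements.clear();  // everything queued is now GPU-complete
  }
};
```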

Useful? React with 👍 / 👎.

@duburcqa
Contributor Author

Closing per discussion: the unbounded-growth path requires a workload that calls flush() without ever touching any Python-side observable (result fetch, to_numpy(), etc.), which does not exist in Quadrants today. The FIFO-depth bound in this PR trades a theoretical cost for a real, measurable command_sync() stall at an arbitrary depth threshold — net-negative against the pre-existing behaviour that is already correctness-safe after #536. The proper fix (non-blocking per-semaphore polling, bool is_signaled() const on StreamSemaphoreObject with per-backend implementations) should stand alone if it is ever motivated by a real workload, not ride in behind a bot comment. A rationale comment is being added to the pre-existing ctx_buffers_ code in #536 explaining why the current simple design is kept.

@duburcqa duburcqa closed this Apr 21, 2026
Comment on lines +739 to 740
pending_retirements_.clear();
ctx_buffers_.clear();

🔴 After any kernel dispatch with external array readback, pending_retirements_ is not cleared despite device_to_host() having called device_->wait_idle() internally, leaving up to kPendingRetirementsDepth batches of already-GPU-complete buffers retained unnecessarily. The fix is to add pending_retirements_.clear(); alongside ctx_buffers_.clear(); in the post-device_to_host cleanup block in launch_kernel().

Extended reasoning...

What the bug is and how it manifests

This PR introduces pending_retirements_, a bounded FIFO of (semaphore, deferred-free buffers) pairs accumulated across flush() calls. When a kernel dispatch triggers the device_to_host readback path in launch_kernel(), the cleanup block clears ctx_buffers_ but omits pending_retirements_.clear(). After the call returns, up to kPendingRetirementsDepth batches of GPU-complete buffers remain pinned in pending_retirements_ until the next synchronize() or FIFO overflow.

The specific code path that triggers it

Inside HostDeviceContextBlitter::device_to_host(), when readback_sizes.size() > 0 (i.e. the kernel has external array write-access), line 195 calls device_->wait_idle() — the same global stream drain that synchronize() relies on. After wait_idle() returns, every cmdlist ever submitted is complete and every buffer in pending_retirements_ is safe to free. Control then returns to launch_kernel(), which executes:

current_cmdlist_ = nullptr;
ctx_buffers_.clear(); // present
// pending_retirements_.clear() is MISSING

Why existing code does not prevent it

synchronize() (lines 739-740 of the diff) correctly clears both pending_retirements_ and ctx_buffers_ after its own wait_idle(). The analogous cleanup site in launch_kernel() was simply not updated when pending_retirements_ was added by this PR. Before this PR only ctx_buffers_ existed and was correctly cleared in the device-to-host path — the regression is the missing second clear.

What the impact is

In any workload that dispatches kernels with external array readback (e.g. a training loop that uses host numpy arrays as loss inputs/outputs), each dispatched kernel that hits the readback path leaves up to kPendingRetirementsDepth * ctx_buffers_per_flush deferred-free GPU allocations alive after the stream has fully drained. The memory is bounded (not a leak) and is not a use-after-free (the allocations remain valid), but it accumulates across the training loop until the user calls an explicit qd.sync() or until the FIFO overflows on the next flush().

How to fix it

Add pending_retirements_.clear(); to the post-device_to_host cleanup in launch_kernel():

if (ctx_blitter->device_to_host(current_cmdlist_.get(), any_arrays, ext_array_grads, ext_array_size)) {
current_cmdlist_ = nullptr;
pending_retirements_.clear(); // add this
ctx_buffers_.clear();
}

Step-by-step proof

  1. Kernel K1 is launched; flush() is called — ctx_buffers_ (containing the old adstack heap buffer after a grow) is snapshotted into pending_retirements_[0] paired with semaphore S1.
  2. Kernel K2 is launched with an external array argument that has WRITE access. device_to_host() calls device_->wait_idle(), fully draining the stream. S1 has signaled; the buffer in pending_retirements_[0] is safe to free.
  3. The post-device_to_host cleanup sets current_cmdlist_ = nullptr and calls ctx_buffers_.clear() — but pending_retirements_ still holds the batch from step 1.
  4. Kernels K3-K5 are dispatched without readback; each flush() pushes another entry. After K5's flush pending_retirements_.size() == kPendingRetirementsDepth == 3 and the FIFO overflow path finally calls command_sync() + pending_retirements_.clear(), releasing the buffer from step 1 approximately 3 flushes later than necessary.

Comment on lines +153 to +154
// submission batch. On the `kPendingRetirementsDepth + 1`-th flush without a sync, the oldest pair is drained
// via `wait_semaphore()` before the new one is pushed — the FIFO depth acts as a hard cap on how many queued

🟡 Two comments in this PR describe a per-semaphore wait-and-pop-front retirement mechanism that does not exist: runtime.h:154 says the oldest pair is "drained via wait_semaphore()" (no such method exists on the RHI), and runtime.cpp:772–773 says the code "block[s] on the oldest entry's semaphore, drain[s] it, and pop[s] it" (single-slot pop-front). The actual implementation calls command_sync() — a full vkQueueWaitIdle/MTL-equivalent stream drain — followed by pending_retirements_.clear(), which wipes the entire FIFO, not just the front entry. Future developers implementing the promised non-blocking polling follow-up will be misled into thinking per-semaphore waiting and targeted pop_front retirement are already in place.

Extended reasoning...

What the bug is. The PR introduces a bounded FIFO (pending_retirements_) to cap deferred-free memory across repeated flush() calls. Two comments describing the overflow-handling path use fictional mechanism names that do not match the code: (1) runtime.h:154 says the oldest entry is "drained via wait_semaphore()"; (2) runtime.cpp:772–773 says "we block on the oldest entry's semaphore, drain it, and pop it before pushing the new one." Both imply a per-semaphore wait on the front deque entry followed by a single pop_front retirement.

Concrete code path. When flush() runs with pending_retirements_.size() >= kPendingRetirementsDepth, the actual code executes:

device_->get_compute_stream()->command_sync();  // full stream drain
pending_retirements_.clear();                    // wipes entire FIFO

There is no wait_semaphore() method anywhere in the RHI public surface (public_device.h exposes only command_sync(), submit(wait_semaphores={}), and submit_synced(wait_semaphores={})). The StreamSemaphore stored in each pair is not consumed; it is simply destroyed when the pair is erased by clear().

Why existing code does not prevent it. The comments were written to describe the Ideal polling follow-up behavior (per-semaphore non-blocking retirement) rather than the approximation actually implemented. They were presumably written as aspirational documentation and not updated when the implementation was finalised. The code itself is functionally correct; the mismatch is purely in the comments.

Impact. Any developer following the commit log to implement the promised is_signaled() / polling follow-up will read both comments and assume (a) a wait_semaphore() method should exist or be added, and (b) only the front entry needs to be retired when the FIFO is full. They will not realise the current code does a full-stream drain clearing all entries until they step through it, wasting investigation time or worse, designing an incremental patch on top of a false model.

Proof by example. Suppose kPendingRetirementsDepth = 3 and four consecutive flush() calls each push one entry:

  • Flush 1–3: entries pushed; FIFO = [A, B, C], size() == 3.
  • Flush 4: size() >= 3, so code calls command_sync() (waits for all three A/B/C submissions to complete, not just A's semaphore), then calls pending_retirements_.clear() (removes A, B, and C). The comment's narrative — "block on the oldest entry's semaphore [A], drain it, and pop it" — would leave [B, C] intact. The actual outcome leaves the FIFO empty.
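The divergence in that example can be reproduced with plain strings standing in for the retirement entries, contrasting the implemented overflow path (full clear) with the behaviour the stale comments describe (pop only the oldest):

```cpp
#include <cassert>
#include <deque>
#include <string>

// What the code actually does on the 4th flush: command_sync() followed by
// pending_retirements_.clear() wipes the whole FIFO.
std::deque<std::string> overflow_clear(std::deque<std::string> fifo) {
  fifo.clear();
  return fifo;
}

// What the stale comments narrate: wait on the oldest entry's semaphore,
// then pop only that entry, leaving the rest queued.
std::deque<std::string> comment_narrative(std::deque<std::string> fifo) {
  fifo.pop_front();
  return fifo;
}
```

Starting from the FIFO [A, B, C], the implemented path returns an empty queue while the narrated path would return [B, C].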

How to fix. Update both comments to accurately describe what happens: command_sync() performs a full compute-stream drain (equivalent to vkQueueWaitIdle/MTLWaitUntilCompleted), and pending_retirements_.clear() retires all queued entries at once, not just the oldest. The comments should also drop references to wait_semaphore().


@duburcqa duburcqa deleted the duburcqa/split_gfx_ctx_buffers_retirement branch April 21, 2026 08:36