[AutoDiff] Autodiff 14: Bound GfxRuntime::ctx_buffers_ retirement queue across flush() calls (#538)
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0feb34d6d2
```cpp
if (pending_retirements_.size() >= kPendingRetirementsDepth) {
  device_->get_compute_stream()->command_sync();
  pending_retirements_.clear();
```
Clear retirement queue on all explicit stream drains
flush() only frees pending_retirements_ when the FIFO depth limit is reached, but HostDeviceContextBlitter::device_to_host() also drains the stream via device_->wait_idle() (in runtime.cpp) for readback kernels. After that wait, all queued retirement batches are already safe to release, yet they remain pinned until a future depth-triggered flush() or synchronize(). In workloads that transition from async flushes to repeated readback kernels without calling synchronize(), this keeps large deferred buffers alive indefinitely and regresses memory reclamation.
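One way to address the comment above would be to route every explicit stream drain through a single helper, so that no drain site can forget to release the retirement queue. The sketch below is illustrative only: `drain_and_retire` and the simplified `Runtime`, `Buffer`, and `Semaphore` types are hypothetical names, not the actual GfxRuntime API.

```cpp
#include <deque>
#include <utility>
#include <vector>

struct Buffer {};       // stand-in for DeviceAllocationUnique
using Semaphore = int;  // stand-in for StreamSemaphore

struct Runtime {
  std::deque<std::pair<Semaphore, std::vector<Buffer>>> pending_retirements_;
  std::vector<Buffer> ctx_buffers_;
  int waits = 0;

  void wait_idle() { ++waits; }  // models device_->wait_idle()

  // Every site that fully drains the stream (synchronize(), the readback
  // path in device_to_host(), the FIFO-overflow branch in flush()) would
  // call this instead of waiting and clearing ad hoc.
  void drain_and_retire() {
    wait_idle();
    pending_retirements_.clear();  // all queued batches are GPU-complete now
    ctx_buffers_.clear();
  }
};
```

After any `drain_and_retire()` both collections are empty, so a later depth-triggered flush can never be the first thing to release already-safe buffers.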
Closing per discussion: the unbounded-growth path requires a workload that calls
```cpp
pending_retirements_.clear();
ctx_buffers_.clear();
```
🔴 After any kernel dispatch with external array readback, pending_retirements_ is not cleared despite device_to_host() having called device_->wait_idle() internally, leaving up to kPendingRetirementsDepth batches of already-GPU-complete buffers retained unnecessarily. The fix is to add pending_retirements_.clear(); alongside ctx_buffers_.clear(); in the post-device_to_host cleanup block in launch_kernel().
Extended reasoning...
What the bug is and how it manifests
This PR introduces pending_retirements_, a bounded FIFO of (semaphore, deferred-free buffers) pairs accumulated across flush() calls. When a kernel dispatch triggers the device_to_host readback path in launch_kernel(), the cleanup block clears ctx_buffers_ but omits pending_retirements_.clear(). After the call returns, up to kPendingRetirementsDepth batches of GPU-complete buffers remain pinned in pending_retirements_ until the next synchronize() or FIFO overflow.
The specific code path that triggers it
Inside HostDeviceContextBlitter::device_to_host(), when readback_sizes.size() > 0 (i.e. the kernel has external array write-access), line 195 calls device_->wait_idle() — the same global stream drain that synchronize() relies on. After wait_idle() returns, every cmdlist ever submitted is complete and every buffer in pending_retirements_ is safe to free. Control then returns to launch_kernel(), which executes:
```cpp
current_cmdlist_ = nullptr;
ctx_buffers_.clear();  // present
// pending_retirements_.clear() is MISSING
```
Why existing code does not prevent it
synchronize() (lines 739-740 of the diff) correctly clears both pending_retirements_ and ctx_buffers_ after its own wait_idle(). The analogous cleanup site in launch_kernel() was simply not updated when pending_retirements_ was added by this PR. Before this PR only ctx_buffers_ existed and was correctly cleared in the device-to-host path — the regression is the missing second clear.
What the impact is
In any workload that dispatches kernels with external array readback (e.g. a training loop that uses host numpy arrays as loss inputs/outputs), each dispatched kernel that hits the readback path leaves up to kPendingRetirementsDepth * ctx_buffers_per_flush deferred-free GPU allocations alive after the stream has fully drained. The memory is bounded (not a leak) and is not a use-after-free (the allocations remain valid), but it accumulates across the training loop until the user calls an explicit qd.sync() or until the FIFO overflows on the next flush().
How to fix it
Add pending_retirements_.clear(); to the post-device_to_host cleanup in launch_kernel():
```cpp
if (ctx_blitter->device_to_host(current_cmdlist_.get(), any_arrays,
                                ext_array_grads, ext_array_size)) {
  current_cmdlist_ = nullptr;
  pending_retirements_.clear();  // add this
  ctx_buffers_.clear();
}
```
Step-by-step proof
- Kernel K1 is launched; flush() is called — ctx_buffers_ (containing the old adstack heap buffer after a grow) is snapshotted into pending_retirements_[0] paired with semaphore S1.
- Kernel K2 is launched with an external array argument that has WRITE access. device_to_host() calls device_->wait_idle(), fully draining the stream. S1 has signaled; the buffer in pending_retirements_[0] is safe to free.
- The post-device_to_host cleanup sets current_cmdlist_ = nullptr and calls ctx_buffers_.clear() — but pending_retirements_ still holds the batch from step 1.
- Kernels K3-K5 are dispatched without readback; each flush() pushes another entry. After K5's flush pending_retirements_.size() == kPendingRetirementsDepth == 3 and the FIFO overflow path finally calls command_sync() + pending_retirements_.clear(), releasing the buffer from step 1 approximately 3 flushes later than necessary.
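The proof above can be modeled with a toy simulation (names mirror the PR's, but the types are simplified stand-ins, not real RHI calls): each `flush()` pushes one retirement batch, and when the readback path forgets `pending_retirements_.clear()`, the depth-3 overflow branch is the only thing that ever releases them.

```cpp
#include <cstddef>
#include <deque>

constexpr std::size_t kPendingRetirementsDepth = 3;

struct Model {
  std::deque<int> pending;  // one int per queued retirement batch
  int syncs = 0;

  void flush(int batch_id) {
    if (pending.size() >= kPendingRetirementsDepth) {
      ++syncs;          // models command_sync(): full stream drain
      pending.clear();  // releases every queued batch
    }
    pending.push_back(batch_id);
  }

  // The buggy readback cleanup: wait_idle() happens inside
  // device_to_host(), but only ctx_buffers_ is cleared; the
  // retirement FIFO is left untouched.
  void readback_without_clear() { /* pending intentionally kept */ }
};
```

Running K1's flush, K2's readback, then three more flushes shows batch 1 surviving the full drain and only being freed by the overflow sync three flushes later.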
```cpp
// submission batch. On the `kPendingRetirementsDepth + 1`-th flush without a sync, the oldest pair is drained
// via `wait_semaphore()` before the new one is pushed — the FIFO depth acts as a hard cap on how many queued
```
🟡 Two comments in this PR describe a per-semaphore wait-and-pop-front retirement mechanism that does not exist: runtime.h:154 says the oldest pair is "drained via wait_semaphore()" (no such method exists on the RHI), and runtime.cpp:772–773 says the code "block[s] on the oldest entry's semaphore, drain[s] it, and pop[s] it" (single-slot pop-front). The actual implementation calls command_sync() — a full vkQueueWaitIdle/MTL-equivalent stream drain — followed by pending_retirements_.clear(), which wipes the entire FIFO, not just the front entry. Future developers implementing the promised non-blocking polling follow-up will be misled into thinking per-semaphore waiting and targeted pop_front retirement are already in place.
Extended reasoning...
What the bug is. The PR introduces a bounded FIFO (pending_retirements_) to cap deferred-free memory across repeated flush() calls. Two comments describing the overflow-handling path use fictional mechanism names that do not match the code: (1) runtime.h:154 says the oldest entry is "drained via wait_semaphore()"; (2) runtime.cpp:772–773 says "we block on the oldest entry's semaphore, drain it, and pop it before pushing the new one." Both imply a per-semaphore wait on the front deque entry followed by a single pop_front retirement.
Concrete code path. When flush() runs with pending_retirements_.size() >= kPendingRetirementsDepth, the actual code executes:
```cpp
device_->get_compute_stream()->command_sync();  // full stream drain
pending_retirements_.clear();                   // wipes entire FIFO
```
There is no `wait_semaphore()` method anywhere in the RHI public surface (`public_device.h` exposes only `command_sync()`, `submit(wait_semaphores={})`, and `submit_synced(wait_semaphores={})`). The `StreamSemaphore` stored in each pair is not consumed; it is simply destroyed when the pair is erased by `clear()`.
Why existing code does not prevent it. The comments were written to describe the Ideal polling follow-up behavior (per-semaphore non-blocking retirement) rather than the approximation actually implemented. They were presumably written as aspirational documentation and not updated when the implementation was finalised. The code itself is functionally correct; the mismatch is purely in the comments.
Impact. Any developer following the commit log to implement the promised is_signaled() / polling follow-up will read both comments and assume (a) a wait_semaphore() method should exist or be added, and (b) only the front entry needs to be retired when the FIFO is full. They will not realise the current code does a full-stream drain clearing all entries until they step through it, wasting investigation time or worse, designing an incremental patch on top of a false model.
Proof by example. Suppose `kPendingRetirementsDepth = 3` and four consecutive `flush()` calls each push one entry:
- Flush 1–3: entries pushed; FIFO = [A, B, C], `size() == 3`.
- Flush 4: `size() >= 3`, so the code calls `command_sync()` (which waits for all three A/B/C submissions to complete, not just A's semaphore), then calls `pending_retirements_.clear()` (which removes A, B, and C). The comment's narrative — "block on the oldest entry's semaphore [A], drain it, and pop it" — would leave [B, C] intact. The actual outcome leaves the FIFO empty.
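The divergence between the narrated and the actual overflow behavior can be pinned down in a few lines. This is a simplified model (the `Fifo`, `narrated_overflow`, and `actual_overflow` names are illustrative, and semaphore waits are elided): the comments describe waiting on A's semaphore and popping only A, while the code does a full drain followed by `clear()`, which retires the whole FIFO at once.

```cpp
#include <deque>
#include <string>

using Fifo = std::deque<std::string>;

// Behavior the comments narrate: retire only the oldest entry.
Fifo narrated_overflow(Fifo fifo) {
  fifo.pop_front();  // "block on A's semaphore, drain it, pop it"
  return fifo;
}

// Behavior the code implements: full stream drain, then wipe everything.
Fifo actual_overflow(Fifo fifo) {
  // command_sync() here waits for A, B and C alike...
  fifo.clear();  // ...so clear() retires all three, not just A
  return fifo;
}
```

Starting from [A, B, C], the narrated path leaves [B, C] queued; the actual path leaves nothing.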
How to fix. Update both comments to accurately describe what happens: command_sync() performs a full compute-stream drain (equivalent to vkQueueWaitIdle/MTLWaitUntilCompleted), and pending_retirements_.clear() retires all queued entries at once, not just the oldest. The comments should also drop references to wait_semaphore().
On the refutation (bug_005 = duplicate of bug_003). The refutation notes that a separately-filed bug_003 also covers the flush() comment in runtime.cpp. Since this merged report intentionally spans both the runtime.h field comment (bug_001) and the runtime.cpp inline comment (bug_005) as two distinct manifestations of the same false narrative, the duplication concern applies only to the .cpp half. The .h half (referencing the non-existent wait_semaphore() method) is a separate location and a separate misconception conveyed to readers of the header alone. The combined report is therefore non-redundant as a whole.
Bound `GfxRuntime::ctx_buffers_` retirement queue across `flush()` calls

TL;DR
#536 fixed a use-after-free on the SPIR-V side by leaving `ctx_buffers_` alone in `flush()` (only clearing it in `synchronize()` after `wait_idle()` drains the stream). Codex flagged the resulting trade-off: in async workloads that call `flush()` repeatedly without syncing — e.g. a pipeline of submits that forward-chain semaphores to the next stage — `ctx_buffers_` can grow unbounded because nothing reclaims the deferred-free entries.

Each `flush()` snapshots the current `ctx_buffers_`, pairs it with the submission semaphore, and pushes the pair onto a FIFO capped at 3 entries. On the 4th flush without a `synchronize()`, `command_sync()` drains the compute stream and the FIFO is cleared. Never reclaims an in-flight allocation; never grows beyond `3 * ctx_buffers_per_flush` of live-but-deferred memory.

Why the bounded-wait approximation and not the Ideal polling variant
Codex's suggestion was "per-submission retirement tied to the returned semaphore" — i.e. a non-blocking poll of each semaphore's signaled status and retirement of exactly the batches that are done. That requires a `bool is_signaled() const` method on `StreamSemaphoreObject` plus per-backend implementations:
- Vulkan: a `VkFence` associated with each submit and `vkGetFenceStatus`. Quadrants RHI's Vulkan backend (`quadrants/rhi/vulkan/vulkan_device.{h,cpp}`) today returns only a binary `VkSemaphore` from submit; adding fence support means either threading a parallel fence through the submit API or binding a fence to the `VulkanStreamSemaphoreObject` on construction. Binary-semaphore polling is not defined by the spec.
- Metal: `MTLSharedEvent.signaledValue >= targetValue` polls cleanly. The current `MetalStreamSemaphoreObject` (`quadrants/rhi/metal/metal_device.{h,mm}`) wraps an `id<MTLEvent>`; swapping or extending to `MTLSharedEvent` is straightforward but still an RHI API change.
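The polling variant, if the RHI grew an `is_signaled()` query, might look roughly like the sketch below. Every name here is hypothetical (`PollableSemaphore`, `poll_retire`); no such method exists in the RHI today, and on a real backend `is_signaled()` would be backed by `vkGetFenceStatus` or an `MTLSharedEvent` value check.

```cpp
#include <deque>
#include <utility>
#include <vector>

struct Buffer {};  // stand-in for DeviceAllocationUnique

struct PollableSemaphore {
  bool signaled = false;  // a real backend would query the GPU here
  bool is_signaled() const { return signaled; }
};

using Batch = std::pair<PollableSemaphore, std::vector<Buffer>>;

// Non-blocking retirement: pop completed batches from the front of the
// FIFO and stop at the first still-in-flight one. This would replace the
// command_sync() + clear() overflow branch without ever stalling.
void poll_retire(std::deque<Batch> &pending) {
  while (!pending.empty() && pending.front().first.is_signaled()) {
    pending.pop_front();  // the batch's buffers are freed here
  }
}
```

Because the FIFO is ordered by submission, stopping at the first unsignaled entry is safe: nothing behind it can have completed out of order on a single stream.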
quadrants/rhi/public_device.h), all three backend implementations, and every call site that threads a semaphore through. That was too large to bundle with #536 or with the heap-backed adstack PRs that actually drive ctx_buffers_ growth. This PR ships the minimum correctness-safe, bounded-growth approximation that does not require any RHI API changes. A follow-up PR can addis_signaled()and swap thecommand_sync()retirement line for per-entry polling.Changes
`quadrants/runtime/gfx/runtime.h`
- `std::deque<std::pair<StreamSemaphore, std::vector<DeviceAllocationUnique>>> pending_retirements_` member.
- `static constexpr std::size_t kPendingRetirementsDepth = 3`.
- `#include <deque>` and `#include <utility>` added to the header.

`quadrants/runtime/gfx/runtime.cpp`
- `flush()` — after the submit, if `pending_retirements_.size() >= kPendingRetirementsDepth`, call `command_sync()` and clear the FIFO. Then move the current `ctx_buffers_` into a new `pending_retirements_` entry paired with the just-returned semaphore.
- `synchronize()` — `wait_idle()` already guarantees every cmdlist that ever referenced a buffer in either collection has completed, so both `pending_retirements_` and `ctx_buffers_` are cleared unconditionally.

Side-effect audit
- `pending_retirements_` entries only release their buffers after `command_sync()` or `wait_idle()`.
- Deferred-free memory is bounded at `kPendingRetirementsDepth * ctx_buffers_per_flush` entries.
- `command_sync()` fires every `kPendingRetirementsDepth + 1` flushes without a user-sync. In steady-state this is a no-op; in sustained-async workloads it introduces a stall once per N flushes. Tolerable as the correctness-safe default; the polling follow-up removes the stall entirely.
- `GfxRuntime` is SPIR-V only. LLVM backends use `LlvmRuntimeExecutor`, which has its own retirement story (synchronous `hipFree` on AMDGPU via Autodiff 11, `cuMemFree_v2` auto-sync on CUDA).

Stack
Autodiff 14 of 14. Based on #493 (SPIR-V heap-backed adstack). End of the chain.