Perf/async hip memcpy l1 #9
Closed
gpinkert wants to merge 5 commits into amd-integration from perf/async-hip-memcpy-l1
Conversation
Force-pushed from 4090a7d to e483b57
Force-pushed from dfb74f8 to e686690
peizhang56 added a commit that referenced this pull request on Apr 24, 2026
…nel launcher

Eliminate per-launch hipMallocAsync/hipFreeAsync and per-copy host blocking
on the AMDGPU kernel launcher hot path.

Why: hipMallocAsync/hipFreeAsync on ROCm carry mutex + CLR-bookkeeping
overhead per call even on a primed pool. Pool tuning cannot fix it; caching
at the application layer can.

Changes:
* New DeviceScratchBuffer (RAII): per-handle device buffer, lazy alloc,
  grow on demand, freed in the dtor. Used for arg + result buffers.
* Arg-buffer H2D and result-buffer D2H switched to async on the default
  stream so they overlap kernel execution.
* Single stream_synchronize after the async result D2H so the caller's
  get_ret() always reads stable host memory (independent of whether the
  destination is pageable or pinned).
* Dropped the pre-kernel stream_synchronize: the only path that fed it used
  the synchronous H2D, which already drains the default stream, so the sync
  was a guaranteed no-op.

Net effect on the launcher hot path: 0 mallocs/frees, 0 sync memcpys, at
most 1 stream_synchronize per launch (only when the host actually needs to
read back results or transfers).

Benchmark (Genesis, 8192 envs, 500 steps, fp32, MI300X), avg of 5 runs:
wall_time=5.88s FPS=85.0 throughput=696,334 env*steps/s
vs baseline (608,418 env*steps/s, wall_time=6.76s): +14.4% throughput

Prior art: structurally equivalent to Phase 1 (async memcpy) and Phase 2a/b/c
(per-handle scratch-buffer caching) of Grant Pinkert's PR #9
(perf/async-hip-memcpy-l1), landed independently on the AMD integration
branch. Async-memcpy idea from Hugh Perkins' hp/streams-quadrantsic-2-amdgpu-cpu
work (March 2026); see PR #9 for the full lineage.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
Co-Authored-By: Hugh Perkins <hughperkins@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
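For orientation, a minimal sketch of what a DeviceScratchBuffer of this kind could look like. The class name comes from the commit message; the members, growth policy, and exact API below are assumptions rather than the actual implementation.

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>

// Sketch: per-handle, lazily allocated, grow-on-demand device scratch buffer.
// Nothing is allocated until ensure() is first called; the buffer is only
// released in the destructor, never on the per-launch path.
class DeviceScratchBuffer {
 public:
  DeviceScratchBuffer() = default;
  DeviceScratchBuffer(const DeviceScratchBuffer &) = delete;
  DeviceScratchBuffer &operator=(const DeviceScratchBuffer &) = delete;

  ~DeviceScratchBuffer() {
    if (ptr_) {
      (void)hipFree(ptr_);  // best-effort; OS reclaims at process exit anyway
    }
  }

  // Return a device pointer good for at least `size` bytes, growing if needed.
  // Old contents are not preserved across a growth step.
  void *ensure(size_t size, hipStream_t stream) {
    if (size > capacity_) {
      if (ptr_) {
        (void)hipFreeAsync(ptr_, stream);
      }
      (void)hipMallocAsync(&ptr_, size, stream);
      capacity_ = size;
    }
    return ptr_;
  }

 private:
  void *ptr_ = nullptr;
  size_t capacity_ = 0;
};
```

Growth here reallocates rather than copying, which is consistent with scratch-buffer semantics: the contents only need to survive a single launch.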
peizhang56 added a commit that referenced this pull request on Apr 24, 2026
…nel launcher

Eliminate per-launch hipMallocAsync/hipFreeAsync and per-copy host blocking
on the AMDGPU kernel launcher hot path.

Why: hipMallocAsync/hipFreeAsync on ROCm carry mutex + CLR-bookkeeping
overhead per call even on a primed pool. Pool tuning cannot fix it; caching
at the application layer can.

Changes:
* New DeviceScratchBuffer (RAII): per-handle device buffer, lazy alloc,
  grow on demand, freed in the dtor. Used for arg + result buffers.
* Arg-buffer H2D and result-buffer D2H switched to async on the default
  stream so they overlap kernel execution.
* Single stream_synchronize after the async result D2H so the caller's
  get_ret() always reads stable host memory (independent of whether the
  destination is pageable or pinned).
* Dropped the pre-kernel stream_synchronize: the only path that fed it used
  the synchronous H2D, which already drains the default stream, so the sync
  was a guaranteed no-op.

Net effect on the launcher hot path: 0 mallocs/frees, 0 sync memcpys, at
most 1 stream_synchronize per launch (only when the host actually needs to
read back results or transfers).

Prior art: structurally equivalent to Phase 1 (async memcpy) and Phase 2a/b/c
(per-handle scratch-buffer caching) of Grant Pinkert's PR #9
(perf/async-hip-memcpy-l1), landed independently on the AMD integration
branch. Async-memcpy idea from Hugh Perkins' hp/streams-quadrantsic-2-amdgpu-cpu
work (March 2026); see PR #9 for the full lineage.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
Co-Authored-By: Hugh Perkins <hughperkins@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… launcher
Convert the synchronous hipMemcpy calls in the AMDGPU kernel launcher to
their async counterparts so the arg-buffer H2D and result-buffer D2H
transfers can overlap with kernel execution and driver bookkeeping
instead of blocking the host. Mirrors what runtime/cuda/kernel_launcher.cpp
already does.
Concretely:
* arg-buffer: memcpy_host_to_device -> memcpy_host_to_device_async
* result-buffer: memcpy_device_to_host -> memcpy_device_to_host_async
* result-buffer free is moved before the external-array transfers-back
block so it is batched into the stream pipeline (same ordering as
CUDA launcher).
* a single stream_synchronize is added before the transfers-back loop
so the readbacks see consistent device state.
The transfers are still enqueued on the default stream (nullptr) -- a
separate change can route them to a per-thread stream once
AMDGPUContext gains a stream_ member (cf. hp/streams-quadrantsic-2-amdgpu-cpu).
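For illustration, a self-contained HIP program following the same pattern: both copies are enqueued on the default stream (nullptr) and the host blocks only once, right before it reads the result back. The kernel and variable names are invented for the example; the actual launcher goes through its memcpy_*_async wrappers rather than raw HIP calls.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Trivial kernel standing in for a launched compute kernel.
__global__ void add_one(const int *in, int *out) { *out = *in + 1; }

int main() {
  int host_arg = 41, host_ret = 0;
  int *dev_arg = nullptr, *dev_ret = nullptr;
  hipMalloc(&dev_arg, sizeof(int));
  hipMalloc(&dev_ret, sizeof(int));

  // Everything is enqueued on the default stream (nullptr); nothing blocks
  // the host until the single synchronize before host_ret is read.
  hipMemcpyAsync(dev_arg, &host_arg, sizeof(int), hipMemcpyHostToDevice, nullptr);
  hipLaunchKernelGGL(add_one, dim3(1), dim3(1), 0, nullptr, dev_arg, dev_ret);
  hipMemcpyAsync(&host_ret, dev_ret, sizeof(int), hipMemcpyDeviceToHost, nullptr);
  hipStreamSynchronize(nullptr);  // one sync per launch: readback is now stable

  printf("%d\n", host_ret);  // prints 42
  hipFree(dev_arg);
  hipFree(dev_ret);
  return 0;
}
```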
Background: __amd_rocclr_copyBuffer dominates the per-kernel-launch
critical path on MI210 because the AMDGPU launcher was issuing two
synchronous hipMemcpy calls per launch. Switching to async does not
reduce the call count, but removes the blocking semantics that were
serialising every launch with two driver round-trips.
Attribution: this change is structurally equivalent to two prior
commits by Hugh Perkins on the hp/streams-quadrantsic-2-amdgpu-cpu
branch (March 11, 2026):
* c12d23e "Convert AMDGPU sync memcpy_host_to_device to async on
active_stream"
* 7555ec5 "Move AMDGPU mem_free_async before transfers sync to
match CUDA ordering"
Hugh's commits target his streams-pool's per-thread active_stream;
this commit applies the same idea to the default stream (nullptr) so
it can land independently of the streams-pool work (which has a
separate use-after-free issue and is not included in this PR).
Co-Authored-By: Hugh Perkins <hughperkins@gmail.com>
Co-Authored-By: gpinkert <gpinkert@amd.com>
Reuse the device-side scratch buffer that holds the kernel's result_buffer
(and serves as the sink for executor->allocate_memory_on_device calls)
across launches instead of doing malloc_async/mem_free_async every time.

Per-launch overhead chain dropped by 2 driver calls (alloc + free), which
also eliminates two implicit serializations on the default stream and
reduces __amd_rocclr_copyBuffer-adjacent CLR machinery work.

The buffer is grow-on-demand, never freed per-launch, and intentionally not
destroyed in the KernelLauncher destructor (AMDGPU context teardown timing
during interpreter shutdown is unsafe; OS reclaims at process exit).

Stream-ordering safety: the previous launch's async D->H read from this
buffer is queued on the default stream before any host read of
host_result_buffer (which always happens behind a stream sync). The next
launch's host_to_device write to the buffer is also enqueued on the default
stream after the prior copy, so reuse is serialized correctly.

Stack: this is Phase 2a of the __amd_rocclr_copyBuffer optimization
campaign, on top of perf/async-hip-memcpy-l1 (ada1545).

Co-Authored-By: gpinkert <gpinkert@amd.com>
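A toy, self-contained illustration of that stream-ordering argument: the device buffers below stand in for the cached scratch buffers and are reused across "launches" with no per-launch alloc/free. Reuse is safe because every producer and consumer of the buffers is enqueued on the same default stream. Names and the kernel are invented for the example, not taken from the launcher.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Toy kernel standing in for a launched kernel that writes a result buffer.
__global__ void square(const int *in, int *out) { *out = (*in) * (*in); }

int main() {
  // One pair of device buffers reused across launches; note the absence of
  // hipMallocAsync/hipFreeAsync inside the loop.
  int *dev_in = nullptr, *dev_out = nullptr;
  hipMalloc(&dev_in, sizeof(int));
  hipMalloc(&dev_out, sizeof(int));

  for (int step = 2; step <= 4; ++step) {
    int host_in = step, host_out = 0;
    // This launch's H->D write into the reused buffer is enqueued on the
    // default stream after the previous launch's D->H read, so reuse is
    // serialized by stream order; the host only reads behind the sync.
    hipMemcpyAsync(dev_in, &host_in, sizeof(int), hipMemcpyHostToDevice, nullptr);
    hipLaunchKernelGGL(square, dim3(1), dim3(1), 0, nullptr, dev_in, dev_out);
    hipMemcpyAsync(&host_out, dev_out, sizeof(int), hipMemcpyDeviceToHost, nullptr);
    hipStreamSynchronize(nullptr);
    printf("step %d: %d\n", step, host_out);
  }

  hipFree(dev_in);
  hipFree(dev_out);
  return 0;
}
```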
Same pattern as the device_result_buffer cache from the previous commit.
Eliminates the second per-launch hipMallocAsync/hipFreeAsync pair (when
arg_buffer_size > 0, which is most kernels with parameters).

Stream-ordering safety: the H2D copy that fills the arg buffer is on the
default stream, the kernel reads it on the same stream, and the next
launch's H2D is also on the default stream -- all serialized. The cached
buffer grows on demand and is never freed per-launch.

Stack: Phase 2b on top of Phase 2a (e6fce2c).

Co-Authored-By: gpinkert <gpinkert@amd.com>
…kernels)
Most kernels in the Genesis rigid-physics step loop are void with no
host->device transfers. With the cached-buffer scheme from Phase 2a we
were still touching the cache on every launch (capacity check), and on the
first launch we'd unconditionally hipMallocAsync 8 bytes even when the
buffer was never used.
Phase 2c flips the cache to lazy: device_result_buffer starts as nullptr
each launch and is only ensure()'d when:
(a) a host-array transfer needs an allocate_memory_on_device sink
(8-byte slot inside the param loop), or
(b) ctx.result_buffer_size > 0 (sized to that value).
For the common void-kernel-no-transfers case, the entire device-result
codepath (alloc + write + readback + free) is bypassed.
Stack: Phase 2c on top of Phase 2b (59cf50e).
Co-Authored-By: gpinkert <gpinkert@amd.com>
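A hedged condensation of that lazy decision as plain C++ (the struct and function names are placeholders, not the launcher's real types): it computes how many bytes the cached result buffer must cover for a launch, with 0 meaning the whole device-result codepath can be skipped.

```cpp
#include <algorithm>
#include <cstddef>

// Placeholder per-launch state; names mirror the commit message, not the
// actual gstaichi launcher.
struct LaunchCtx {
  size_t result_buffer_size = 0;   // > 0 only when the kernel returns values
  bool needs_device_sink = false;  // host-array transfer inside the param loop
};

// 0 => common void-kernel-no-transfers case: skip alloc + write + readback + free.
size_t required_result_buffer_bytes(const LaunchCtx &ctx) {
  size_t bytes = 0;
  if (ctx.needs_device_sink) {
    bytes = 8;  // (a) 8-byte allocate_memory_on_device slot
  }
  if (ctx.result_buffer_size > 0) {
    bytes = std::max(bytes, ctx.result_buffer_size);  // (b) sized to the results
  }
  return bytes;
}
```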
The cached per-handle device_arg_buffer / device_result_buffer added in the
previous Phase-2 commits were intentionally never freed per-launch, to
remove malloc_async/mem_free_async from the per-kernel-launch hot path. The
original commit messages noted that the OS would reclaim them at process
exit.

That left a small functional regression at the gs.destroy() boundary:
before this PR, every per-launch malloc_async was paired with a
mem_free_async, so by the time gs.destroy() fired, all per-launch GPU
memory we'd allocated was gone. With Phase 2a/b/c, the cached buffers
survived gs.destroy() and persisted until process exit. Bounded
(N_kernels * (max_arg + max_result), typically <1 MB for Genesis), but
non-zero -- and a real change in destroy() semantics for users that re-init
within a single process.

This commit adds a destructor to amdgpu::KernelLauncher that walks the
per-handle Context vector and mem_free_async's each cached buffer. The
destructor runs when the ProgramImpl::kernel_launcher_ unique_ptr is reset
during the normal gs.destroy() flow, which happens while the AMDGPU context
(a process-singleton) is still alive -- so the frees go through cleanly and
gs.destroy() once again reclaims every byte we allocated.

For the unusual case where the user never calls gs.destroy() and the
launcher reaches static destruction at process exit, the AMDGPU context may
already be torn down. The mem_free_async calls are best-effort in that
case; they may error or be silently dropped. The OS reclaims the device
memory at process exit anyway, so the worst case is benign.

Hot-path impact: zero. The destructor runs only at launcher tear-down, not
on any per-launch or per-step path.

Stack: Phase 2d on top of Phase 2c (e483b57).
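A rough sketch of what such a destructor could look like, using placeholder types: the real Context and buffer fields live in the AMDGPU launcher sources, and the frees go through the launcher's mem wrappers rather than raw HIP calls.

```cpp
#include <hip/hip_runtime.h>
#include <vector>

// Placeholder per-handle state; field names follow the commit message, not
// the actual gstaichi sources.
struct Context {
  void *device_arg_buffer = nullptr;
  void *device_result_buffer = nullptr;
};

class KernelLauncher {
 public:
  ~KernelLauncher() {
    // Runs when ProgramImpl resets kernel_launcher_ during gs.destroy(),
    // while the AMDGPU context singleton is still alive, so the frees land.
    // If this is only reached at static destruction (no gs.destroy()), the
    // frees are best-effort; the OS reclaims device memory at process exit.
    for (auto &ctx : contexts_) {
      if (ctx.device_arg_buffer) {
        (void)hipFreeAsync(ctx.device_arg_buffer, nullptr);
      }
      if (ctx.device_result_buffer) {
        (void)hipFreeAsync(ctx.device_result_buffer, nullptr);
      }
    }
  }

 private:
  std::vector<Context> contexts_;
};
```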
Force-pushed from e686690 to fd6c95a
Author
/run-ci
TL;DR: Uniform +5% to +15% benchmark boost