Perf/async hip memcpy l1 #9
Closed
gpinkert wants to merge 5 commits into amd-integration from perf/async-hip-memcpy-l1
Conversation
Force-pushed from 4090a7d to e483b57
Force-pushed from dfb74f8 to e686690
peizhang56 added a commit that referenced this pull request on Apr 24, 2026
…nel launcher

Eliminate per-launch hipMallocAsync/hipFreeAsync and per-copy host blocking
on the AMDGPU kernel launcher hot path.

Why: hipMallocAsync/hipFreeAsync on ROCm carry mutex + CLR-bookkeeping
overhead per call even on a primed pool. Pool tuning cannot fix it; caching
at the application layer can.

Changes:
* New DeviceScratchBuffer (RAII): per-handle device buffer, lazy alloc,
  grow on demand, freed in the dtor. Used for arg + result buffers.
* Arg-buffer H2D and result-buffer D2H switched to async on the default
  stream so they overlap kernel execution.
* Single stream_synchronize after the async result D2H so the caller's
  get_ret() always reads stable host memory (independent of whether the
  destination is pageable or pinned).
* Dropped the pre-kernel stream_synchronize: the only path that fed it used
  the synchronous H2D, which already drains the default stream, so the sync
  was a guaranteed no-op.

Net effect on the launcher hot path: 0 mallocs/frees, 0 sync memcpys, at
most 1 stream_synchronize per launch (only when the host actually needs to
read back results or transfers).

Benchmark (Genesis, 8192 envs, 500 steps, fp32, MI300X), avg of 5 runs:
wall_time=5.88s FPS=85.0 throughput=696,334 env*steps/s
vs baseline (608,418 env*steps/s, wall_time=6.76s): +14.4% throughput

Prior art: structurally equivalent to Phase 1 (async memcpy) and Phase 2a/b/c
(per-handle scratch-buffer caching) of Grant Pinkert's PR #9
(perf/async-hip-memcpy-l1), landed independently on the AMD integration
branch. Async-memcpy idea from Hugh Perkins' hp/streams-quadrantsic-2-amdgpu-cpu
work (March 2026); see PR #9 for the full lineage.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
Co-Authored-By: Hugh Perkins <hughperkins@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
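For orientation, a minimal sketch of what a DeviceScratchBuffer of this kind could look like. The class name comes from the commit message; the members, growth policy, and exact API below are assumptions rather than the actual implementation.

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>

// Sketch: per-handle, lazily allocated, grow-on-demand device scratch buffer.
// Nothing is allocated until ensure() is first called; the buffer is only
// released in the destructor, never on the per-launch path.
class DeviceScratchBuffer {
 public:
  DeviceScratchBuffer() = default;
  DeviceScratchBuffer(const DeviceScratchBuffer &) = delete;
  DeviceScratchBuffer &operator=(const DeviceScratchBuffer &) = delete;

  ~DeviceScratchBuffer() {
    if (ptr_) {
      (void)hipFree(ptr_);  // best-effort; OS reclaims at process exit anyway
    }
  }

  // Return a device pointer good for at least `size` bytes, growing if needed.
  // Old contents are not preserved across a growth step.
  void *ensure(size_t size, hipStream_t stream) {
    if (size > capacity_) {
      if (ptr_) {
        (void)hipFreeAsync(ptr_, stream);
      }
      (void)hipMallocAsync(&ptr_, size, stream);
      capacity_ = size;
    }
    return ptr_;
  }

 private:
  void *ptr_ = nullptr;
  size_t capacity_ = 0;
};
```

Growth here reallocates rather than copying, which is consistent with scratch-buffer semantics: the contents only need to survive a single launch.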
peizhang56 added a commit that referenced this pull request on Apr 24, 2026
…nel launcher

Eliminate per-launch hipMallocAsync/hipFreeAsync and per-copy host blocking
on the AMDGPU kernel launcher hot path.

Why: hipMallocAsync/hipFreeAsync on ROCm carry mutex + CLR-bookkeeping
overhead per call even on a primed pool. Pool tuning cannot fix it; caching
at the application layer can.

Changes:
* New DeviceScratchBuffer (RAII): per-handle device buffer, lazy alloc,
  grow on demand, freed in the dtor. Used for arg + result buffers.
* Arg-buffer H2D and result-buffer D2H switched to async on the default
  stream so they overlap kernel execution.
* Single stream_synchronize after the async result D2H so the caller's
  get_ret() always reads stable host memory (independent of whether the
  destination is pageable or pinned).
* Dropped the pre-kernel stream_synchronize: the only path that fed it used
  the synchronous H2D, which already drains the default stream, so the sync
  was a guaranteed no-op.

Net effect on the launcher hot path: 0 mallocs/frees, 0 sync memcpys, at
most 1 stream_synchronize per launch (only when the host actually needs to
read back results or transfers).

Prior art: structurally equivalent to Phase 1 (async memcpy) and Phase 2a/b/c
(per-handle scratch-buffer caching) of Grant Pinkert's PR #9
(perf/async-hip-memcpy-l1), landed independently on the AMD integration
branch. Async-memcpy idea from Hugh Perkins' hp/streams-quadrantsic-2-amdgpu-cpu
work (March 2026); see PR #9 for the full lineage.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
Co-Authored-By: Hugh Perkins <hughperkins@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… launcher
Convert the synchronous hipMemcpy calls in the AMDGPU kernel launcher to
their async counterparts so the arg-buffer H2D and result-buffer D2H
transfers can overlap with kernel execution and driver bookkeeping
instead of blocking the host. Mirrors what runtime/cuda/kernel_launcher.cpp
already does.
Concretely:
* arg-buffer: memcpy_host_to_device -> memcpy_host_to_device_async
* result-buffer: memcpy_device_to_host -> memcpy_device_to_host_async
* result-buffer free is moved before the external-array transfers-back
block so it is batched into the stream pipeline (same ordering as
CUDA launcher).
* a single stream_synchronize is added before the transfers-back loop
so the readbacks see consistent device state.
The transfers are still enqueued on the default stream (nullptr) -- a
separate change can route them to a per-thread stream once
AMDGPUContext gains a stream_ member (cf. hp/streams-quadrantsic-2-amdgpu-cpu).
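For illustration, a self-contained HIP program following the same pattern: both copies are enqueued on the default stream (nullptr) and the host blocks only once, right before it reads the result back. The kernel and variable names are invented for the example; the actual launcher goes through its memcpy_*_async wrappers rather than raw HIP calls.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Trivial kernel standing in for a launched compute kernel.
__global__ void add_one(const int *in, int *out) { *out = *in + 1; }

int main() {
  int host_arg = 41, host_ret = 0;
  int *dev_arg = nullptr, *dev_ret = nullptr;
  hipMalloc(&dev_arg, sizeof(int));
  hipMalloc(&dev_ret, sizeof(int));

  // Everything is enqueued on the default stream (nullptr); nothing blocks
  // the host until the single synchronize before host_ret is read.
  hipMemcpyAsync(dev_arg, &host_arg, sizeof(int), hipMemcpyHostToDevice, nullptr);
  hipLaunchKernelGGL(add_one, dim3(1), dim3(1), 0, nullptr, dev_arg, dev_ret);
  hipMemcpyAsync(&host_ret, dev_ret, sizeof(int), hipMemcpyDeviceToHost, nullptr);
  hipStreamSynchronize(nullptr);  // one sync per launch: readback is now stable

  printf("%d\n", host_ret);  // prints 42
  hipFree(dev_arg);
  hipFree(dev_ret);
  return 0;
}
```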
Background: __amd_rocclr_copyBuffer dominates the per-kernel-launch
critical path on MI210 because the AMDGPU launcher was issuing two
synchronous hipMemcpy calls per launch. Switching to async does not
reduce the call count, but removes the blocking semantics that were
serialising every launch with two driver round-trips.
Attribution: this change is structurally equivalent to two prior
commits by Hugh Perkins on the hp/streams-quadrantsic-2-amdgpu-cpu
branch (March 11, 2026):
* c12d23e "Convert AMDGPU sync memcpy_host_to_device to async on
active_stream"
* 7555ec5 "Move AMDGPU mem_free_async before transfers sync to
match CUDA ordering"
Hugh's commits target his streams-pool's per-thread active_stream;
this commit applies the same idea to the default stream (nullptr) so
it can land independently of the streams-pool work (which has a
separate use-after-free issue and is not included in this PR).
Co-Authored-By: Hugh Perkins <hughperkins@gmail.com>
Co-Authored-By: gpinkert <gpinkert@amd.com>
Reuse the device-side scratch buffer that holds the kernel's result_buffer
(and serves as the sink for executor->allocate_memory_on_device calls)
across launches instead of doing malloc_async/mem_free_async every time.

Per-launch overhead chain dropped by 2 driver calls (alloc + free), which
also eliminates two implicit serializations on the default stream and
reduces __amd_rocclr_copyBuffer-adjacent CLR machinery work.

The buffer is grow-on-demand, never freed per-launch, and intentionally not
destroyed in the KernelLauncher destructor (AMDGPU context teardown timing
during interpreter shutdown is unsafe; OS reclaims at process exit).

Stream-ordering safety: the previous launch's async D->H read from this
buffer is queued on the default stream before any host read of
host_result_buffer (which always happens behind a stream sync). The next
launch's host_to_device write to the buffer is also enqueued on the default
stream after the prior copy, so reuse is serialized correctly.

Stack: this is Phase 2a of the __amd_rocclr_copyBuffer optimization
campaign, on top of perf/async-hip-memcpy-l1 (ada1545).

Co-Authored-By: gpinkert <gpinkert@amd.com>
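A toy, self-contained illustration of that stream-ordering argument: the device buffers below stand in for the cached scratch buffers and are reused across "launches" with no per-launch alloc/free. Reuse is safe because every producer and consumer of the buffers is enqueued on the same default stream. Names and the kernel are invented for the example, not taken from the launcher.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Toy kernel standing in for a launched kernel that writes a result buffer.
__global__ void square(const int *in, int *out) { *out = (*in) * (*in); }

int main() {
  // One pair of device buffers reused across launches; note the absence of
  // hipMallocAsync/hipFreeAsync inside the loop.
  int *dev_in = nullptr, *dev_out = nullptr;
  hipMalloc(&dev_in, sizeof(int));
  hipMalloc(&dev_out, sizeof(int));

  for (int step = 2; step <= 4; ++step) {
    int host_in = step, host_out = 0;
    // This launch's H->D write into the reused buffer is enqueued on the
    // default stream after the previous launch's D->H read, so reuse is
    // serialized by stream order; the host only reads behind the sync.
    hipMemcpyAsync(dev_in, &host_in, sizeof(int), hipMemcpyHostToDevice, nullptr);
    hipLaunchKernelGGL(square, dim3(1), dim3(1), 0, nullptr, dev_in, dev_out);
    hipMemcpyAsync(&host_out, dev_out, sizeof(int), hipMemcpyDeviceToHost, nullptr);
    hipStreamSynchronize(nullptr);
    printf("step %d: %d\n", step, host_out);
  }

  hipFree(dev_in);
  hipFree(dev_out);
  return 0;
}
```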
Same pattern as the device_result_buffer cache from the previous commit.
Eliminates the second per-launch hipMallocAsync/hipFreeAsync pair (when
arg_buffer_size > 0, which is most kernels with parameters).

Stream-ordering safety: the H2D copy that fills the arg buffer is on the
default stream, the kernel reads it on the same stream, and the next
launch's H2D is also on the default stream -- all serialized. The cached
buffer grows on demand and is never freed per-launch.

Stack: Phase 2b on top of Phase 2a (e6fce2c).

Co-Authored-By: gpinkert <gpinkert@amd.com>
…kernels)
Most kernels in the Genesis rigid-physics step loop are void with no
host->device transfers. With the cached-buffer scheme from Phase 2a we
were still touching the cache on every launch (capacity check), and on the
first launch we'd unconditionally hipMallocAsync 8 bytes even when the
buffer was never used.
Phase 2c flips the cache to lazy: device_result_buffer starts as nullptr
each launch and is only ensure()'d when:
(a) a host-array transfer needs an allocate_memory_on_device sink
(8-byte slot inside the param loop), or
(b) ctx.result_buffer_size > 0 (sized to that value).
For the common void-kernel-no-transfers case, the entire device-result
codepath (alloc + write + readback + free) is bypassed.
Stack: Phase 2c on top of Phase 2b (59cf50e).
Co-Authored-By: gpinkert <gpinkert@amd.com>
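A hedged condensation of that lazy decision as plain C++ (the struct and function names are placeholders, not the launcher's real types): it computes how many bytes the cached result buffer must cover for a launch, with 0 meaning the whole device-result codepath can be skipped.

```cpp
#include <algorithm>
#include <cstddef>

// Placeholder per-launch state; names mirror the commit message, not the
// actual gstaichi launcher.
struct LaunchCtx {
  size_t result_buffer_size = 0;   // > 0 only when the kernel returns values
  bool needs_device_sink = false;  // host-array transfer inside the param loop
};

// 0 => common void-kernel-no-transfers case: skip alloc + write + readback + free.
size_t required_result_buffer_bytes(const LaunchCtx &ctx) {
  size_t bytes = 0;
  if (ctx.needs_device_sink) {
    bytes = 8;  // (a) 8-byte allocate_memory_on_device slot
  }
  if (ctx.result_buffer_size > 0) {
    bytes = std::max(bytes, ctx.result_buffer_size);  // (b) sized to the results
  }
  return bytes;
}
```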
The cached per-handle device_arg_buffer / device_result_buffer added in the
previous Phase-2 commits were intentionally never freed per-launch, to
remove malloc_async/mem_free_async from the per-kernel-launch hot path. The
original commit messages noted that the OS would reclaim them at process
exit.

That left a small functional regression at the gs.destroy() boundary:
before this PR, every per-launch malloc_async was paired with a
mem_free_async, so by the time gs.destroy() fired, all per-launch GPU
memory we'd allocated was gone. With Phase 2a/b/c, the cached buffers
survived gs.destroy() and persisted until process exit. Bounded
(N_kernels * (max_arg + max_result), typically <1 MB for Genesis), but
non-zero -- and a real change in destroy() semantics for users that re-init
within a single process.

This commit adds a destructor to amdgpu::KernelLauncher that walks the
per-handle Context vector and mem_free_async's each cached buffer. The
destructor runs when the ProgramImpl::kernel_launcher_ unique_ptr is reset
during the normal gs.destroy() flow, which happens while the AMDGPU context
(a process-singleton) is still alive -- so the frees go through cleanly and
gs.destroy() once again reclaims every byte we allocated.

For the unusual case where the user never calls gs.destroy() and the
launcher reaches static destruction at process exit, the AMDGPU context may
already be torn down. The mem_free_async calls are best-effort in that
case; they may error or be silently dropped. The OS reclaims the device
memory at process exit anyway, so the worst case is benign.

Hot-path impact: zero. The destructor runs only at launcher tear-down, not
on any per-launch or per-step path.

Stack: Phase 2d on top of Phase 2c (e483b57).
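A rough sketch of what such a destructor could look like, using placeholder types: the real Context and buffer fields live in the AMDGPU launcher sources, and the frees go through the launcher's mem wrappers rather than raw HIP calls.

```cpp
#include <hip/hip_runtime.h>
#include <vector>

// Placeholder per-handle state; field names follow the commit message, not
// the actual gstaichi sources.
struct Context {
  void *device_arg_buffer = nullptr;
  void *device_result_buffer = nullptr;
};

class KernelLauncher {
 public:
  ~KernelLauncher() {
    // Runs when ProgramImpl resets kernel_launcher_ during gs.destroy(),
    // while the AMDGPU context singleton is still alive, so the frees land.
    // If this is only reached at static destruction (no gs.destroy()), the
    // frees are best-effort; the OS reclaims device memory at process exit.
    for (auto &ctx : contexts_) {
      if (ctx.device_arg_buffer) {
        (void)hipFreeAsync(ctx.device_arg_buffer, nullptr);
      }
      if (ctx.device_result_buffer) {
        (void)hipFreeAsync(ctx.device_result_buffer, nullptr);
      }
    }
  }

 private:
  std::vector<Context> contexts_;
};
```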
Force-pushed from e686690 to fd6c95a
Author
/run-ci
TL;DR: Uniform +5% to +15% benchmark boost