perf(amdgpu): replace per-launch hipMallocAsync/Free with persistent …#13
gpinkert merged 1 commit into amd-integration
Conversation
Force-pushed from 6bb8e74 to 445b29e
jamesETsmith left a comment:
One quick question here, and then we'll need to rerun the CI after #3 gets merged. This looks good @peizhang56
@@ -234,7 +235,6 @@ void KernelLauncher::launch_llvm_kernel(Handle handle,
      executor->deallocate_memory_on_device(itr->second.second);
    }
  }
This could return early if transfers.size() is zero, right? I don't know how likely that is in practice, but we should probably add an else and a sync here.
Updated with a fix: if transfers is empty, we check whether the result needs to be synced. We need to sync if memcpy_device_to_host_async was invoked.
Force-pushed from 445b29e to 9a70d5f
…nel launcher

Eliminate per-launch hipMallocAsync/hipFreeAsync and per-copy host blocking on the AMDGPU kernel launcher hot path.

Why: hipMallocAsync/hipFreeAsync on ROCm carry mutex + CLR-bookkeeping overhead per call even on a primed pool. Pool tuning cannot fix it; caching at the application layer can.

Changes:
* New DeviceScratchBuffer (RAII): per-handle device buffer, lazy alloc, grow on demand, freed in the dtor. Used for arg + result buffers.
* Arg-buffer H2D and result-buffer D2H switched to async on the default stream so they overlap kernel execution.
* Single stream_synchronize after the async result D2H so the caller's get_ret() always reads stable host memory (independent of whether the destination is pageable or pinned).
* Dropped the pre-kernel stream_synchronize: the only path that fed it used the synchronous H2D, which already drains the default stream, so the sync was a guaranteed no-op.

Net effect on the launcher hot path: 0 mallocs/frees, 0 sync memcpys, at most 1 stream_synchronize per launch (only when the host actually needs to read back results or transfers).

Prior art: structurally equivalent to Phase 1 (async memcpy) and Phase 2a/b/c (per-handle scratch-buffer caching) of Grant Pinkert's PR #9 (perf/async-hip-memcpy-l1), landed independently on the AMD integration branch. Async-memcpy idea from Hugh Perkins' hp/streams-quadrantsic-2-amdgpu-cpu work (March 2026); see PR #9 for the full lineage.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
Co-Authored-By: Hugh Perkins <hughperkins@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Force-pushed from 50ae365 to 0e46d9a
/run_ci
DeviceScratchBuffer(DeviceScratchBuffer &&other) noexcept
    : stream_(other.stream_),
      ptr_(other.ptr_),
      capacity_(other.capacity_) {
  other.ptr_ = nullptr;
  other.capacity_ = 0;
}
DeviceScratchBuffer &operator=(DeviceScratchBuffer &&other) noexcept {
  if (this != &other) {
    release();
    stream_ = other.stream_;
    ptr_ = other.ptr_;
    capacity_ = other.capacity_;
    other.ptr_ = nullptr;
    other.capacity_ = 0;
  }
  return *this;
}
Shouldn't these two functions be pretty much the same?
DeviceScratchBuffer &operator=(const DeviceScratchBuffer &) = delete;
// Move preserves stream affinity: the destination operates on the same
// stream the source was bound to. The source is left in a valid empty
Technically it is a "valid but unspecified" state.
// stream the source was bound to. The source is left in a valid empty
// state on its (unchanged) stream.
DeviceScratchBuffer(DeviceScratchBuffer &&other) noexcept
    : stream_(other.stream_),
I think other.stream_ is still alive in this case.
//
// The returned pointer is invalidated by any subsequent ensure() call
// that grows the buffer; callers must not retain it across launches.
char *ensure(std::size_t min_bytes) {
Is this ensuring that there are at least min_bytes in the internal buffer, and then returning a pointer with at least that many bytes? Maybe a better name would be t
private:
  void *stream_{nullptr};
  char *ptr_{nullptr};
Why is the type char? Would a uint8_t or a std::byte be more clear?

It is char because the caller was using char*; technically, it should be void*.
if (ptr_ != nullptr) {
  AMDGPUDriver::get_instance().mem_free_async(ptr_, stream_);
  ptr_ = nullptr;
}
AMDGPUDriver::get_instance().malloc_async(
    reinterpret_cast<void **>(&ptr_), min_bytes, stream_);
capacity_ = min_bytes;
}
Suggested change:

void *new_ptr = nullptr;
AMDGPUDriver::get_instance().malloc_async(&new_ptr, min_bytes, stream_);
if (ptr_ != nullptr) {
  AMDGPUDriver::get_instance().mem_free_async(ptr_, stream_);
}
ptr_ = static_cast<char *>(new_ptr);
capacity_ = min_bytes;

This way we keep the previous allocation in case the new malloc_async fails.
/run-ci