perf(amdgpu): replace per-launch hipMallocAsync/Free with persistent …#13
gpinkert merged 1 commit into amd-integration
Conversation
Force-pushed from 6bb8e74 to 445b29e
jamesETsmith left a comment:
One quick question here, and then we'll need to rerun the CI after #3 gets merged. This looks good @peizhang56
@@ -234,7 +235,6 @@ void KernelLauncher::launch_llvm_kernel(Handle handle,
      executor->deallocate_memory_on_device(itr->second.second);
    }
  }
This could return early if transfers.size() is zero, right? I don't know how likely that is in practice, but we should probably add an else and a sync here.
Updated with a fix: if transfers is empty, we check whether the result needs to be synced. We need to sync if memcpy_device_to_host_async was invoked.
Force-pushed from 445b29e to 9a70d5f
…nel launcher

Eliminate per-launch hipMallocAsync/hipFreeAsync and per-copy host blocking on the AMDGPU kernel launcher hot path.

Why: hipMallocAsync/hipFreeAsync on ROCm carry mutex + CLR-bookkeeping overhead per call even on a primed pool. Pool tuning cannot fix it; caching at the application layer can.

Changes:
* New DeviceScratchBuffer (RAII): per-handle device buffer, lazy alloc, grow on demand, freed in the dtor. Used for arg + result buffers.
* Arg-buffer H2D and result-buffer D2H switched to async on the default stream so they overlap kernel execution.
* Single stream_synchronize after the async result D2H so the caller's get_ret() always reads stable host memory (independent of whether the destination is pageable or pinned).
* Dropped the pre-kernel stream_synchronize: the only path that fed it used the synchronous H2D, which already drains the default stream, so the sync was a guaranteed no-op.

Net effect on the launcher hot path: 0 mallocs/frees, 0 sync memcpys, at most 1 stream_synchronize per launch (only when the host actually needs to read back results or transfers).

Prior art: structurally equivalent to Phase 1 (async memcpy) and Phase 2a/b/c (per-handle scratch-buffer caching) of Grant Pinkert's PR #9 (perf/async-hip-memcpy-l1), landed independently on the AMD integration branch. Async-memcpy idea from Hugh Perkins' hp/streams-quadrantsic-2-amdgpu-cpu work (March 2026); see PR #9 for the full lineage.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
Co-Authored-By: Hugh Perkins <hughperkins@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Force-pushed from 50ae365 to 0e46d9a
/run_ci
DeviceScratchBuffer(DeviceScratchBuffer &&other) noexcept
    : stream_(other.stream_),
      ptr_(other.ptr_),
      capacity_(other.capacity_) {
  other.ptr_ = nullptr;
  other.capacity_ = 0;
}
DeviceScratchBuffer &operator=(DeviceScratchBuffer &&other) noexcept {
  if (this != &other) {
    release();
    stream_ = other.stream_;
    ptr_ = other.ptr_;
    capacity_ = other.capacity_;
    other.ptr_ = nullptr;
    other.capacity_ = 0;
  }
  return *this;
}
Shouldn't these two functions be pretty much the same?
DeviceScratchBuffer &operator=(const DeviceScratchBuffer &) = delete;
// Move preserves stream affinity: the destination operates on the same
// stream the source was bound to. The source is left in a valid empty
Technically it is a "valid but unspecified" state.
// stream the source was bound to. The source is left in a valid empty
// state on its (unchanged) stream.
DeviceScratchBuffer(DeviceScratchBuffer &&other) noexcept
    : stream_(other.stream_),
I think other.stream_ is still alive in this case.
//
// The returned pointer is invalidated by any subsequent ensure() call
// that grows the buffer; callers must not retain it across launches.
char *ensure(std::size_t min_bytes) {
Is this ensuring that there are at least min_bytes in the internal buffer, and then returning a pointer with at least that many bytes? Maybe a better name would be t
private:
  void *stream_{nullptr};
  char *ptr_{nullptr};
Why is the type char? Would a uint8_t or a std::byte be more clear?

It is char because the caller was using char*; technically, it should be void*.
if (ptr_ != nullptr) {
  AMDGPUDriver::get_instance().mem_free_async(ptr_, stream_);
  ptr_ = nullptr;
}
AMDGPUDriver::get_instance().malloc_async(
    reinterpret_cast<void **>(&ptr_), min_bytes, stream_);
capacity_ = min_bytes;
}
Suggested change:

void *new_ptr = nullptr;
AMDGPUDriver::get_instance().malloc_async(&new_ptr, min_bytes, stream_);
if (ptr_ != nullptr) {
  AMDGPUDriver::get_instance().mem_free_async(ptr_, stream_);
}
ptr_ = static_cast<char *>(new_ptr);
capacity_ = min_bytes;

This way we keep the previous allocation in case the new malloc_async fails.
/run-ci