perf(amdgpu): replace per-launch hipMallocAsync/Free with persistent …#13

Merged
gpinkert merged 1 commit into amd-integration from perf/kernel-gap-fixes
Apr 25, 2026
Conversation


@peizhang56 peizhang56 commented Apr 24, 2026

Eliminate per-launch hipMallocAsync/hipFreeAsync and per-copy host blocking on the AMDGPU kernel launcher hot path.

Why: hipMallocAsync/hipFreeAsync on ROCm carry mutex + CLR-bookkeeping overhead per call even on a primed pool. Pool tuning cannot fix it; caching at the application layer can.

Changes:

  • New DeviceScratchBuffer (RAII): per-handle device buffer, lazy alloc, grow on demand, freed in the dtor. Used for arg + result buffers.
  • Arg-buffer H2D and result-buffer D2H switched to async on the default stream so they overlap kernel execution.
  • Single stream_synchronize after the async result D2H so the caller's get_ret() always reads stable host memory (independent of whether the destination is pageable or pinned).
  • Dropped the pre-kernel stream_synchronize: the only path that fed it used the synchronous H2D, which already drains the default stream, so the sync was a guaranteed no-op.

Net effect on the launcher hot path: 0 mallocs/frees, 0 sync memcpys, at most 1 stream_synchronize per launch (only when the host actually needs to read back results or transfers).
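The grow-on-demand scratch-buffer idea above can be sketched as follows. This is a host-runnable illustration, not the PR's actual class: `ScratchBufferSketch` is a hypothetical name, and plain `std::malloc`/`std::free` stand in for `hipMallocAsync`/`hipFreeAsync` (the real buffer is stream-bound and frees asynchronously).

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical sketch of the DeviceScratchBuffer idea: lazy allocation,
// grow on demand, freed in the destructor. In steady state, ensure()
// performs zero allocations, which is what removes the per-launch
// hipMallocAsync/hipFreeAsync cost from the hot path.
class ScratchBufferSketch {
public:
  ScratchBufferSketch() = default;
  ScratchBufferSketch(const ScratchBufferSketch &) = delete;
  ScratchBufferSketch &operator=(const ScratchBufferSketch &) = delete;
  ~ScratchBufferSketch() { std::free(ptr_); }

  // Returns a buffer of at least min_bytes. Reallocates only on growth;
  // a request at or below the current capacity reuses the same pointer.
  char *ensure(std::size_t min_bytes) {
    if (min_bytes > capacity_) {
      std::free(ptr_);
      ptr_ = static_cast<char *>(std::malloc(min_bytes));
      capacity_ = min_bytes;
    }
    return ptr_;
  }

  std::size_t capacity() const { return capacity_; }

private:
  char *ptr_ = nullptr;
  std::size_t capacity_ = 0;
};
```

For example, `ensure(256)` followed by `ensure(128)` returns the same pointer with capacity still 256; only a larger request reallocates.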

@peizhang56 peizhang56 force-pushed the perf/kernel-gap-fixes branch 4 times, most recently from 6bb8e74 to 445b29e on April 24, 2026 07:36
@peizhang56
Author

(two attached images)

Collaborator

@jamesETsmith jamesETsmith left a comment


One quick question here and then we'll need to rerun the CI after #3 gets merged. This looks good @peizhang56

Comment on lines 227 to 237
@@ -234,7 +235,6 @@ void KernelLauncher::launch_llvm_kernel(Handle handle,
executor->deallocate_memory_on_device(itr->second.second);
}
}
Collaborator


This could return early if transfers.size() is zero, right? I don't know how likely that is in practice, but we should probably add an else and a sync here.

Author


Updated with a fix: if transfers is empty, we check whether the result needs to be synced. We need to sync if memcpy_device_to_host_async was invoked.
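The end-of-launch decision described in this thread can be sketched as a single predicate. This is illustrative only: `launch_needs_stream_sync` and its flags are hypothetical names, not variables from the PR.

```cpp
// Hypothetical sketch of the end-of-launch sync decision: synchronize
// when pending per-handle transfers must be drained, or when an async
// device-to-host copy of the kernel result was issued and the host is
// about to read that memory. Either condition means the host will read
// device-produced bytes next; otherwise the launch stays fully async.
bool launch_needs_stream_sync(bool has_transfers, bool did_async_result_d2h) {
  return has_transfers || did_async_result_d2h;
}
```

This keeps the invariant stated in the description: at most one stream_synchronize per launch, and only when the host actually reads back results or transfers.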

@peizhang56 peizhang56 force-pushed the perf/kernel-gap-fixes branch from 445b29e to 9a70d5f on April 24, 2026 20:08
…nel launcher

Eliminate per-launch hipMallocAsync/hipFreeAsync and per-copy host
blocking on the AMDGPU kernel launcher hot path.

Why: hipMallocAsync/hipFreeAsync on ROCm carry mutex + CLR-bookkeeping
overhead per call even on a primed pool. Pool tuning cannot fix it;
caching at the application layer can.

Changes:
* New DeviceScratchBuffer (RAII): per-handle device buffer, lazy alloc,
  grow on demand, freed in the dtor. Used for arg + result buffers.
* Arg-buffer H2D and result-buffer D2H switched to async on the default
  stream so they overlap kernel execution.
* Single stream_synchronize after the async result D2H so the caller's
  get_ret() always reads stable host memory (independent of whether the
  destination is pageable or pinned).
* Dropped the pre-kernel stream_synchronize: the only path that fed it
  used the synchronous H2D, which already drains the default stream, so
  the sync was a guaranteed no-op.

Net effect on the launcher hot path: 0 mallocs/frees, 0 sync memcpys,
at most 1 stream_synchronize per launch (only when the host actually
needs to read back results or transfers).

Prior art: structurally equivalent to Phase 1 (async memcpy) and
Phase 2a/b/c (per-handle scratch-buffer caching) of Grant Pinkert's PR
#9 (perf/async-hip-memcpy-l1), landed independently on
the AMD integration branch. Async-memcpy idea from Hugh Perkins'
hp/streams-quadrantsic-2-amdgpu-cpu work (March 2026); see PR #9 for
the full lineage.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
Co-Authored-By: Hugh Perkins <hughperkins@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@peizhang56 peizhang56 force-pushed the perf/kernel-gap-fixes branch 2 times, most recently from 50ae365 to 0e46d9a on April 24, 2026 21:46
@peizhang56
Author

/run_ci

Comment on lines +108 to +123
DeviceScratchBuffer(DeviceScratchBuffer &&other) noexcept
: stream_(other.stream_),
ptr_(other.ptr_),
capacity_(other.capacity_) {
other.ptr_ = nullptr;
other.capacity_ = 0;
}

DeviceScratchBuffer &operator=(DeviceScratchBuffer &&other) noexcept {
if (this != &other) {
release();
stream_ = other.stream_;
ptr_ = other.ptr_;
capacity_ = other.capacity_;
other.ptr_ = nullptr;
other.capacity_ = 0;


Shouldn't these two functions be pretty much the same?
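One common way to deduplicate the near-identical move constructor and move assignment is to delegate the constructor to a swap-based move assignment. The sketch below is a hypothetical alternative shape, not the PR's code; `MovableScratch` is an illustrative name and `std::malloc`/`std::free` stand in for the async HIP allocation calls.

```cpp
#include <cstddef>
#include <cstdlib>
#include <utility>

class MovableScratch {
public:
  MovableScratch() = default;
  explicit MovableScratch(std::size_t n)
      : ptr_(static_cast<char *>(std::malloc(n))), capacity_(n) {}
  ~MovableScratch() { std::free(ptr_); }

  MovableScratch(const MovableScratch &) = delete;
  MovableScratch &operator=(const MovableScratch &) = delete;

  // Move construction delegates to move assignment, so the
  // member-by-member transfer lives in exactly one place.
  MovableScratch(MovableScratch &&other) noexcept : MovableScratch() {
    *this = std::move(other);
  }

  MovableScratch &operator=(MovableScratch &&other) noexcept {
    std::swap(ptr_, other.ptr_);
    std::swap(capacity_, other.capacity_);
    return *this;  // other now owns any old buffer; its dtor frees it
  }

  std::size_t capacity() const { return capacity_; }

private:
  char *ptr_ = nullptr;
  std::size_t capacity_ = 0;
};
```

One trade-off versus the PR's explicit release-then-steal assignment: with swap, freeing the destination's old buffer is deferred to the source's destructor rather than happening eagerly inside operator=.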

DeviceScratchBuffer &operator=(const DeviceScratchBuffer &) = delete;

// Move preserves stream affinity: the destination operates on the same
// stream the source was bound to. The source is left in a valid empty


Technically it is a "valid but unspecified state".

// stream the source was bound to. The source is left in a valid empty
// state on its (unchanged) stream.
DeviceScratchBuffer(DeviceScratchBuffer &&other) noexcept
: stream_(other.stream_),


I think the other.stream_ is still alive in this case

//
// The returned pointer is invalidated by any subsequent ensure() call
// that grows the buffer; callers must not retain it across launches.
char *ensure(std::size_t min_bytes) {


Is this ensuring that there are at least min_bytes in the internal buffer, and then returning a ptr with at least that many bytes? Maybe a better name would be t


private:
void *stream_{nullptr};
char *ptr_{nullptr};

@gpinkert gpinkert Apr 24, 2026


Why is the type char? Would a uint8_t or a std::byte be clearer?

Author


It is char because the caller was using char*; technically, it should be void*.

Comment on lines +136 to +143
if (ptr_ != nullptr) {
AMDGPUDriver::get_instance().mem_free_async(ptr_, stream_);
ptr_ = nullptr;
}
AMDGPUDriver::get_instance().malloc_async(
reinterpret_cast<void **>(&ptr_), min_bytes, stream_);
capacity_ = min_bytes;
}


void* new_ptr = nullptr;

AMDGPUDriver::get_instance().malloc_async(&new_ptr, min_bytes, stream_);

if (ptr_ != nullptr) {
  AMDGPUDriver::get_instance().mem_free_async(ptr_, stream_);
}

ptr_ = static_cast<char*>(new_ptr);
capacity_ = min_bytes;

This way we keep the previous allocation in case the new malloc_async fails.
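The allocate-before-free ordering suggested above can be demonstrated with a host-side mock. Everything here is illustrative: `fallible_malloc` and `grow` are hypothetical stand-ins for `malloc_async`/`mem_free_async`, used only to show that a failed growth leaves the old buffer and its contents intact.

```cpp
#include <cstddef>
#include <cstdlib>

// Illustrative allocator that can be made to fail once, standing in for
// a malloc_async call that returns an error.
static bool g_fail_next_alloc = false;

void *fallible_malloc(std::size_t n) {
  if (g_fail_next_alloc) {
    g_fail_next_alloc = false;
    return nullptr;
  }
  return std::malloc(n);
}

// Grow the buffer to min_bytes, allocating the new block before freeing
// the old one. On failure the old ptr/capacity are untouched, so the
// caller still holds a usable (if smaller) buffer.
bool grow(char *&ptr, std::size_t &capacity, std::size_t min_bytes) {
  void *new_ptr = fallible_malloc(min_bytes);
  if (new_ptr == nullptr) {
    return false;  // previous allocation preserved
  }
  std::free(ptr);  // free only after the new allocation succeeded
  ptr = static_cast<char *>(new_ptr);
  capacity = min_bytes;
  return true;
}
```

Contrast with free-then-allocate: if the allocation fails there, the buffer is already gone and the object is left with a dangling or null pointer and a stale capacity.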

@jamesETsmith
Collaborator

/run-ci

@gpinkert gpinkert merged commit 34b3712 into amd-integration Apr 25, 2026
38 of 46 checks passed