
[Perf] Streams 2: Add AMDGPU/HIP stream support#408

Open
hughperkins wants to merge 60 commits into hp/streams-quadrantsic-1-cuda-streams from hp/streams-quadrantsic-2-amdgpu-cpu

Conversation

@hughperkins
Collaborator

Mirrors the CUDA stream implementation for HIP: adds stream_ member to AMDGPUContext, stream_destroy/stream_wait_event/malloc_async/ mem_free_async to HIP driver functions, and AMDGPU branches in all Program stream/event methods. Converts AMDGPU kernel launcher to use async memory operations through the active stream. CPU backend returns 0 handles (no-op).

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

@hughperkins hughperkins marked this pull request as draft March 11, 2026 23:54
Batch the device_result_buffer free into the stream pipeline before the
sync barrier, matching the CUDA kernel launcher's ordering for
consistency and marginal performance improvement.
Use memcpy_host_to_device_async for external array transfers so they
are properly ordered on the active stream, matching the CUDA launcher.
Lower GPU speedup threshold from 1.5x to 1.3x to reduce flakiness in
CI under contention, and print actual timings for diagnostics.
…ead_local

Mirror the CUDA fixes: guard stream_synchronize against handle==0 to
avoid unintentional default stream sync, and make AMDGPUContext::stream_
thread_local for thread-safety.
@hughperkins
Collaborator Author

Opus review (written before last 6 commits above):

Code Review: hp/streams-quadrantsic-2-amdgpu-cpu

Summary

This PR adds AMDGPU (HIP) support for the stream and event APIs that were previously implemented only for CUDA. It touches four areas:

  1. program.cpp -- Mirrors every CUDA stream/event entry point with an AMDGPU equivalent under #ifdef QD_WITH_AMDGPU
  2. AMDGPU RHI layer (amdgpu_context.h/cpp, amdgpu_driver_functions.inc.h) -- Adds stream_ member, getter/setter, missing HIP driver function bindings, and wires the active stream through kernel launch
  3. AMDGPU kernel launcher -- Migrates from synchronous malloc/memcpy/mem_free to their async variants, stream-aware throughout
  4. Tests -- Extends test_create_and_destroy_stream and test_create_and_destroy_event to AMDGPU; adds a new concurrency timing test

The approach is sound: it closely mirrors the existing CUDA stream implementation, and the diff is consistent and well-scoped.


Architecture & Design

The pattern of duplicating each #ifdef QD_WITH_CUDA block with an #ifdef QD_WITH_AMDGPU block in program.cpp is consistent with the existing codebase style. However, this does produce a lot of nearly-identical code (9 blocks). This is a pre-existing design issue and not something to block the PR on, but it's worth noting that a future refactor toward a backend-agnostic GPUDriver interface could eliminate this duplication.

The AMDGPU kernel launcher now structurally matches the CUDA one closely, which is the right approach.


Issues

1. Operation ordering divergence from CUDA kernel launcher (minor)

In the CUDA kernel launcher, mem_free_async(device_result_buffer) is queued on the stream before the transfers are synchronized:

// quadrants/runtime/cuda/kernel_launcher.cpp:144-198
// ... kernel launches ...
if (ctx.arg_buffer_size > 0) {
    CUDADriver::get_instance().mem_free_async(device_arg_buffer, active_stream);
}
if (ctx.result_buffer_size > 0) {
    CUDADriver::get_instance().memcpy_device_to_host_async(
        host_result_buffer, device_result_buffer, ctx.result_buffer_size,
        active_stream);
}
CUDADriver::get_instance().mem_free_async(device_result_buffer,
                                          active_stream);
// copy data back to host
if (transfers.size() > 0) {
    CUDADriver::get_instance().stream_synchronize(active_stream);
    // ...
}

In the AMDGPU version, mem_free_async(device_result_buffer) comes after the transfers sync:

// quadrants/runtime/amdgpu/kernel_launcher.cpp:168-180
if (transfers.size()) {
    AMDGPUDriver::get_instance().stream_synchronize(active_stream);
    for (auto itr = transfers.begin(); itr != transfers.end(); itr++) {
        // ...
    }
}
AMDGPUDriver::get_instance().mem_free_async(device_result_buffer,
                                            active_stream);

This isn't a correctness bug, but it's a minor inefficiency: in the CUDA version, the free is batched into the stream pipeline before the sync barrier; in the AMDGPU version, the sync has already drained the stream, so mem_free_async runs on an idle stream. Consider matching the CUDA ordering for consistency and a marginal performance gain.
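For illustration, the queueing difference can be modeled with a minimal host-side mock (MockStream and the op-name strings are hypothetical stand-ins, not the repo's API):

```python
class MockStream:
    """Records operation order; synchronize() marks a full drain of the queue."""

    def __init__(self):
        self.log = []

    def enqueue(self, op):
        self.log.append(op)

    def synchronize(self):
        self.log.append("SYNC")


# CUDA ordering: the free is enqueued before the barrier, so one drain
# covers both the result copy-back and the buffer free.
cuda = MockStream()
cuda.enqueue("memcpy_device_to_host_async(result)")
cuda.enqueue("mem_free_async(device_result_buffer)")
cuda.synchronize()

# AMDGPU ordering (as reviewed): the sync drains the queue first, then
# the "async" free lands on an already-idle stream.
amd = MockStream()
amd.enqueue("memcpy_device_to_host_async(result)")
amd.synchronize()
amd.enqueue("mem_free_async(device_result_buffer)")

assert cuda.log.index("mem_free_async(device_result_buffer)") < cuda.log.index("SYNC")
assert amd.log.index("mem_free_async(device_result_buffer)") > amd.log.index("SYNC")
```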

2. memcpy_host_to_device for external arrays is still synchronous

// quadrants/runtime/amdgpu/kernel_launcher.cpp:71-72
AMDGPUDriver::get_instance().memcpy_host_to_device(
    (void *)device_ptrs[data_ptr_idx], data_ptr, arr_sz);

The external array transfer path still uses synchronous memcpy_host_to_device, which will operate on the default stream rather than active_stream. This is consistent with the CUDA launcher (which does the same), but worth noting: if a user sets a custom stream and passes host-backed external arrays, the transfer won't be ordered with respect to that stream. This is a pre-existing limitation in both backends.

3. free_kernel_arg_pointer uses synchronous mem_free on async-allocated memory

// quadrants/rhi/amdgpu/amdgpu_context.h:44-50
void free_kernel_arg_pointer() {
    for (auto &i : kernel_arg_pointer_) {
        AMDGPUDriver::get_instance().mem_free(i);
    }
    kernel_arg_pointer_.erase(kernel_arg_pointer_.begin(),
                              kernel_arg_pointer_.end());
}

context_pointer is allocated with malloc_async on active_stream but freed later via mem_free (synchronous). Mixing is allowed by HIP, and the synchronous free will implicitly wait, so this isn't a correctness issue. But if free_kernel_arg_pointer is ever called while the kernel is still in flight on a non-default stream, the synchronous free will stall. This is pre-existing and not introduced by this PR.

4. hipMallocAsync / hipFreeAsync minimum ROCm version requirement

The old code had a comment noting "Malloc_Async and Free_Async are available after ROCm 5.4". This comment was removed in the diff (which is fine since it was in the wrong place), but there's no build-system or runtime check ensuring the user's HIP runtime supports these functions. If someone runs this on ROCm < 5.4, the PER_AMDGPU_FUNCTION dynamic symbol lookup will fail at runtime. Consider documenting the minimum ROCm version requirement, or adding a graceful fallback.
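A graceful fallback could look like the following sketch (the symbol table here is a mock; the real lookup would go through PER_AMDGPU_FUNCTION's dynamic symbol loading, and the helper name is hypothetical):

```python
# Mock symbol tables standing in for the HIP runtime's exported symbols.
ROCM_5_4_SYMBOLS = {"hipMallocAsync": object(), "hipFreeAsync": object()}
OLD_ROCM_SYMBOLS = {}  # pre-5.4 runtime: async allocator symbols absent


def resolve_alloc_path(symbols):
    """Prefer the stream-ordered allocator; fall back to the synchronous
    malloc/free path when the runtime predates ROCm 5.4."""
    if "hipMallocAsync" in symbols and "hipFreeAsync" in symbols:
        return "async"
    return "sync-fallback"


assert resolve_alloc_path(ROCM_5_4_SYMBOLS) == "async"
assert resolve_alloc_path(OLD_ROCM_SYMBOLS) == "sync-fallback"
```

The same check could instead fail fast with a clear "requires ROCm >= 5.4" error, which is simpler than maintaining a second allocation path.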


Test Feedback

5. Timing-based concurrency test may be flaky in CI

The test_concurrent_streams_with_events test asserts speedup > 1.5 on GPU. Timing-based tests are inherently fragile, especially in shared CI environments or on GPUs under contention. The warmup and JIT compilation are handled well, but 5 million spin iterations may not produce enough wall-clock time for a stable ratio on very fast GPUs (where launch overhead dominates), or could be too slow on older GPUs.

Consider:

  • Logging the actual timings (serial/stream) for diagnostic purposes even on pass
  • Using a more relaxed threshold (e.g., > 1.3) to reduce flakiness
  • Adding a pytest.mark.slow or similar marker if this isn't intended to run in every CI pass
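The first two suggestions can be sketched host-side (threads and sleeps stand in for stream-launched kernels; the 1.3 threshold and the timing printout are the suggested changes, not the repo's current test code):

```python
import threading
import time


def spin(duration=0.1):
    # Stand-in for a GPU kernel launched on one stream.
    time.sleep(duration)


def timed(fn):
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0


def run_serial():
    spin()
    spin()


def run_concurrent():
    workers = [threading.Thread(target=spin) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


serial_time = timed(run_serial)
stream_time = timed(run_concurrent)
speedup = serial_time / stream_time
# Log the actual timings even on pass, so CI flakes are diagnosable:
print(f"serial={serial_time:.3f}s stream={stream_time:.3f}s speedup={speedup:.2f}x")
assert speedup > 1.3, f"Expected >1.3x speedup, got {speedup:.2f}x"
```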

6. test_concurrent_streams_with_events uses @test_utils.test() (all archs) including CPU

The test correctly handles CPU with a relaxed speedup > 0.75 check. However, the CPU path still runs two heavy spin loops (5M iterations each) serially, which adds significant wall time to the CPU test suite. If this is a concern, you could restrict the test to GPU-only and keep the CPU path in a lighter test.


Nits

  • The removed comment block in kernel_launcher.cpp (lines 37-44 in the old code) was verbose but contained useful context about why device_result_buffer must always be allocated. The replacement is more concise and loses that specific error-message context, but is adequate.

  • In program.cpp, the function set_current_cuda_stream is semantically misnamed for AMDGPU -- it sets the HIP stream, not a CUDA stream. This is pre-existing (the function name comes from the base branch), but might be worth a follow-up rename to something like set_current_gpu_stream.


Verdict

The PR is clean, well-structured, and correctly mirrors the CUDA stream implementation for AMDGPU. The core changes (async memory operations, stream plumbing through context and kernel launch, driver function bindings) are correct. The issues raised are mostly minor consistency/efficiency points and pre-existing design concerns. I'd recommend addressing items 1 (ordering) and 4 (ROCm version) before merging; the rest can be tracked as follow-ups.

@hughperkins
Collaborator Author

I'm reverting the relaxation of the timing threshold in the tests. I think 1.5x is already pretty relaxed.

@hughperkins hughperkins marked this pull request as ready for review March 12, 2026 01:24
@hughperkins hughperkins changed the title [Perf] Streams part 2: Add AMDGPU/HIP stream support [Perf] Streams 2: Add AMDGPU/HIP stream support Mar 12, 2026
@hughperkins hughperkins marked this pull request as draft March 12, 2026 04:59
Comment on lines +128 to +132
PER_AMDGPU_FUNCTION(stream_wait_event,
                    hipStreamWaitEvent,
                    void *,
                    void *,
                    uint32);
Contributor


Maybe it is time to increase the line width of C++ code. 80 chars is painful nowadays. I would recommend either 100 or 120, as a matter of taste.

Comment on lines -36 to -43
// Here we have to guarantee the result_result_buffer isn't nullptr
// It is interesting - The code following
// L60: DeviceAllocation devalloc =
// executor->allocate_memory_on_device( call another kernel and it will result
// in
// Memory access fault by GPU node-1 (Agent handle: 0xeda5ca0) on address
// (nil). Reason: Page not present or supervisor privilege.
// if you don't allocate it.
Contributor


Why did you remove this comment? Is it no longer applicable? Is it irrelevant?

…reams' into hp/streams-quadrantsic-2-amdgpu-cpu

Made-with: Cursor

# Conflicts:
#	quadrants/rhi/amdgpu/amdgpu_context.cpp
#	quadrants/rhi/amdgpu/amdgpu_context.h
#	quadrants/rhi/amdgpu/amdgpu_driver_functions.inc.h
#	quadrants/runtime/amdgpu/kernel_launcher.cpp
Made-with: Cursor
…reams' into hp/streams-quadrantsic-2-amdgpu-cpu

Made-with: Cursor

# Conflicts:
#	quadrants/rhi/amdgpu/amdgpu_context.h
#	quadrants/runtime/amdgpu/kernel_launcher.cpp
The pure-Python perf_dispatch timing test is unreliable on Mac Metal
and Vulkan (MoltenVK) where timing differences between implementations
are too small to consistently pick the fastest one.

Made-with: Cursor
…in test_perf_dispatch.py (take base branch's narrower exclude list)

Made-with: Cursor
@hughperkins
Collaborator Author

Migrated to use a single PR on streams 4.

@hughperkins hughperkins reopened this Apr 28, 2026
@github-actions

Coverage Report (228150ad2)

File Coverage Missing
🟢 tests/python/test_streams.py 100%

Diff coverage: 100% · Overall: 73% · 61 lines, 0 missing

Full annotated report

@github-actions

github-actions Bot commented May 1, 2026

Coverage Report (ce2fc6bfd)

File Coverage Missing
🟢 tests/python/test_streams.py 98% 270

Diff coverage: 98% · Overall: 74% · 61 lines, 1 missing

Full annotated report

…reams' into hp/streams-quadrantsic-2-amdgpu-cpu
@github-actions

github-actions Bot commented May 1, 2026

Coverage Report (117a71ffb)

File Coverage Missing
🟢 tests/python/test_streams.py 100%

Diff coverage: 100% · Overall: 74% · 61 lines, 0 missing

Full annotated report

hughperkins and others added 3 commits May 2, 2026 01:46
Streams + autodiff now throws an exception (base branch commit 360adc8),
so our stream-aware adstack fixes in shared code (llvm_runtime_executor.cpp)
and the AMDGPU resolve_num_threads async DtoH are unnecessary. Take the base
branch version for those shared-file sections. Re-apply AMDGPU-specific
non-autodiff stream work: active_stream async DtoH/HtoD in
launch_llvm_kernel, context_synchronize/device_synchronize in synchronize().

Co-authored-by: Cursor <cursoragent@cursor.com>
…reams' into hp/streams-quadrantsic-2-amdgpu-cpu
The comment explains a non-obvious race condition: context_pointer must be
freed directly (now via mem_free_async on active_stream) rather than through
AMDGPUContext's deferred free list, because that list is drained by
LlvmRuntimeExecutor::synchronize which can be called mid-launch.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions

github-actions Bot commented May 2, 2026

Coverage Report (6e49c52d1)

File Coverage Missing
🟢 tests/python/test_streams.py 100%

Diff coverage: 100% · Overall: 74% · 61 lines, 0 missing

Full annotated report

hughperkins and others added 5 commits May 2, 2026 04:57
Base branch extracted stream/event methods from program.cpp into
program_stream.cpp (CUDA only). Add the AMDGPU branches with
make_current() calls to all nine methods in the new file.

Co-authored-by: Cursor <cursoragent@cursor.com>
…reams' into hp/streams-quadrantsic-2-amdgpu-cpu
Co-authored-by: Cursor <cursoragent@cursor.com>
…reams' into hp/streams-quadrantsic-2-amdgpu-cpu
…pile_config()

The base branch refactored stream/event methods from Program:: to
StreamManager::, which stores the arch in arch_. Our AMDGPU branches
still referenced compile_config().arch which is a Program method.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions

github-actions Bot commented May 2, 2026

Coverage Report (d3317f5cf)

File Coverage Missing
🟢 tests/python/test_streams.py 98% 270

Diff coverage: 98% · Overall: 74% · 61 lines, 1 missing

Full annotated report

hughperkins and others added 7 commits May 2, 2026 12:19
…reams' into hp/streams-quadrantsic-2-amdgpu-cpu
…sis-Embodied-AI/gstaichi into hp/streams-quadrantsic-2-amdgpu-cpu
…reams' into hp/streams-quadrantsic-2-amdgpu-cpu
…reams' into hp/streams-quadrantsic-2-amdgpu-cpu
Co-authored-by: Cursor <cursoragent@cursor.com>
…reams' into hp/streams-quadrantsic-2-amdgpu-cpu
@github-actions

github-actions Bot commented May 2, 2026

Coverage Report (93cd1660e)

File Coverage Missing
🟢 tests/python/test_streams.py 100%

Diff coverage: 100% · Overall: 74% · 61 lines, 0 missing

Full annotated report

@hughperkins
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93cd1660eb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread: quadrants/program/program_stream.cpp (Outdated)
}
#endif
#ifdef QD_WITH_AMDGPU
if (arch_ == Arch::amdgpu && stream_handle != 0) {


P2: Synchronize default AMDGPU stream when handle is zero

Remove the stream_handle != 0 guard here so the default stream can be synchronized. In this API, handle 0 represents the default stream (as used by Event.wait()/record() defaults and by backend stream semantics), but this branch currently turns stream_synchronize(0) into a no-op on AMDGPU, so callers that explicitly synchronize the default stream can proceed while queued work is still running.

Useful? React with 👍 / 👎.
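The handle-zero convention Codex describes can be illustrated with a mock driver (names hypothetical; in HIP, a null stream pointer denotes the default stream, and hipStreamSynchronize(nullptr) synchronizes it):

```python
class MockDriver:
    """Records which stream pointers get synchronized."""

    def __init__(self):
        self.synced = []

    def stream_synchronize(self, stream_ptr):
        # In HIP, a null stream pointer means the default stream.
        self.synced.append(stream_ptr)


def synchronize_stream(driver, stream_handle):
    # Handle 0 is the API's default stream: map it to the backend's
    # null stream instead of guarding it out as a no-op.
    stream_ptr = None if stream_handle == 0 else stream_handle
    driver.stream_synchronize(stream_ptr)


driver = MockDriver()
synchronize_stream(driver, 0)        # default stream: must still sync
synchronize_stream(driver, 0xBEEF)   # explicit stream handle
assert driver.synced == [None, 0xBEEF]
```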

hughperkins and others added 2 commits May 2, 2026 12:18
The stream_handle != 0 guard made synchronize_stream a no-op for the
default stream on AMDGPU, unlike the CUDA path. HIP supports
hipStreamSynchronize(nullptr), so remove the guard to match CUDA
semantics.

Co-authored-by: Cursor <cursoragent@cursor.com>
…reams' into hp/streams-quadrantsic-2-amdgpu-cpu
@hughperkins
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3f5a868958



speedup = serial_time / stream_time
if qd.lang.impl.current_cfg().arch in (qd.cuda, qd.amdgpu):
    assert speedup > 1.5, f"Expected >1.5x speedup, got {speedup:.2f}x"


P2: Relax fixed 1.5x stream-speedup assertion

This assertion makes the test depend on hardware/driver scheduling behavior rather than correctness: CUDA/HIP stream concurrency is workload- and device-dependent, so valid implementations can produce <1.5x speedup (for example on small or heavily occupied GPUs, or when concurrent kernels are not co-scheduled). That can cause nondeterministic CI failures even when stream/event semantics are correct; the test should assert functional ordering/correctness and use a looser or optional perf check.


@github-actions

github-actions Bot commented May 2, 2026

Coverage Report (3f5a86895)

File Coverage Missing
🟢 tests/python/test_streams.py 100%

Diff coverage: 100% · Overall: 74% · 61 lines, 0 missing

Full annotated report

…reams' into hp/streams-quadrantsic-2-amdgpu-cpu
@github-actions

github-actions Bot commented May 3, 2026

Coverage Report (392b19a3c)

File Coverage Missing
🟢 tests/python/test_streams.py 100%

Diff coverage: 100% · Overall: 74% · 61 lines, 0 missing

Full annotated report

