[Perf] Streams 2: Add AMDGPU/HIP stream support #408
hughperkins wants to merge 60 commits into hp/streams-quadrantsic-1-cuda-streams from hp/streams-quadrantsic-2-amdgpu-cpu
Conversation
Mirrors the CUDA stream implementation for HIP: adds stream_ member to AMDGPUContext, stream_destroy/stream_wait_event/malloc_async/ mem_free_async to HIP driver functions, and AMDGPU branches in all Program stream/event methods. Converts AMDGPU kernel launcher to use async memory operations through the active stream. CPU backend returns 0 handles (no-op).
…quadrantsic-2-amdgpu-cpu
Batch the device_result_buffer free into the stream pipeline before the sync barrier, matching the CUDA kernel launcher's ordering for consistency and marginal performance improvement.
Use memcpy_host_to_device_async for external array transfers so they are properly ordered on the active stream, matching the CUDA launcher.
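The ordering described in the two commits above can be sketched with a toy in-order stream queue (the names `memcpy_host_to_device_async`, `mem_free_async`, etc. mirror the commit messages; this is a mock, not the real HIP driver wrappers):

```python
class Stream:
    """Toy in-order stream: work runs in enqueue order."""
    def __init__(self):
        self.ops = []

    def memcpy_host_to_device_async(self, name):
        self.ops.append(f"H2D:{name}")

    def launch(self, kernel):
        self.ops.append(f"kernel:{kernel}")

    def mem_free_async(self, name):
        self.ops.append(f"free:{name}")

    def synchronize(self):
        self.ops.append("sync")

s = Stream()
s.memcpy_host_to_device_async("ext_array")  # transfer ordered before the kernel
s.launch("my_kernel")
s.mem_free_async("device_result_buffer")    # free batched before the barrier
s.synchronize()
print(s.ops)
```

Because every operation is queued on the same stream, the free is guaranteed to run after the kernel but does not need its own host-side round trip before the sync barrier.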
Lower GPU speedup threshold from 1.5x to 1.3x to reduce flakiness in CI under contention, and print actual timings for diagnostics.
…ead_local Mirror the CUDA fixes: guard stream_synchronize against handle==0 to avoid unintentional default stream sync, and make AMDGPUContext::stream_ thread_local for thread-safety.
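The thread_local change can be illustrated with Python's `threading.local`, which is the direct analogue of a C++ `thread_local` member (this is a standalone mock, not the real `AMDGPUContext` API):

```python
import threading

class AMDGPUContext:
    # Analogue of making AMDGPUContext::stream_ thread_local in C++:
    # each host thread gets its own active-stream slot instead of
    # sharing one mutable member.
    _tls = threading.local()

    @classmethod
    def set_stream(cls, handle):
        cls._tls.stream = handle

    @classmethod
    def get_stream(cls):
        # 0 stands for "default stream / no explicit stream selected"
        return getattr(cls._tls, "stream", 0)

AMDGPUContext.set_stream(7)  # main thread selects stream 7
seen_in_worker = []
t = threading.Thread(target=lambda: seen_in_worker.append(AMDGPUContext.get_stream()))
t.start()
t.join()
# The worker thread sees its own default value, not the main thread's 7.
print(AMDGPUContext.get_stream(), seen_in_worker[0])
```

A worker thread launching kernels concurrently therefore cannot observe or clobber another thread's active stream. (The `handle==0` guard mentioned above is revisited later in this thread.)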
Opus review (written before the last 6 commits above):
This reverts commit 60d015b.
I'm reverting the relaxation of the timing threshold in the tests. I think 1.5x is already pretty relaxed.
PER_AMDGPU_FUNCTION(stream_wait_event,
                    hipStreamWaitEvent,
                    void *,
                    void *,
                    uint32);
Maybe it is time to increase linewidth of C++ code. 80 chars is painful nowadays. I would recommend either 100 or 120, as a matter of taste.
// Here we have to guarantee the result_result_buffer isn't nullptr
// It is interesting - The code following
// L60: DeviceAllocation devalloc =
//          executor->allocate_memory_on_device(
// call another kernel and it will result in
// Memory access fault by GPU node-1 (Agent handle: 0xeda5ca0) on address
// (nil). Reason: Page not present or supervisor privilege.
// if you don't allocate it.
Why did you remove this comment? Is it no longer applicable? Is it irrelevant?
…reams' into hp/streams-quadrantsic-2-amdgpu-cpu
Made-with: Cursor
Conflicts:
- quadrants/rhi/amdgpu/amdgpu_context.cpp
- quadrants/rhi/amdgpu/amdgpu_context.h
- quadrants/rhi/amdgpu/amdgpu_driver_functions.inc.h
- quadrants/runtime/amdgpu/kernel_launcher.cpp
Made-with: Cursor
…reams' into hp/streams-quadrantsic-2-amdgpu-cpu
Made-with: Cursor
Conflicts:
- quadrants/rhi/amdgpu/amdgpu_context.h
- quadrants/runtime/amdgpu/kernel_launcher.cpp
The pure-Python perf_dispatch timing test is unreliable on Mac Metal and Vulkan (MoltenVK) where timing differences between implementations are too small to consistently pick the fastest one. Made-with: Cursor
…in test_perf_dispatch.py (take base branch's narrower exclude list) Made-with: Cursor
Migrated to use a single PR on streams 4.
Coverage Report
| File | Coverage | Missing |
|---|---|---|
| 🟢 tests/python/test_streams.py | 100% | |
Diff coverage: 100% · Overall: 73% · 61 lines, 0 missing
Coverage Report
| File | Coverage | Missing |
|---|---|---|
| 🟢 tests/python/test_streams.py | 98% | 270 |
Diff coverage: 98% · Overall: 74% · 61 lines, 1 missing
Coverage Report
| File | Coverage | Missing |
|---|---|---|
| 🟢 tests/python/test_streams.py | 100% | |
Diff coverage: 100% · Overall: 74% · 61 lines, 0 missing
Streams + autodiff now throws an exception (base branch commit 360adc8), so our stream-aware adstack fixes in shared code (llvm_runtime_executor.cpp) and the AMDGPU resolve_num_threads async DtoH are unnecessary. Take the base branch version for those shared-file sections. Re-apply AMDGPU-specific non-autodiff stream work: active_stream async DtoH/HtoD in launch_llvm_kernel, context_synchronize/device_synchronize in synchronize(). Co-authored-by: Cursor <cursoragent@cursor.com>
The comment explains a non-obvious race condition: context_pointer must be freed directly (now via mem_free_async on active_stream) rather than through AMDGPUContext's deferred free list, because that list is drained by LlvmRuntimeExecutor::synchronize which can be called mid-launch. Co-authored-by: Cursor <cursoragent@cursor.com>
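The race that comment documents can be sketched with toy objects (`deferred_frees`, `mem_free_async`, and the string handles below are illustrative stand-ins, not the real executor):

```python
class Context:
    """Toy AMDGPUContext-like object with a deferred free list."""
    def __init__(self):
        self.deferred_frees = []
        self.freed = []

    def add_deferred_free(self, ptr):
        # Queue a pointer on the context's deferred free list.
        self.deferred_frees.append(ptr)

    def drain_deferred(self):
        # What LlvmRuntimeExecutor::synchronize does: it drains the
        # whole list, even if a launch still using a pointer is in flight.
        self.freed.extend(self.deferred_frees)
        self.deferred_frees.clear()

    def mem_free_async(self, ptr, stream):
        # Stream-ordered free: queued behind prior work on `stream`,
        # so a mid-launch synchronize cannot release it too early.
        self.freed.append((ptr, stream))

# Racy path: defer context_pointer, then a mid-launch synchronize
# drains the list while the kernel is still using it.
racy = Context()
racy.add_deferred_free("context_pointer")
racy.drain_deferred()  # freed out from under the running kernel
# Safe path: free directly on the active stream instead.
safe = Context()
safe.mem_free_async("context_pointer", "active_stream")
print(racy.freed, safe.freed)
```

In the racy path the pointer is released by whoever calls synchronize first; in the safe path the release is ordered after the launch on the stream itself.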
Base branch extracted stream/event methods from program.cpp into program_stream.cpp (CUDA only). Add the AMDGPU branches with make_current() calls to all nine methods in the new file. Co-authored-by: Cursor <cursoragent@cursor.com>
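The shape of those AMDGPU branches can be sketched as follows (`make_current`, `stream_create`, and the arch strings are mocked for illustration; the real code lives in program_stream.cpp):

```python
class FakeAMDGPUContext:
    """Mock context that records the order of operations."""
    def __init__(self):
        self.log = []

    def make_current(self):
        # In the real code this binds the HIP device context to the
        # calling thread before any driver call is issued.
        self.log.append("make_current")

    def stream_create(self):
        self.log.append("stream_create")
        return 1  # a nonzero stream handle

def program_stream_create(ctx, arch):
    # The pattern repeated in all nine stream/event methods: branch on
    # arch, make the context current, then issue the driver call.
    if arch == "amdgpu":
        ctx.make_current()
        return ctx.stream_create()
    return 0  # CPU backend: no-op, handle 0

ctx = FakeAMDGPUContext()
handle = program_stream_create(ctx, "amdgpu")
print(handle, ctx.log)
```

The `make_current()` call before each driver call is what the commit adds; without it, a stream created from a thread that has never touched the device would target the wrong (or no) context.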
Co-authored-by: Cursor <cursoragent@cursor.com>
…pile_config()
The base branch refactored stream/event methods from Program:: to StreamManager::, which stores the arch in arch_. Our AMDGPU branches still referenced compile_config().arch, which is a Program method.
Co-authored-by: Cursor <cursoragent@cursor.com>
…sis-Embodied-AI/gstaichi into hp/streams-quadrantsic-2-amdgpu-cpu
Co-authored-by: Cursor <cursoragent@cursor.com>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 93cd1660eb
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
}
#endif
#ifdef QD_WITH_AMDGPU
if (arch_ == Arch::amdgpu && stream_handle != 0) {
Synchronize default AMDGPU stream when handle is zero
Remove the stream_handle != 0 guard here so the default stream can be synchronized. In this API, handle 0 represents the default stream (as used by Event.wait()/record() defaults and by backend stream semantics), but this branch currently turns stream_synchronize(0) into a no-op on AMDGPU, so callers that explicitly synchronize the default stream can proceed while queued work is still running.
The stream_handle != 0 guard made synchronize_stream a no-op for the default stream on AMDGPU, unlike the CUDA path. HIP supports hipStreamSynchronize(nullptr), so remove the guard to match CUDA semantics. Co-authored-by: Cursor <cursoragent@cursor.com>
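The semantics after this fix can be sketched with a toy driver in place of HIP (handle 0 maps to the default stream, i.e. `hipStreamSynchronize(nullptr)`; `FakeHipDriver` is an illustrative mock):

```python
DEFAULT_STREAM = None  # stands in for a nullptr hipStream_t

class FakeHipDriver:
    """Toy driver recording which streams were synchronized."""
    def __init__(self):
        self.synced = []

    def stream_synchronize(self, stream):
        self.synced.append(stream)

def synchronize_stream(driver, stream_handle):
    # After the fix: no `stream_handle != 0` guard, so handle 0 falls
    # through to the driver as the default stream (nullptr), matching
    # the CUDA path's semantics instead of silently doing nothing.
    stream = DEFAULT_STREAM if stream_handle == 0 else stream_handle
    driver.stream_synchronize(stream)

drv = FakeHipDriver()
synchronize_stream(drv, 0)   # default stream: must NOT be a no-op
synchronize_stream(drv, 42)  # explicit stream handle
print(drv.synced)
```

With the old guard, the first call would never have reached the driver, so queued default-stream work could still be running when the caller proceeded.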
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3f5a868958
speedup = serial_time / stream_time
if qd.lang.impl.current_cfg().arch in (qd.cuda, qd.amdgpu):
    assert speedup > 1.5, f"Expected >1.5x speedup, got {speedup:.2f}x"
Relax fixed 1.5x stream-speedup assertion
This assertion makes the test depend on hardware/driver scheduling behavior rather than correctness: CUDA/HIP stream concurrency is workload- and device-dependent, so valid implementations can produce <1.5x speedup (for example on small or heavily occupied GPUs, or when concurrent kernels are not co-scheduled). That can cause nondeterministic CI failures even when stream/event semantics are correct; the test should assert functional ordering/correctness and use a looser or optional perf check.
Issue: #
Brief Summary
copilot:summary
Walkthrough
copilot:walkthrough