[Perf] Streams 1: Add CUDA stream and event API #407

hughperkins wants to merge 46 commits into main from
Conversation
Introduces qd.create_stream() and qd.create_event() for launching kernels on separate CUDA streams with event-based synchronization. The qd_stream kwarg on kernel calls routes the launch to a specific stream. Non-CUDA backends return no-op handles (0). Routes kernel launcher memory ops through the active stream.
- Make CUDAContext::stream_ thread_local for thread-safety
- Convert sync memcpy_host_to_device to async on active_stream
- Use weakref in Stream/Event __del__ to safely handle interpreter shutdown
- Add __enter__/__exit__ context manager support for Stream and Event
- Use consistent qd_stream parameter naming in Event.record and Event.wait
- Add handle==0 guard to stream_synchronize
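The weakref-in-`__del__` plus context-manager design described above can be sketched roughly as follows. This is a minimal illustration with stand-in names (`_Runtime`, `stream_destroy`), not the PR's actual `stream.py`:

```python
import weakref

class _Runtime:
    """Stand-in for the real Program/runtime object (hypothetical)."""
    def stream_destroy(self, handle):
        pass  # real code would release the CUDA stream here

class Stream:
    def __init__(self, runtime, handle):
        self._handle = handle
        # Hold the owner weakly: __del__ during interpreter shutdown
        # (or after the runtime is collected) must not resurrect it.
        self._prog_ref = weakref.ref(runtime)

    def destroy(self):
        if self._handle:
            prog = self._prog_ref()
            if prog is not None:
                prog.stream_destroy(self._handle)
            self._handle = 0  # safe to call destroy() more than once

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.destroy()
        return False

    def __del__(self):
        self.destroy()

rt = _Runtime()
with Stream(rt, handle=42) as s:
    assert s._handle == 42
assert s._handle == 0  # cleaned up on context exit
```

The weakref keeps `__del__` safe even when teardown order is unpredictable: if the runtime is already gone, cleanup simply skips the backend call.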
Review from Opus (written before the last commit above):

> **PR Review: Add CUDA Stream and Event API**
>
> **Summary.** This PR introduces a CUDA stream and event API to enable concurrent kernel execution on separate GPU streams.
>
> The design is clean and well-layered. On non-CUDA backends, everything degrades to no-ops (handle=0).
>
> **Issues and Concerns**
>
> 1. Thread-safety of
| """Wraps a backend-specific GPU stream for concurrent kernel execution. | ||
|
|
||
| On backends without native streams (e.g. CPU), this is a no-op object. | ||
| Call destroy() explicitly or use as a context manager to ensure cleanup. |
I would rather present it as usable only as a context manager, aligning with the torch.profiler API: managing streams manually without a context manager sounds like bad practice, and the safe pattern should be the easy one.
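A context-manager-only shape like the reviewer suggests could look roughly like this. This is a sketch with stand-in names, not the PR's API:

```python
from contextlib import contextmanager

@contextmanager
def stream():
    """Hypothetical factory: the stream handle only exists inside the
    with-block, so forgetting to destroy it is impossible by design."""
    handle = object()  # stand-in for a real driver stream handle
    try:
        yield handle
    finally:
        # Real code would call stream_destroy(handle) here; cleanup
        # runs even if the body raises.
        handle = None

with stream() as s:
    assert s is not None  # handle usable only inside the block
```

Because the factory never hands out a long-lived wrapper object, there is no manual `destroy()` to forget.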
```python
class Event:
    """Wraps a backend-specific GPU event for stream synchronization.
```
Could you clarify in the documentation what an "event" is? I have no idea what it is.
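For background (general CUDA knowledge, not specific to this PR): an event is a marker recorded at a point in a stream's work queue; synchronizing on it, or making another stream wait on it, blocks until all work enqueued before the record has completed. A rough CPU analogy using threading primitives:

```python
import threading

done = threading.Event()  # plays the role of a CUDA event
results = []

def producer():
    results.append("work queued before the record")
    done.set()            # analogous to event_record on the producer stream

def consumer():
    done.wait()           # analogous to stream_wait_event / event_synchronize
    results.append("work that runs only after the event fired")

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t2.start()
t1.start()
t1.join()
t2.join()
assert results[0] == "work queued before the record"
```

The key property in both worlds: the waiter observes everything that happened before the record, without synchronizing the whole device.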
```python
        if self._handle != 0:
            prog = impl.get_runtime().prog
            prog.event_destroy(self._handle)
            self._handle = 0

    def __del__(self):
        if self._handle != 0 and self._prog_ref is not None:
```
Personally I prefer `if self._handle:`. It is semantically clearer: whether it is an int or some more complex object does not matter much.
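One nuance with the `if self._handle:` suggestion: for integer handles the two spellings are equivalent, but they diverge if the handle can ever be `None` (a small self-contained check, not project code):

```python
# For plain int handles, truthiness and the explicit comparison agree.
for handle in (0, 7):
    assert bool(handle) == (handle != 0)

# They diverge only for non-integer sentinels such as None:
handle = None
assert (handle != 0) is True   # `!= 0` would enter the destroy branch
assert bool(handle) is False   # `if handle:` would skip it
```

So the truthiness form is actually the safer default here, since a `None` handle should not be passed to the backend.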
| .def("stream_create", &Program::stream_create) | ||
| .def("stream_destroy", &Program::stream_destroy) | ||
| .def("stream_synchronize", &Program::stream_synchronize) | ||
| .def("set_current_cuda_stream", &Program::set_current_cuda_stream) | ||
| .def("event_create", &Program::event_create) | ||
| .def("event_destroy", &Program::event_destroy) | ||
| .def("event_record", &Program::event_record) | ||
| .def("event_synchronize", &Program::event_synchronize) | ||
| .def("stream_wait_event", &Program::stream_wait_event); |
What here is CUDA-specific and what is not? Is only `set_current_cuda_stream` CUDA-specific? If so, are streams still usable on other backends, or is this function necessary to make them useful?
```cpp
// Stream management
PER_CUDA_FUNCTION(stream_create, cuStreamCreate, void **, uint32);
PER_CUDA_FUNCTION(stream_destroy, cuStreamDestroy_v2, void *);
```
What is `cuStreamDestroy_v2`? Very weird name. Why do we have functions with a `_v2` suffix in multiple places?
```diff
@@ -242,11 +242,11 @@ def fun(value: qd.types.ndarray(), offset: qd.template()):
     qd_init_same_arch(offline_cache_file_path=str(tmp_path), offline_cache=True)
     is_valid = False

-    def launch_kernel(self, key, t_kernel, compiled_kernel_data, *args):
+    def launch_kernel(self, key, t_kernel, compiled_kernel_data, *args, qd_stream=None):
         nonlocal is_valid
         is_valid = True
         assert compiled_kernel_data is not None
-        return launch_kernel_orig(self, key, t_kernel, compiled_kernel_data, *args)
+        return launch_kernel_orig(self, key, t_kernel, compiled_kernel_data, *args, qd_stream=qd_stream)
```
I would rather follow the existing pattern and move `qd_stream` before `*args`.

Moreover, I see no reason to prefix the stream with `qd`. What does it mean? This is the quadrants project, so of course it is related to quadrants. It is just a GPU stream, no? Is it just to clarify that it is a GPU compute stream, not some random stream? I don't think that is necessary: you are passing this to functions like `launch_kernel`, so of course it is about launching kernels.
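One consideration for the "move it before `*args`" suggestion: in Python, a parameter declared before `*args` captures the first positional argument, while one declared after `*args` is keyword-only and can never be swallowed by positional kernel arguments. A toy illustration (not the real launcher):

```python
def launch_after(key, *args, qd_stream=None):
    """Current placement: qd_stream is keyword-only."""
    return args, qd_stream

def launch_before(key, qd_stream=None, *args):
    """Suggested placement: qd_stream is positional."""
    return args, qd_stream

# Keyword-only: positional kernel args cannot be mistaken for the stream.
assert launch_after("k", 1, 2) == ((1, 2), None)

# Before *args: the first kernel argument silently becomes the stream.
assert launch_before("k", 1, 2) == ((2,), 1)
```

So whatever the naming decision, keeping the parameter after `*args` avoids a silent argument-shift bug.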
…c-1-cuda-streams

Made-with: Cursor

Conflicts:
- python/quadrants/lang/kernel.py
- quadrants/python/export_lang.cpp
- quadrants/rhi/cuda/cuda_context.cpp
- quadrants/runtime/cuda/kernel_launcher.cpp
Made-with: Cursor
…c-1-cuda-streams

Made-with: Cursor

Conflicts:
- quadrants/runtime/cuda/kernel_launcher.cpp
The pure-Python perf dispatch test is timing-sensitive and unreliable on the Vulkan software renderer in CI. The kernel variant of the same test still covers perf dispatch on Vulkan. Made-with: Cursor
Migrated to use a single PR on streams 4.
Force-pushed: 271f23d to f64c497, then f64c497 to cd5b486.
…c-1-cuda-streams

Conflicts:
- quadrants/rhi/cuda/cuda_context.h
**Coverage Report**

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 18 |
| 🟢 python/quadrants/lang/kernel.py | 90% | 447 |
| 🔴 python/quadrants/lang/stream.py | 40% | 1,3,6,10,17,21-22,25,30,37,39-45,47-48,50-51,54,61,65-66,69,75,81,83-84,86,93,95-101,103-104,106-107,110,117,124 |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 100% | |

Diff coverage: 80% · Overall: 73% · 238 lines, 48 missing
Streams are not compatible with reverse-mode or forward-mode differentiation. The adstack sizer and Tape replay paths assume the default stream; rather than fixing every race, block the combination at the Python entry point with a clear error message. Co-authored-by: Cursor <cursoragent@cursor.com>
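The "block the combination at the entry point" approach described above can be sketched as follows (hypothetical names; the real check lives in the kernel-call path):

```python
def launch(kernel, *args, qd_stream=None, autodiff_active=False):
    # Reject the unsupported combination up front with a clear message,
    # instead of letting the Tape replay race on a non-default stream.
    if qd_stream is not None and autodiff_active:
        raise RuntimeError(
            "qd_stream is not supported inside autodiff (reverse or "
            "forward mode); launch this kernel on the default stream.")
    return kernel(*args)

assert launch(lambda x: x + 1, 41) == 42

err = None
try:
    launch(lambda x: x, 0, qd_stream=object(), autodiff_active=True)
except RuntimeError as e:
    err = e
assert err is not None and "autodiff" in str(err)
```

Failing loudly at the Python boundary is cheaper and safer than auditing every adstack/Tape code path for stream races.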
Autodiff+streams is now blocked at the Python level, so the adstack code path never runs on a non-default stream. Remove the unnecessary stream_synchronize we added in publish_adstack_metadata. Co-authored-by: Cursor <cursoragent@cursor.com>
@claude review
Our branch had a stale copy of publish_adstack_metadata and ensure_adstack_heap that conflicted with upstream's refactor into ensure_adstack_heap_float / ensure_adstack_heap_int. Since autodiff is now blocked with streams at the Python level, we have no changes to make in this file. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
**Coverage Report**

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 19 |
| 🟢 python/quadrants/lang/kernel.py | 89% | 456,599 |
| 🔴 python/quadrants/lang/stream.py | 48% | 1,3,6,10,17,21-22,25,30,40,42-48,50,53,57,64,68-69,72,78,84,89,99,101-107,109,112,116,123,130 |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 98% | 252,268,283-284 |

Diff coverage: 85% · Overall: 67% · 316 lines, 47 missing
Move the 9 CUDA-only stream/event Program methods into a dedicated translation unit. The CMake glob on quadrants/program/* picks up the new file automatically. Co-authored-by: Cursor <cursoragent@cursor.com>
Move the CUDA stream/event logic into a StreamManager class (program_stream.h/.cpp). Program keeps its public API unchanged and delegates to stream_manager_ internally, so the pybind layer and Python code need no changes. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Add stream_manager() accessor to Program. Update export_lang.cpp to call StreamManager methods through lambdas. Delete the 9 one-line delegation methods from Program — the declarations in program.h and definitions in program_stream.cpp are both gone. Co-authored-by: Cursor <cursoragent@cursor.com>
@codex review
💡 **Codex Review**

Here are some automated review suggestions for this pull request.

Reviewed commit: b1b4ee60b2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"

If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
if self._handle != 0 and self._prog_ref is not None:
    prog = impl.get_runtime().prog
    prog.stream_destroy(self._handle)
```
**Destroy streams via their owning Program**

`Stream.destroy()` uses `impl.get_runtime().prog` instead of the stream's captured program reference, so after `qd.reset()` (or any runtime swap) it can raise `_prog attribute not initialized` or call into the wrong backend, leaving the original CUDA stream undeleted; this breaks the method's "safe to call multiple times" contract and can leak stream resources across resets. The same ownership bug pattern also exists in `Event.destroy()`.
**Coverage Report**

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 19 |
| 🟢 python/quadrants/lang/kernel.py | 89% | 456,599 |
| 🔴 python/quadrants/lang/stream.py | 48% | 1,3,6,10,17,21-22,25,30,40,42-48,50,53,57,64,68-69,72,78,84,89,99,101-107,109,112,116,123,130 |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 98% | 252,268,283-284 |

Diff coverage: 85% · Overall: 74% · 316 lines, 47 missing
Co-authored-by: Cursor <cursoragent@cursor.com>
All methods on Stream and Event now resolve the Program through the captured weakref first, falling back to the current runtime only for externally-wrapped handles. Fixes a bug where destroy/synchronize/record/wait would call into the wrong Program after qd.reset(). Co-authored-by: Cursor <cursoragent@cursor.com>
@codex review
💡 **Codex Review**

Reviewed commit: 614c742cd9
```cpp
if (arch_ == Arch::cuda && stream_handle != 0) {
  CUDAContext::get_instance().make_current();
  CUDADriver::get_instance().stream_synchronize(
      reinterpret_cast<void *>(stream_handle));
```
**Synchronize default-stream wrappers instead of no-op**

`synchronize_stream()` currently skips handle 0, but the CUDA launcher elsewhere treats `nullptr` as a valid "default stream" handle (e.g. sync paths call `stream_synchronize(nullptr)`). This makes `Stream(0).synchronize()` a no-op on CUDA, so wrappers around external/default streams (such as interop stream pointers) can return before queued work is finished.
```python
prog = self._prog_ref()
if prog is not None:
    return prog
return impl.get_runtime().prog
```
**Avoid rebinding stale stream handles to a new runtime**

When a stream was created from an older program instance, `_prog_ref()` can expire after `qd.reset()`. `_prog()` then falls back to `impl.get_runtime().prog`, so later `destroy()`/`synchronize()` calls may execute against a different program/context using a stale handle, which can trigger invalid-handle behavior instead of safe cleanup. This should either no-op or raise once the owning program is gone.
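The behavior proposed here (raise on active operations, no-op on destroy once the owner is gone) can be sketched with a plain weakref. Stand-in classes, not the PR's code:

```python
import gc
import weakref

class Program:
    def stream_synchronize(self, handle):
        pass  # real code would call into the CUDA driver

class Stream:
    def __init__(self, prog, handle):
        self._handle = handle
        self._prog_ref = weakref.ref(prog)

    def _prog(self):
        # Return None once the owning Program is collected, instead of
        # rebinding the stale handle to whatever runtime is current.
        return self._prog_ref()

    def synchronize(self):
        prog = self._prog()
        if prog is None:
            raise RuntimeError("owning Program is gone")
        prog.stream_synchronize(self._handle)

    def destroy(self):
        prog = self._prog()
        if prog is not None and self._handle:
            pass  # real code: prog.stream_destroy(self._handle)
        self._handle = 0  # silently no-op when the owner is gone

prog = Program()
s = Stream(prog, handle=7)
s.synchronize()            # fine while the owner is alive
del prog
gc.collect()
try:
    s.synchronize()
    raised = False
except RuntimeError:
    raised = True
assert raised              # active ops fail loudly on a dead owner
s.destroy()                # destroy quietly drops the handle
assert s._handle == 0
```

This matches the fix described in the commit below: active operations raise, cleanup degrades to a safe no-op.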
When _prog_ref is set but the weakref has expired (Program destroyed), _prog() now returns None instead of falling back to the current runtime. Active operations (synchronize, record, wait) raise RuntimeError; destroy silently no-ops and zeroes the handle. Also allow synchronize_stream(0) to sync the default stream in CUDA, matching cuStreamSynchronize(nullptr) semantics. Co-authored-by: Cursor <cursoragent@cursor.com>
Move the 9 stream/event .def() bindings from export_lang.cpp into a new export_stream.cpp, following the existing export_math/export_misc pattern. Satisfies the feature-factorization check for the 1225-line export_lang.cpp. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
**Coverage Report**

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 19 |
| 🟢 python/quadrants/lang/kernel.py | 89% | 456,599 |
| 🔴 python/quadrants/lang/stream.py | 48% | 1,3,6,10,17,21-22,25,29,31,35,38,50,52-58,60,63,67,74,78-79,82,86,88,91,94,99,104,108,120,122-128,130,133,137,144,151 |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 98% | 252,268,283-284 |

Diff coverage: 84% · Overall: 74% · 330 lines, 54 missing
@codex review
💡 **Codex Review**

Reviewed commit: 3b09331daf
```python
prog = self._prog()
if prog is not None:
    prog.stream_destroy(self._handle)
self._handle = 0
```
**Destroy orphaned CUDA streams instead of dropping the handle**

When `qd.reset()` (or any Program teardown) happens before `Stream.destroy()`, `_prog_ref()` can be None; this branch skips `prog.stream_destroy(...)` but still zeroes `_handle`, so the underlying CUDA stream is never freed and cannot be reclaimed later. Because the CUDA context is shared across Programs, repeated create/reset cycles can leak stream resources over time. Use the current runtime Program as a fallback destroy path for owned streams (`_prog_ref is not None`) rather than silently dropping the handle.
```python
prog = self._prog()
if prog is not None:
    prog.event_destroy(self._handle)
self._handle = 0
```
**Destroy orphaned CUDA events instead of dropping the handle**

The same ownership gap exists for `Event.destroy()`: after a reset, `_prog_ref()` may be dead, so `event_destroy` is skipped but `_handle` is cleared. That loses the only reference without releasing the underlying CUDA event, which accumulates across reset-heavy workflows and cannot be cleaned up afterward. For owned events (`_prog_ref is not None`), fall back to the current runtime Program to destroy the event when the original Program object is gone.
When the owning Program has been collected (e.g. after qd.reset()), destroy() and __del__ now fall back to the current runtime's Program to free the underlying CUDA resource. This is safe because CUDAContext is a singleton, so stream/event handles remain valid across Programs. Prevents resource leaks in create/reset cycles. Co-authored-by: Cursor <cursoragent@cursor.com>
@codex review
💡 **Codex Review**

Reviewed commit: 48c3922aca
```cpp
CUDADriver::get_instance().memcpy_host_to_device_async(
    (void *)device_ptrs[data_ptr_idx], data_ptr, arr_sz, active_stream);
```
**Serialize temp-buffer allocs with custom stream copies**

This copy is now enqueued on `active_stream`, but temporary buffers for external host arrays are still allocated/freed through allocator paths that use stream `nullptr` (the default stream). Because `create_stream()` creates `CU_STREAM_NON_BLOCKING` streams, default-stream alloc/free operations are not implicitly ordered with this stream, so a kernel call with `qd_stream=` and NumPy/PyTorch host arrays can hit use-before-allocation/use-after-free races (illegal address/invalid value) on memory-pool-enabled CUDA devices. The temp allocation/copy/free sequence needs to share one stream or be linked with explicit events.
```python
prog = self._prog()
if prog is None:
    try:
        return impl.get_runtime().prog
```
**Destroy stream/event with original program backend**

When the owning program is gone (e.g. after `qd.reset()`), cleanup falls back to `impl.get_runtime().prog`. If the runtime was reinitialized on a non-CUDA arch, later `destroy()`/`__del__` calls route through a non-CUDA Program and become no-ops for nonzero CUDA handles, leaking CUDA streams/events in long-lived processes that reconfigure backends. Cleanup should preserve and use the original CUDA-capable program/backend for handle destruction.
**Coverage Report**

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 19 |
| 🟢 python/quadrants/lang/kernel.py | 89% | 456,599 |
| 🔴 python/quadrants/lang/stream.py | 47% | 1,3,6,10,17,21-22,25,29,31,35,38,46-49,52,63,65-71,73,76,80,87,91-92,95,99,101,104,107,112,117,121,129-132,135,146,148-154,156,159,163,170,177 |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 98% | 252,268,283-284 |

Diff coverage: 82% · Overall: 74% · 346 lines, 64 missing
Co-authored-by: Cursor <cursoragent@cursor.com>
**Coverage Report**

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 19 |
| 🟢 python/quadrants/lang/_fast_caching/args_hasher.py | 100% | |
| 🟢 python/quadrants/lang/_fast_caching/src_hasher.py | 100% | |
| 🟢 python/quadrants/lang/kernel.py | 89% | 456,599 |
| 🔴 python/quadrants/lang/stream.py | 47% | 1,3,6,10,17,21-22,25,29,31,35,38,46-49,52,63,65-71,73,76,80,87,91-92,95,99,101,104,107,112,117,121,129-132,135,146,148-154,156,159,163,170,177 |
| 🟢 tests/python/quadrants/lang/fast_caching/test_args_hasher.py | 100% | |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 98% | 252,268,283-284 |

Diff coverage: 82% · Overall: 74% · 357 lines, 64 missing
Lines of code added: +481 - 197 - 4 - 4 = +276
Issue: #
Brief Summary
copilot:summary
Walkthrough
copilot:walkthrough