[Perf] Streams 1: Add CUDA stream and event API #407

hughperkins wants to merge 46 commits into main from
Conversation
Introduces qd.create_stream() and qd.create_event() for launching kernels on separate CUDA streams with event-based synchronization. The qd_stream kwarg on kernel calls routes the launch to a specific stream. Non-CUDA backends return no-op handles (0). Routes kernel launcher memory ops through the active stream.
- Make CUDAContext::stream_ thread_local for thread-safety
- Convert sync memcpy_host_to_device to async on active_stream
- Use weakref in Stream/Event __del__ to safely handle interpreter shutdown
- Add __enter__/__exit__ context manager support for Stream and Event
- Use consistent qd_stream parameter naming in Event.record and Event.wait
- Add handle==0 guard to stream_synchronize
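The weakref-in-`__del__` plus context-manager design described above can be sketched roughly as follows. This is a minimal illustration with stand-in names (`_Runtime`, `stream_destroy`), not the PR's actual `stream.py`:

```python
import weakref

class _Runtime:
    """Stand-in for the real Program/runtime object (hypothetical)."""
    def stream_destroy(self, handle):
        pass  # real code would release the CUDA stream here

class Stream:
    def __init__(self, runtime, handle):
        self._handle = handle
        # Hold the owner weakly: __del__ during interpreter shutdown
        # (or after the runtime is collected) must not resurrect it.
        self._prog_ref = weakref.ref(runtime)

    def destroy(self):
        if self._handle:
            prog = self._prog_ref()
            if prog is not None:
                prog.stream_destroy(self._handle)
            self._handle = 0  # safe to call destroy() more than once

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.destroy()
        return False

    def __del__(self):
        self.destroy()

rt = _Runtime()
with Stream(rt, handle=42) as s:
    assert s._handle == 42
assert s._handle == 0  # cleaned up on context exit
```

The weakref keeps `__del__` safe even when teardown order is unpredictable: if the runtime is already gone, cleanup simply skips the backend call.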
Review from Opus (written before the last commit above):

> **PR Review: Add CUDA Stream and Event API**
>
> **Summary.** This PR introduces a CUDA stream and event API to enable concurrent kernel execution on separate GPU streams.
>
> The design is clean and well-layered. On non-CUDA backends, everything degrades to no-ops (handle=0).
>
> **Issues and Concerns**
>
> 1. Thread-safety of
| """Wraps a backend-specific GPU stream for concurrent kernel execution. | ||
|
|
||
| On backends without native streams (e.g. CPU), this is a no-op object. | ||
| Call destroy() explicitly or use as a context manager to ensure cleanup. |
I would rather present it as usable only as a context manager, aligning with the torch.profiler API: managing streams manually without a context manager sounds like bad practice, and the safe pattern should be the easy one.
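A context-manager-only shape like the reviewer suggests could look roughly like this. This is a sketch with stand-in names, not the PR's API:

```python
from contextlib import contextmanager

@contextmanager
def stream():
    """Hypothetical factory: the stream handle only exists inside the
    with-block, so forgetting to destroy it is impossible by design."""
    handle = object()  # stand-in for a real driver stream handle
    try:
        yield handle
    finally:
        # Real code would call stream_destroy(handle) here; cleanup
        # runs even if the body raises.
        handle = None

with stream() as s:
    assert s is not None  # handle usable only inside the block
```

Because the factory never hands out a long-lived wrapper object, there is no manual `destroy()` to forget.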
```python
class Event:
    """Wraps a backend-specific GPU event for stream synchronization.
```
Could you clarify in the documentation what an "event" is? I have no idea what it is.
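For background (general CUDA knowledge, not specific to this PR): an event is a marker recorded at a point in a stream's work queue; synchronizing on it, or making another stream wait on it, blocks until all work enqueued before the record has completed. A rough CPU analogy using threading primitives:

```python
import threading

done = threading.Event()  # plays the role of a CUDA event
results = []

def producer():
    results.append("work queued before the record")
    done.set()            # analogous to event_record on the producer stream

def consumer():
    done.wait()           # analogous to stream_wait_event / event_synchronize
    results.append("work that runs only after the event fired")

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t2.start()
t1.start()
t1.join()
t2.join()
assert results[0] == "work queued before the record"
```

The key property in both worlds: the waiter observes everything that happened before the record, without synchronizing the whole device.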
```python
        if self._handle != 0:
            prog = impl.get_runtime().prog
            prog.event_destroy(self._handle)
            self._handle = 0

    def __del__(self):
        if self._handle != 0 and self._prog_ref is not None:
```
Personally I prefer `if self._handle:`. It is semantically clearer: whether it is an int or some more complex object does not matter much.
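One nuance with the `if self._handle:` suggestion: for integer handles the two spellings are equivalent, but they diverge if the handle can ever be `None` (a small self-contained check, not project code):

```python
# For plain int handles, truthiness and the explicit comparison agree.
for handle in (0, 7):
    assert bool(handle) == (handle != 0)

# They diverge only for non-integer sentinels such as None:
handle = None
assert (handle != 0) is True   # `!= 0` would enter the destroy branch
assert bool(handle) is False   # `if handle:` would skip it
```

So the truthiness form is actually the safer default here, since a `None` handle should not be passed to the backend.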
| .def("stream_create", &Program::stream_create) | ||
| .def("stream_destroy", &Program::stream_destroy) | ||
| .def("stream_synchronize", &Program::stream_synchronize) | ||
| .def("set_current_cuda_stream", &Program::set_current_cuda_stream) | ||
| .def("event_create", &Program::event_create) | ||
| .def("event_destroy", &Program::event_destroy) | ||
| .def("event_record", &Program::event_record) | ||
| .def("event_synchronize", &Program::event_synchronize) | ||
| .def("stream_wait_event", &Program::stream_wait_event); |
What here is CUDA-specific and what is not? Is only `set_current_cuda_stream` CUDA-specific? If so, are streams still usable on other backends, or is this function necessary to make them useful?
```cpp
// Stream management
PER_CUDA_FUNCTION(stream_create, cuStreamCreate, void **, uint32);
PER_CUDA_FUNCTION(stream_destroy, cuStreamDestroy_v2, void *);
```
What is `cuStreamDestroy_v2`? Very weird name. Why do we have functions with a `_v2` suffix in multiple places?
```diff
@@ -242,11 +242,11 @@ def fun(value: qd.types.ndarray(), offset: qd.template()):
     qd_init_same_arch(offline_cache_file_path=str(tmp_path), offline_cache=True)
     is_valid = False

-    def launch_kernel(self, key, t_kernel, compiled_kernel_data, *args):
+    def launch_kernel(self, key, t_kernel, compiled_kernel_data, *args, qd_stream=None):
         nonlocal is_valid
         is_valid = True
         assert compiled_kernel_data is not None
-        return launch_kernel_orig(self, key, t_kernel, compiled_kernel_data, *args)
+        return launch_kernel_orig(self, key, t_kernel, compiled_kernel_data, *args, qd_stream=qd_stream)
```
I would rather follow the existing pattern and move `qd_stream` before `*args`.

Moreover, I see no reason to prefix the stream with `qd`. What does it mean? This is the quadrants project, so of course it is related to quadrants. It is just a GPU stream, no? Is it just to clarify that it is a GPU compute stream, not some random stream? I don't think that is necessary: you are passing this to functions like `launch_kernel`, so of course it is about launching kernels.
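One consideration for the "move it before `*args`" suggestion: in Python, a parameter declared before `*args` captures the first positional argument, while one declared after `*args` is keyword-only and can never be swallowed by positional kernel arguments. A toy illustration (not the real launcher):

```python
def launch_after(key, *args, qd_stream=None):
    """Current placement: qd_stream is keyword-only."""
    return args, qd_stream

def launch_before(key, qd_stream=None, *args):
    """Suggested placement: qd_stream is positional."""
    return args, qd_stream

# Keyword-only: positional kernel args cannot be mistaken for the stream.
assert launch_after("k", 1, 2) == ((1, 2), None)

# Before *args: the first kernel argument silently becomes the stream.
assert launch_before("k", 1, 2) == ((2,), 1)
```

So whatever the naming decision, keeping the parameter after `*args` avoids a silent argument-shift bug.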
…c-1-cuda-streams

Made-with: Cursor

Conflicts:
- python/quadrants/lang/kernel.py
- quadrants/python/export_lang.cpp
- quadrants/rhi/cuda/cuda_context.cpp
- quadrants/runtime/cuda/kernel_launcher.cpp
Made-with: Cursor
…c-1-cuda-streams

Made-with: Cursor

Conflicts:
- quadrants/runtime/cuda/kernel_launcher.cpp
The pure-Python perf dispatch test is timing-sensitive and unreliable on the Vulkan software renderer in CI. The kernel variant of the same test still covers perf dispatch on Vulkan. Made-with: Cursor
Migrated to use a single PR on streams 4.
Force-pushed: 271f23d to f64c497, then f64c497 to cd5b486.
…c-1-cuda-streams

Conflicts:
- quadrants/rhi/cuda/cuda_context.h
**Coverage Report**

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 18 |
| 🟢 python/quadrants/lang/kernel.py | 90% | 447 |
| 🔴 python/quadrants/lang/stream.py | 40% | 1,3,6,10,17,21-22,25,30,37,39-45,47-48,50-51,54,61,65-66,69,75,81,83-84,86,93,95-101,103-104,106-107,110,117,124 |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 100% | |

Diff coverage: 80% · Overall: 73% · 238 lines, 48 missing
Streams are not compatible with reverse-mode or forward-mode differentiation. The adstack sizer and Tape replay paths assume the default stream; rather than fixing every race, block the combination at the Python entry point with a clear error message. Co-authored-by: Cursor <cursoragent@cursor.com>
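The "block the combination at the entry point" approach described above can be sketched as follows (hypothetical names; the real check lives in the kernel-call path):

```python
def launch(kernel, *args, qd_stream=None, autodiff_active=False):
    # Reject the unsupported combination up front with a clear message,
    # instead of letting the Tape replay race on a non-default stream.
    if qd_stream is not None and autodiff_active:
        raise RuntimeError(
            "qd_stream is not supported inside autodiff (reverse or "
            "forward mode); launch this kernel on the default stream.")
    return kernel(*args)

assert launch(lambda x: x + 1, 41) == 42

err = None
try:
    launch(lambda x: x, 0, qd_stream=object(), autodiff_active=True)
except RuntimeError as e:
    err = e
assert err is not None and "autodiff" in str(err)
```

Failing loudly at the Python boundary is cheaper and safer than auditing every adstack/Tape code path for stream races.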
Autodiff+streams is now blocked at the Python level, so the adstack code path never runs on a non-default stream. Remove the unnecessary stream_synchronize we added in publish_adstack_metadata. Co-authored-by: Cursor <cursoragent@cursor.com>
@claude review
Our branch had a stale copy of publish_adstack_metadata and ensure_adstack_heap that conflicted with upstream's refactor into ensure_adstack_heap_float / ensure_adstack_heap_int. Since autodiff is now blocked with streams at the Python level, we have no changes to make in this file. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
**Coverage Report**

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 19 |
| 🟢 python/quadrants/lang/kernel.py | 89% | 456,599 |
| 🔴 python/quadrants/lang/stream.py | 48% | 1,3,6,10,17,21-22,25,30,40,42-48,50,53,57,64,68-69,72,78,84,89,99,101-107,109,112,116,123,130 |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 98% | 252,268,283-284 |

Diff coverage: 85% · Overall: 67% · 316 lines, 47 missing
Move the 9 CUDA-only stream/event Program methods into a dedicated translation unit. The CMake glob on quadrants/program/* picks up the new file automatically. Co-authored-by: Cursor <cursoragent@cursor.com>
Move the CUDA stream/event logic into a StreamManager class (program_stream.h/.cpp). Program keeps its public API unchanged and delegates to stream_manager_ internally, so the pybind layer and Python code need no changes. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Add stream_manager() accessor to Program. Update export_lang.cpp to call StreamManager methods through lambdas. Delete the 9 one-line delegation methods from Program — the declarations in program.h and definitions in program_stream.cpp are both gone. Co-authored-by: Cursor <cursoragent@cursor.com>
@codex review
💡 **Codex Review**

Here are some automated review suggestions for this pull request.

Reviewed commit: b1b4ee60b2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"

If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
if self._handle != 0 and self._prog_ref is not None:
    prog = impl.get_runtime().prog
    prog.stream_destroy(self._handle)
```
**Destroy streams via their owning Program**

`Stream.destroy()` uses `impl.get_runtime().prog` instead of the stream's captured program reference, so after `qd.reset()` (or any runtime swap) it can raise `_prog attribute not initialized` or call into the wrong backend, leaving the original CUDA stream undeleted; this breaks the method's "safe to call multiple times" contract and can leak stream resources across resets. The same ownership bug pattern also exists in `Event.destroy()`.
**Coverage Report**

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 19 |
| 🟢 python/quadrants/lang/kernel.py | 89% | 456,599 |
| 🔴 python/quadrants/lang/stream.py | 48% | 1,3,6,10,17,21-22,25,30,40,42-48,50,53,57,64,68-69,72,78,84,89,99,101-107,109,112,116,123,130 |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 98% | 252,268,283-284 |

Diff coverage: 85% · Overall: 74% · 316 lines, 47 missing
Co-authored-by: Cursor <cursoragent@cursor.com>
All methods on Stream and Event now resolve the Program through the captured weakref first, falling back to the current runtime only for externally-wrapped handles. Fixes a bug where destroy/synchronize/record/wait would call into the wrong Program after qd.reset(). Co-authored-by: Cursor <cursoragent@cursor.com>
@codex review
💡 **Codex Review**

Reviewed commit: 614c742cd9
```cpp
if (arch_ == Arch::cuda && stream_handle != 0) {
  CUDAContext::get_instance().make_current();
  CUDADriver::get_instance().stream_synchronize(
      reinterpret_cast<void *>(stream_handle));
```
**Synchronize default-stream wrappers instead of no-op**

`synchronize_stream()` currently skips handle 0, but the CUDA launcher elsewhere treats `nullptr` as a valid "default stream" handle (e.g. sync paths call `stream_synchronize(nullptr)`). This makes `Stream(0).synchronize()` a no-op on CUDA, so wrappers around external/default streams (such as interop stream pointers) can return before queued work is finished.
```python
prog = self._prog_ref()
if prog is not None:
    return prog
return impl.get_runtime().prog
```
**Avoid rebinding stale stream handles to a new runtime**

When a stream was created from an older program instance, `_prog_ref()` can expire after `qd.reset()`. `_prog()` then falls back to `impl.get_runtime().prog`, so later `destroy()`/`synchronize()` calls may execute against a different program/context using a stale handle, which can trigger invalid-handle behavior instead of safe cleanup. This should either no-op or raise once the owning program is gone.
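The behavior proposed here (raise on active operations, no-op on destroy once the owner is gone) can be sketched with a plain weakref. Stand-in classes, not the PR's code:

```python
import gc
import weakref

class Program:
    def stream_synchronize(self, handle):
        pass  # real code would call into the CUDA driver

class Stream:
    def __init__(self, prog, handle):
        self._handle = handle
        self._prog_ref = weakref.ref(prog)

    def _prog(self):
        # Return None once the owning Program is collected, instead of
        # rebinding the stale handle to whatever runtime is current.
        return self._prog_ref()

    def synchronize(self):
        prog = self._prog()
        if prog is None:
            raise RuntimeError("owning Program is gone")
        prog.stream_synchronize(self._handle)

    def destroy(self):
        prog = self._prog()
        if prog is not None and self._handle:
            pass  # real code: prog.stream_destroy(self._handle)
        self._handle = 0  # silently no-op when the owner is gone

prog = Program()
s = Stream(prog, handle=7)
s.synchronize()            # fine while the owner is alive
del prog
gc.collect()
try:
    s.synchronize()
    raised = False
except RuntimeError:
    raised = True
assert raised              # active ops fail loudly on a dead owner
s.destroy()                # destroy quietly drops the handle
assert s._handle == 0
```

This matches the fix described in the commit below: active operations raise, cleanup degrades to a safe no-op.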
When _prog_ref is set but the weakref has expired (Program destroyed), _prog() now returns None instead of falling back to the current runtime. Active operations (synchronize, record, wait) raise RuntimeError; destroy silently no-ops and zeroes the handle. Also allow synchronize_stream(0) to sync the default stream in CUDA, matching cuStreamSynchronize(nullptr) semantics. Co-authored-by: Cursor <cursoragent@cursor.com>
Move the 9 stream/event .def() bindings from export_lang.cpp into a new export_stream.cpp, following the existing export_math/export_misc pattern. Satisfies the feature-factorization check for the 1225-line export_lang.cpp. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
**Coverage Report**

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 19 |
| 🟢 python/quadrants/lang/kernel.py | 89% | 456,599 |
| 🔴 python/quadrants/lang/stream.py | 48% | 1,3,6,10,17,21-22,25,29,31,35,38,50,52-58,60,63,67,74,78-79,82,86,88,91,94,99,104,108,120,122-128,130,133,137,144,151 |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 98% | 252,268,283-284 |

Diff coverage: 84% · Overall: 74% · 330 lines, 54 missing
@codex review
💡 **Codex Review**

Reviewed commit: 3b09331daf
```python
prog = self._prog()
if prog is not None:
    prog.stream_destroy(self._handle)
self._handle = 0
```
**Destroy orphaned CUDA streams instead of dropping the handle**

When `qd.reset()` (or any Program teardown) happens before `Stream.destroy()`, `_prog_ref()` can be None; this branch skips `prog.stream_destroy(...)` but still zeroes `_handle`, so the underlying CUDA stream is never freed and cannot be reclaimed later. Because the CUDA context is shared across Programs, repeated create/reset cycles can leak stream resources over time. Use the current runtime Program as a fallback destroy path for owned streams (`_prog_ref is not None`) rather than silently dropping the handle.
```python
prog = self._prog()
if prog is not None:
    prog.event_destroy(self._handle)
self._handle = 0
```
**Destroy orphaned CUDA events instead of dropping the handle**

The same ownership gap exists for `Event.destroy()`: after a reset, `_prog_ref()` may be dead, so `event_destroy` is skipped but `_handle` is cleared. That loses the only reference without releasing the underlying CUDA event, which accumulates across reset-heavy workflows and cannot be cleaned up afterward. For owned events (`_prog_ref is not None`), fall back to the current runtime Program to destroy the event when the original Program object is gone.
When the owning Program has been collected (e.g. after qd.reset()), destroy() and __del__ now fall back to the current runtime's Program to free the underlying CUDA resource. This is safe because CUDAContext is a singleton, so stream/event handles remain valid across Programs. Prevents resource leaks in create/reset cycles. Co-authored-by: Cursor <cursoragent@cursor.com>
@codex review
💡 **Codex Review**

Reviewed commit: 48c3922aca
```cpp
CUDADriver::get_instance().memcpy_host_to_device_async(
    (void *)device_ptrs[data_ptr_idx], data_ptr, arr_sz, active_stream);
```
**Serialize temp-buffer allocs with custom stream copies**

This copy is now enqueued on `active_stream`, but temporary buffers for external host arrays are still allocated/freed through allocator paths that use stream `nullptr` (the default stream). Because `create_stream()` creates `CU_STREAM_NON_BLOCKING` streams, default-stream alloc/free operations are not implicitly ordered with this stream, so a kernel call with `qd_stream=` and NumPy/PyTorch host arrays can hit use-before-allocation/use-after-free races (illegal address/invalid value) on memory-pool-enabled CUDA devices. The temp allocation/copy/free sequence needs to share one stream or be linked with explicit events.
```python
prog = self._prog()
if prog is None:
    try:
        return impl.get_runtime().prog
```
**Destroy stream/event with original program backend**

When the owning program is gone (e.g. after `qd.reset()`), cleanup falls back to `impl.get_runtime().prog`. If the runtime was reinitialized on a non-CUDA arch, later `destroy()`/`__del__` calls route through a non-CUDA Program and become no-ops for nonzero CUDA handles, leaking CUDA streams/events in long-lived processes that reconfigure backends. Cleanup should preserve and use the original CUDA-capable program/backend for handle destruction.
**Coverage Report**

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 19 |
| 🟢 python/quadrants/lang/kernel.py | 89% | 456,599 |
| 🔴 python/quadrants/lang/stream.py | 47% | 1,3,6,10,17,21-22,25,29,31,35,38,46-49,52,63,65-71,73,76,80,87,91-92,95,99,101,104,107,112,117,121,129-132,135,146,148-154,156,159,163,170,177 |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 98% | 252,268,283-284 |

Diff coverage: 82% · Overall: 74% · 346 lines, 64 missing
Co-authored-by: Cursor <cursoragent@cursor.com>
**Coverage Report**

| File | Coverage | Missing |
|---|---|---|
| 🔴 python/quadrants/lang/__init__.py | 0% | 19 |
| 🟢 python/quadrants/lang/_fast_caching/args_hasher.py | 100% | |
| 🟢 python/quadrants/lang/_fast_caching/src_hasher.py | 100% | |
| 🟢 python/quadrants/lang/kernel.py | 89% | 456,599 |
| 🔴 python/quadrants/lang/stream.py | 47% | 1,3,6,10,17,21-22,25,29,31,35,38,46-49,52,63,65-71,73,76,80,87,91-92,95,99,101,104,107,112,117,121,129-132,135,146,148-154,156,159,163,170,177 |
| 🟢 tests/python/quadrants/lang/fast_caching/test_args_hasher.py | 100% | |
| 🟢 tests/python/test_cache.py | 100% | |
| 🟢 tests/python/test_perf_dispatch.py | 100% | |
| 🟢 tests/python/test_streams.py | 98% | 252,268,283-284 |

Diff coverage: 82% · Overall: 74% · 357 lines, 64 missing
Lines of code added: +481 - 197 - 4 - 4 = +276
Issue: #
Brief Summary
copilot:summary
Walkthrough
copilot:walkthrough