[Perf] Streams 1: Add CUDA stream and event API#407

Open
hughperkins wants to merge 46 commits into main from hp/streams-quadrantsic-1-cuda-streams
Conversation

@hughperkins hughperkins commented Mar 11, 2026

Introduces qd.create_stream() and qd.create_event() for launching kernels on separate CUDA streams with event-based synchronization. The qd_stream kwarg on kernel calls routes the launch to a specific stream. Non-CUDA backends return no-op handles (0). Routes kernel launcher memory ops through the active stream.

Lines of code added: +481 - 197 - 4 - 4 = +276

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

@hughperkins hughperkins marked this pull request as draft March 11, 2026 23:52
- Make CUDAContext::stream_ thread_local for thread-safety
- Convert sync memcpy_host_to_device to async on active_stream
- Use weakref in Stream/Event __del__ to safely handle interpreter shutdown
- Add __enter__/__exit__ context manager support for Stream and Event
- Use consistent qd_stream parameter naming in Event.record and Event.wait
- Add handle==0 guard to stream_synchronize
@hughperkins Collaborator Author

Review from Opus (written before the last commit above):

PR Review: Add CUDA Stream and Event API

Branch: hp/streams-quadrantsic-1-cuda-streams
Commit: ab15b1b82 — "Add CUDA stream and event API for concurrent kernel execution"
Files changed: 11 (+443, -16)


Summary

This PR introduces a CUDA stream and event API to enable concurrent kernel execution on separate GPU streams. It adds:

  • Python API (stream.py): Stream and Event wrapper classes with create_stream() / create_event() factory functions
  • C++ backend (program.cpp/h): 9 new Program methods wrapping CUDA driver calls for stream/event lifecycle
  • Kernel launch integration (kernel.py): A qd_stream= kwarg on kernel calls that sets the active CUDA stream around launch
  • Kernel launcher fix (kernel_launcher.cpp): Replaces hardcoded nullptr stream with CUDAContext::get_instance().get_stream() so that async memory ops respect the active stream
  • pybind11 exports and tests (197 lines of test coverage)

The design is clean and well-layered. On non-CUDA backends, everything degrades to no-ops (handle=0).


Issues and Concerns

1. Thread-safety of CUDAContext::stream_ (High)

CUDAContext is a singleton. The set_stream / get_stream methods read/write a bare void *stream_ with no synchronization:

// quadrants/rhi/cuda/cuda_context.h:116-118
void set_stream(void *stream) {
    stream_ = stream;
}

The context already has a std::mutex lock_ (line 27), but it is not used here. The kernel launch path in kernel.py does set_current_cuda_stream → launch_kernel → set_current_cuda_stream(0), which is not atomic. If two Python threads launch kernels on different streams, they'll race on stream_, and a kernel could be launched on the wrong stream.

Suggestion: Either protect stream_ with the existing mutex, use thread_local storage for the active stream, or document that concurrent multi-threaded kernel launches with different streams are unsupported.
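The thread_local option can be sketched on the Python side with `threading.local`; the class and method names below are hypothetical stand-ins for the C++ `CUDAContext`, not the PR's actual code:

```python
import threading

class StreamContextSketch:
    """Hypothetical stand-in for CUDAContext: the active stream handle
    lives in thread-local storage, so concurrent launches on different
    streams cannot race on a shared member."""
    _tls = threading.local()

    def set_stream(self, handle):
        self._tls.handle = handle

    def get_stream(self):
        # Threads that never set a stream see the default stream (0).
        return getattr(self._tls, "handle", 0)

ctx = StreamContextSketch()
seen = {}

def launch(name, handle):
    ctx.set_stream(handle)          # affects this thread's view only
    seen[name] = ctx.get_stream()

t1 = threading.Thread(target=launch, args=("a", 111))
t2 = threading.Thread(target=launch, args=("b", 222))
t1.start(); t2.start(); t1.join(); t2.join()

assert seen == {"a": 111, "b": 222}  # no cross-thread interference
assert ctx.get_stream() == 0         # main thread still on default
```

With a mutex instead, the whole set_stream/launch/reset sequence would need to hold the lock; thread-local storage sidesteps that by giving each thread its own slot.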

2. Synchronous memcpy_host_to_device not updated (Medium)

In kernel_launcher.cpp, the PR correctly updates all async operations to use active_stream, but lines 90–101 still use synchronous memcpy_host_to_device for external host array transfers:

// quadrants/runtime/cuda/kernel_launcher.cpp:90-91
CUDADriver::get_instance().memcpy_host_to_device(
    (void *)device_ptrs[data_ptr_idx], data_ptr, arr_sz);

Synchronous cuMemcpyHtoD implicitly synchronizes the default stream. When a user launches on a non-default stream with host-backed external arrays, this will introduce unintended synchronization with the default stream, potentially defeating the purpose of using separate streams.

Suggestion: Convert these to memcpy_host_to_device_async on active_stream, consistent with the rest of the changes.

3. __del__ calling into runtime during interpreter shutdown (Medium)

Both Stream.__del__ and Event.__del__ call self.destroy(), which accesses impl.get_runtime().prog:

# python/quadrants/lang/stream.py:31-36
def __del__(self):
    if self._handle != 0:
        try:
            self.destroy()
        except Exception:
            pass

During Python interpreter shutdown, the runtime/program may already be finalized. The bare except Exception: pass mitigates crashes, but leaked CUDA resources are still possible. Additionally, __del__ timing is non-deterministic — the CUDA context itself could be destroyed before these finalizers run.

Suggestion: Consider registering streams/events with the runtime for batch cleanup at shutdown, or use weakref.ref to the program to detect whether cleanup is still possible. At minimum, add a note in the docstring encouraging explicit destroy() calls or context manager usage.
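The weakref variant can be sketched as follows; `ProgramSketch` and `StreamSketch` are illustrative stand-ins, not the real classes:

```python
import weakref

class ProgramSketch:
    """Hypothetical stand-in for the runtime Program."""
    def __init__(self):
        self.destroyed = []
    def stream_destroy(self, handle):
        self.destroyed.append(handle)

class StreamSketch:
    def __init__(self, prog, handle):
        self._prog_ref = weakref.ref(prog)  # does not keep prog alive
        self._handle = handle
    def destroy(self):
        prog = self._prog_ref()             # None once prog is collected
        if prog is not None and self._handle != 0:
            prog.stream_destroy(self._handle)
        self._handle = 0                    # idempotent
    def __del__(self):
        self.destroy()                      # safe even after prog is gone

prog = ProgramSketch()
live = StreamSketch(prog, 42)
live.destroy()
assert prog.destroyed == [42]

orphan = StreamSketch(prog, 7)
del prog          # program finalized first, as at interpreter shutdown
orphan.destroy()  # weakref is dead, cleanup is quietly skipped
assert orphan._handle == 0
```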

4. Inconsistent parameter naming in Event API (Low)

Event.record uses stream as its parameter name, while Event.wait uses qd_stream:

# python/quadrants/lang/stream.py:52-53
def record(self, stream: Stream | None = None):
    """Record this event on a stream. None means the default stream."""
# python/quadrants/lang/stream.py:58-59
def wait(self, qd_stream: Stream | None = None):
    """Make a stream wait for this event. None means the default stream."""

I understand qd_stream is used to avoid colliding with the kernel **kwargs namespace, but that concern doesn't apply to Event.wait — it's a standalone method, not a kernel call. These should be consistent. I'd suggest using stream in both places since the qd_stream prefix is an implementation detail of the kernel dispatch path.

5. stream_synchronize doesn't guard against handle=0 (Low)

stream_destroy and event_destroy guard against handle == 0, but stream_synchronize does not:

// quadrants/program/program.cpp:527-533
void Program::stream_synchronize(uint64 stream_handle) {
#ifdef QD_WITH_CUDA
  if (compile_config().arch == Arch::cuda) {
    CUDADriver::get_instance().stream_synchronize(
        reinterpret_cast<void *>(stream_handle));
  }
#endif
}
Calling cuStreamSynchronize(nullptr) synchronizes the default stream, which is technically valid but may be unintentional. The Python Stream class with handle == 0 (CPU backend) will reach this code path and unintentionally synchronize the CUDA default stream if the arch somehow doesn't match. The guard pattern should be consistent across all methods.


Minor Observations

  • Test coverage is solid. The test suite covers lifecycle, multi-stream, event sync, CPU no-op, and ndarray. Consider adding a test for using a destroyed stream (expect graceful no-op or error) and a test for multiple event records on the same event.

  • No context manager support. Stream would benefit from __enter__/__exit__ so users can write with qd.create_stream() as s: for automatic cleanup. This is a nice-to-have for a follow-up.

  • test_kernel_on_stream and similar tests use @test_utils.test() (all arches) which means they run on CPU where create_stream() returns handle=0 and the stream kwarg is a no-op. This tests the no-crash path, which is fine, but the concurrency behavior is only actually tested when arch=[qd.cuda]. Consider being explicit about which tests are CUDA-specific correctness tests vs. CPU-graceful-degradation tests.
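The context-manager observation above is a small addition; a sketch with a dummy handle (the real Stream wraps a backend handle behind a destroy() call):

```python
class ManagedStreamSketch:
    """Hypothetical Stream-like wrapper with context-manager cleanup."""
    def __init__(self, handle):
        self._handle = handle
    def destroy(self):
        self._handle = 0
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc, tb):
        self.destroy()   # runs even if the body raised
        return False     # do not swallow exceptions

with ManagedStreamSketch(handle=99) as s:
    assert s._handle == 99   # usable inside the block
assert s._handle == 0        # destroyed on exit
```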


Verdict

The PR is well-structured and the API is clean. The kernel launcher changes, which propagate the active stream to all async memory operations, are the most important part and look correct. I'd recommend addressing the thread-safety concern (#1) before merge, even if only with documentation, and fixing the synchronous memcpy (#2) to avoid subtle correctness issues when users mix host arrays with non-default streams. The naming inconsistency (#4) is a quick fix worth doing now to avoid API churn later.

@hughperkins hughperkins marked this pull request as ready for review March 12, 2026 01:19
@hughperkins hughperkins changed the title [Perf] CUDA Streams 1: Add CUDA stream and event API [Perf] Streams 1: Add CUDA stream and event API Mar 12, 2026
@hughperkins hughperkins marked this pull request as draft March 12, 2026 04:59
Comment thread python/quadrants/lang/stream.py Outdated
"""Wraps a backend-specific GPU stream for concurrent kernel execution.

On backends without native streams (e.g. CPU), this is a no-op object.
Call destroy() explicitly or use as a context manager to ensure cleanup.
Contributor:

I would rather pretend it can only be used as a context manager, aligning with the API for torch.profiler. Managing streams manually without a context manager sounds like bad practice, and the safe pattern should be made easy.



class Event:
"""Wraps a backend-specific GPU event for stream synchronization.
Contributor:

Could you clarify in the documentation what an "event" is? I have no idea what it is.

Comment thread python/quadrants/lang/stream.py Outdated
Comment on lines +88 to +94
        if self._handle != 0:
            prog = impl.get_runtime().prog
            prog.event_destroy(self._handle)
            self._handle = 0

    def __del__(self):
        if self._handle != 0 and self._prog_ref is not None:
Contributor @duburcqa (Mar 15, 2026):

Personally I prefer if self._handle:. It is clearer semantically: whether it is an int or some more complex object does not matter much.
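For plain integer handles the two spellings agree; they only diverge if the handle can be None (a quick illustration, not code from the PR):

```python
# For plain int handles, `if handle:` and `if handle != 0:` agree:
for handle in (0, 1, 42, -1):
    assert bool(handle) == (handle != 0)

# They diverge only if the handle can be None: truthiness treats None
# as "no handle", while `!= 0` would treat it as a live handle.
handle = None
assert not handle        # falsy, so `if handle:` skips cleanup
assert handle != 0       # but `if handle != 0:` would run it
```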

Comment thread quadrants/python/export_lang.cpp Outdated
Comment on lines +499 to +507
.def("stream_create", &Program::stream_create)
.def("stream_destroy", &Program::stream_destroy)
.def("stream_synchronize", &Program::stream_synchronize)
.def("set_current_cuda_stream", &Program::set_current_cuda_stream)
.def("event_create", &Program::event_create)
.def("event_destroy", &Program::event_destroy)
.def("event_record", &Program::event_record)
.def("event_synchronize", &Program::event_synchronize)
.def("stream_wait_event", &Program::stream_wait_event);
Contributor:

What is CUDA-specific and what is not? Is only set_current_cuda_stream CUDA-specific? If so, are streams still usable on other backends, or is this function necessary to make them useful?


// Stream management
PER_CUDA_FUNCTION(stream_create, cuStreamCreate, void **, uint32);
PER_CUDA_FUNCTION(stream_destroy, cuStreamDestroy_v2, void *);
Contributor:

What is cuStreamDestroy_v2? Very weird name.

Why do we have functions with a _v2 suffix in multiple places?

Comment on lines 219 to +249
@@ -242,11 +242,11 @@ def fun(value: qd.types.ndarray(), offset: qd.template()):
qd_init_same_arch(offline_cache_file_path=str(tmp_path), offline_cache=True)
is_valid = False

-    def launch_kernel(self, key, t_kernel, compiled_kernel_data, *args):
+    def launch_kernel(self, key, t_kernel, compiled_kernel_data, *args, qd_stream=None):
         nonlocal is_valid
         is_valid = True
         assert compiled_kernel_data is not None
-        return launch_kernel_orig(self, key, t_kernel, compiled_kernel_data, *args)
+        return launch_kernel_orig(self, key, t_kernel, compiled_kernel_data, *args, qd_stream=qd_stream)
Contributor:

I would rather follow the existing pattern and move qd_stream before *args.

Moreover, I see no reason to prefix stream with qd. What does it mean? This is the quadrants project, so of course it is related to quadrants. It is just a GPU stream, no? Is the prefix just to clarify that it is a GPU computation stream rather than some random stream? I don't think that is necessary: you are passing this to functions like launch_kernel, so of course it is about launching kernels.

…c-1-cuda-streams

Made-with: Cursor

# Conflicts:
#	python/quadrants/lang/kernel.py
#	quadrants/python/export_lang.cpp
#	quadrants/rhi/cuda/cuda_context.cpp
#	quadrants/runtime/cuda/kernel_launcher.cpp
Made-with: Cursor
…c-1-cuda-streams

Made-with: Cursor

# Conflicts:
#	quadrants/runtime/cuda/kernel_launcher.cpp
The pure-Python perf dispatch test is timing-sensitive and unreliable on
the Vulkan software renderer in CI. The kernel variant of the same test
still covers perf dispatch on Vulkan.

Made-with: Cursor
@hughperkins Collaborator Author

migrated to use single PR on streams 4

@hughperkins hughperkins reopened this Apr 28, 2026
@hughperkins hughperkins force-pushed the hp/streams-quadrantsic-1-cuda-streams branch 4 times, most recently from 271f23d to f64c497 Compare April 28, 2026 15:41
@hughperkins hughperkins force-pushed the hp/streams-quadrantsic-1-cuda-streams branch from f64c497 to cd5b486 Compare April 28, 2026 15:42
…c-1-cuda-streams

# Conflicts:
#	quadrants/rhi/cuda/cuda_context.h
@github-actions

Coverage Report (f036b46f5)

File Coverage Missing
🔴 python/quadrants/lang/__init__.py 0% 18
🟢 python/quadrants/lang/kernel.py 90% 447
🔴 python/quadrants/lang/stream.py 40% 1,3,6,10,17,21-22,25,30,37,39-45,47-48,50-51,54,61,65-66,69,75,81,83-84,86,93,95-101,103-104,106-107,110,117,124
🟢 tests/python/test_cache.py 100%
🟢 tests/python/test_perf_dispatch.py 100%
🟢 tests/python/test_streams.py 100%

Diff coverage: 80% · Overall: 73% · 238 lines, 48 missing

Full annotated report

hughperkins and others added 2 commits May 2, 2026 01:28
Streams are not compatible with reverse-mode or forward-mode
differentiation.  The adstack sizer and Tape replay paths assume the
default stream; rather than fixing every race, block the combination
at the Python entry point with a clear error message.

Co-authored-by: Cursor <cursoragent@cursor.com>
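The guard described in the commit message above can be sketched as a check at the kernel-call entry point; the function and exception names below are illustrative, not the PR's actual code:

```python
class StreamsWithAutodiffError(RuntimeError):
    """Hypothetical error type; the real exception and message may differ."""

def launch_sketch(name, qd_stream=None, autodiff_active=False):
    """Illustrative entry-point check: refuse custom streams under autodiff."""
    if qd_stream is not None and autodiff_active:
        raise StreamsWithAutodiffError(
            f"kernel {name!r}: qd_stream is not supported inside autodiff "
            "(Tape / grad) regions; use the default stream instead")
    kind = "default" if qd_stream is None else "custom"
    return f"launched {name} on {kind} stream"

assert launch_sketch("k") == "launched k on default stream"
assert launch_sketch("k", qd_stream=object()) == "launched k on custom stream"
blocked = False
try:
    launch_sketch("k", qd_stream=object(), autodiff_active=True)
except StreamsWithAutodiffError:
    blocked = True
assert blocked
```

Failing fast at the Python boundary keeps the adstack and Tape replay paths on the default stream without touching the C++ side.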
Autodiff+streams is now blocked at the Python level, so the adstack
code path never runs on a non-default stream.  Remove the unnecessary
stream_synchronize we added in publish_adstack_metadata.

Co-authored-by: Cursor <cursoragent@cursor.com>
@hughperkins Collaborator Author

@claude review

hughperkins and others added 2 commits May 2, 2026 01:40
Our branch had a stale copy of publish_adstack_metadata and
ensure_adstack_heap that conflicted with upstream's refactor into
ensure_adstack_heap_float / ensure_adstack_heap_int.  Since autodiff
is now blocked with streams at the Python level, we have no changes
to make in this file.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions Bot commented May 2, 2026

Coverage Report (f6fee4fbd)

File Coverage Missing
🔴 python/quadrants/lang/__init__.py 0% 19
🟢 python/quadrants/lang/kernel.py 89% 456,599
🔴 python/quadrants/lang/stream.py 48% 1,3,6,10,17,21-22,25,30,40,42-48,50,53,57,64,68-69,72,78,84,89,99,101-107,109,112,116,123,130
🟢 tests/python/test_cache.py 100%
🟢 tests/python/test_perf_dispatch.py 100%
🟢 tests/python/test_streams.py 98% 252,268,283-284

Diff coverage: 85% · Overall: 67% · 316 lines, 47 missing

Full annotated report

hughperkins and others added 5 commits May 2, 2026 07:55
Move the 9 CUDA-only stream/event Program methods into a dedicated
translation unit.  The CMake glob on quadrants/program/* picks up
the new file automatically.

Co-authored-by: Cursor <cursoragent@cursor.com>
Move the CUDA stream/event logic into a StreamManager class
(program_stream.h/.cpp).  Program keeps its public API unchanged
and delegates to stream_manager_ internally, so the pybind layer
and Python code need no changes.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Add stream_manager() accessor to Program.  Update export_lang.cpp
to call StreamManager methods through lambdas.  Delete the 9
one-line delegation methods from Program — the declarations in
program.h and definitions in program_stream.cpp are both gone.

Co-authored-by: Cursor <cursoragent@cursor.com>
@hughperkins Collaborator Author

@codex review

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1b4ee60b2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread python/quadrants/lang/stream.py Outdated
Comment on lines +35 to +37
        if self._handle != 0 and self._prog_ref is not None:
            prog = impl.get_runtime().prog
            prog.stream_destroy(self._handle)
P2: Destroy streams via their owning Program

Stream.destroy() uses impl.get_runtime().prog instead of the stream’s captured program reference, so after qd.reset() (or any runtime swap) it can raise _prog attribute not initialized or call into the wrong backend, leaving the original CUDA stream undeleted; this breaks the method’s “safe to call multiple times” contract and can leak stream resources across resets. The same ownership bug pattern also exists in Event.destroy().
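One way to address this ownership gap, sketched with hypothetical stand-ins (`ProgSketch`, `OwnedStreamSketch`, and a module-level `current_runtime_prog` in place of `impl.get_runtime().prog`), is to capture the owning program via weakref at construction and resolve through it first:

```python
import weakref

class ProgSketch:
    """Hypothetical stand-in for a Program instance."""
    def __init__(self):
        self.destroyed = []
    def stream_destroy(self, handle):
        self.destroyed.append(handle)

current_runtime_prog = ProgSketch()   # stands in for impl.get_runtime().prog

class OwnedStreamSketch:
    def __init__(self, owner, handle):
        # Capture the owning program at creation; never trust globals later.
        self._prog_ref = weakref.ref(owner) if owner is not None else None
        self._handle = handle
    def _prog(self):
        if self._prog_ref is not None:
            return self._prog_ref()       # the owner, or None if collected
        return current_runtime_prog       # externally wrapped handle
    def destroy(self):
        prog = self._prog()
        if prog is not None and self._handle != 0:
            prog.stream_destroy(self._handle)
        self._handle = 0

owner = ProgSketch()
s = OwnedStreamSketch(owner, 5)
s.destroy()
assert owner.destroyed == [5]                # routed to the owner,
assert current_runtime_prog.destroyed == []  # not the current runtime
```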



github-actions Bot commented May 2, 2026

Coverage Report (b1b4ee60b)

File Coverage Missing
🔴 python/quadrants/lang/__init__.py 0% 19
🟢 python/quadrants/lang/kernel.py 89% 456,599
🔴 python/quadrants/lang/stream.py 48% 1,3,6,10,17,21-22,25,30,40,42-48,50,53,57,64,68-69,72,78,84,89,99,101-107,109,112,116,123,130
🟢 tests/python/test_cache.py 100%
🟢 tests/python/test_perf_dispatch.py 100%
🟢 tests/python/test_streams.py 98% 252,268,283-284

Diff coverage: 85% · Overall: 74% · 316 lines, 47 missing

Full annotated report

hughperkins and others added 2 commits May 2, 2026 09:18
Co-authored-by: Cursor <cursoragent@cursor.com>
All methods on Stream and Event now resolve the Program through the
captured weakref first, falling back to the current runtime only for
externally-wrapped handles.  Fixes a bug where destroy/synchronize/
record/wait would call into the wrong Program after qd.reset().

Co-authored-by: Cursor <cursoragent@cursor.com>
@hughperkins Collaborator Author

@codex review

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 614c742cd9


Comment thread quadrants/program/program_stream.cpp Outdated
Comment on lines +39 to +41
  if (arch_ == Arch::cuda && stream_handle != 0) {
    CUDAContext::get_instance().make_current();
    CUDADriver::get_instance().stream_synchronize(
        reinterpret_cast<void *>(stream_handle));
P2: Synchronize default-stream wrappers instead of no-op

synchronize_stream() currently skips handle 0, but the CUDA launcher elsewhere treats nullptr as a valid “default stream” handle (e.g., sync paths call stream_synchronize(nullptr)). This makes Stream(0).synchronize() a no-op on CUDA, so wrappers around external/default streams (such as interop stream pointers) can return before queued work is finished.


        prog = self._prog_ref()
        if prog is not None:
            return prog
        return impl.get_runtime().prog
P2: Avoid rebinding stale stream handles to a new runtime

When a stream was created from an older program instance, _prog_ref() can expire after qd.reset(). _prog() then falls back to impl.get_runtime().prog, so later destroy()/synchronize() calls may execute against a different program/context using a stale handle, which can trigger invalid-handle behavior instead of safe cleanup. This should either no-op or raise once the owning program is gone.


hughperkins and others added 3 commits May 2, 2026 09:31
When _prog_ref is set but the weakref has expired (Program destroyed),
_prog() now returns None instead of falling back to the current runtime.
Active operations (synchronize, record, wait) raise RuntimeError; destroy
silently no-ops and zeroes the handle.

Also allow synchronize_stream(0) to sync the default stream in CUDA,
matching cuStreamSynchronize(nullptr) semantics.

Co-authored-by: Cursor <cursoragent@cursor.com>
Move the 9 stream/event .def() bindings from export_lang.cpp into a
new export_stream.cpp, following the existing export_math/export_misc
pattern.  Satisfies the feature-factorization check for the 1225-line
export_lang.cpp.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions Bot commented May 2, 2026

Coverage Report (3b09331da)

File Coverage Missing
🔴 python/quadrants/lang/__init__.py 0% 19
🟢 python/quadrants/lang/kernel.py 89% 456,599
🔴 python/quadrants/lang/stream.py 48% 1,3,6,10,17,21-22,25,29,31,35,38,50,52-58,60,63,67,74,78-79,82,86,88,91,94,99,104,108,120,122-128,130,133,137,144,151
🟢 tests/python/test_cache.py 100%
🟢 tests/python/test_perf_dispatch.py 100%
🟢 tests/python/test_streams.py 98% 252,268,283-284

Diff coverage: 84% · Overall: 74% · 330 lines, 54 missing

Full annotated report

@hughperkins Collaborator Author

@codex review

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3b09331daf


Comment thread python/quadrants/lang/stream.py Outdated
Comment on lines +45 to +48
        prog = self._prog()
        if prog is not None:
            prog.stream_destroy(self._handle)
        self._handle = 0
P2: Destroy orphaned CUDA streams instead of dropping the handle

When qd.reset() (or any Program teardown) happens before Stream.destroy(), _prog_ref() can be None; this branch skips prog.stream_destroy(...) but still zeroes _handle, so the underlying CUDA stream is never freed and cannot be reclaimed later. Because CUDA context is shared across Programs, repeated create/reset cycles can leak stream resources over time. Use the current runtime Program as a fallback destroy path for owned streams (_prog_ref is not None) rather than silently dropping the handle.


Comment thread python/quadrants/lang/stream.py Outdated
Comment on lines +115 to +118
        prog = self._prog()
        if prog is not None:
            prog.event_destroy(self._handle)
        self._handle = 0
P2: Destroy orphaned CUDA events instead of dropping the handle

The same ownership gap exists for Event.destroy(): after a reset, _prog_ref() may be dead, so event_destroy is skipped but _handle is cleared. That loses the only reference without releasing the underlying CUDA event, which accumulates across reset-heavy workflows and cannot be cleaned up afterward. For owned events (_prog_ref is not None), fall back to the current runtime Program to destroy the event when the original Program object is gone.


When the owning Program has been collected (e.g. after qd.reset()),
destroy() and __del__ now fall back to the current runtime's Program
to free the underlying CUDA resource.  This is safe because CUDAContext
is a singleton, so stream/event handles remain valid across Programs.
Prevents resource leaks in create/reset cycles.

Co-authored-by: Cursor <cursoragent@cursor.com>
@hughperkins Collaborator Author

@codex review

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 48c3922aca


Comment on lines +253 to +254
  CUDADriver::get_instance().memcpy_host_to_device_async(
      (void *)device_ptrs[data_ptr_idx], data_ptr, arr_sz, active_stream);
P1: Serialize temp-buffer allocs with custom stream copies

This copy is now enqueued on active_stream, but temporary buffers for external host arrays are still allocated/freed through allocator paths that use stream nullptr (default stream). Because create_stream() creates CU_STREAM_NON_BLOCKING streams, default-stream alloc/free operations are not implicitly ordered with this stream, so a kernel call with qd_stream= and NumPy/PyTorch host arrays can hit use-before-allocation/use-after-free races (illegal address/invalid value) on memory-pool-enabled CUDA devices. The temp allocation/copy/free sequence needs to share one stream or be linked with explicit events.


        prog = self._prog()
        if prog is None:
            try:
                return impl.get_runtime().prog
P2: Destroy stream/event with original program backend

When the owning program is gone (e.g. after qd.reset()), cleanup falls back to impl.get_runtime().prog. If the runtime was reinitialized on a non-CUDA arch, later destroy()/__del__ calls route through a non-CUDA Program and become no-ops for nonzero CUDA handles, leaking CUDA streams/events in long-lived processes that reconfigure backends. Cleanup should preserve and use the original CUDA-capable program/backend for handle destruction.



github-actions Bot commented May 2, 2026

Coverage Report (48c3922ac)

File Coverage Missing
🔴 python/quadrants/lang/__init__.py 0% 19
🟢 python/quadrants/lang/kernel.py 89% 456,599
🔴 python/quadrants/lang/stream.py 47% 1,3,6,10,17,21-22,25,29,31,35,38,46-49,52,63,65-71,73,76,80,87,91-92,95,99,101,104,107,112,117,121,129-132,135,146,148-154,156,159,163,170,177
🟢 tests/python/test_cache.py 100%
🟢 tests/python/test_perf_dispatch.py 100%
🟢 tests/python/test_streams.py 98% 252,268,283-284

Diff coverage: 82% · Overall: 74% · 346 lines, 64 missing

Full annotated report

Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions Bot commented May 2, 2026

Coverage Report (44ee707af)

File Coverage Missing
🔴 python/quadrants/lang/__init__.py 0% 19
🟢 python/quadrants/lang/_fast_caching/args_hasher.py 100%
🟢 python/quadrants/lang/_fast_caching/src_hasher.py 100%
🟢 python/quadrants/lang/kernel.py 89% 456,599
🔴 python/quadrants/lang/stream.py 47% 1,3,6,10,17,21-22,25,29,31,35,38,46-49,52,63,65-71,73,76,80,87,91-92,95,99,101,104,107,112,117,121,129-132,135,146,148-154,156,159,163,170,177
🟢 tests/python/quadrants/lang/fast_caching/test_args_hasher.py 100%
🟢 tests/python/test_cache.py 100%
🟢 tests/python/test_perf_dispatch.py 100%
🟢 tests/python/test_streams.py 98% 252,268,283-284

Diff coverage: 82% · Overall: 74% · 357 lines, 64 missing

Full annotated report
