# [Perf] Streams 1: Add CUDA stream and event API #407
**Documentation index** (the new `streams` page is added to the listing):

````diff
@@ -57,6 +57,7 @@ tile16
 fastcache
 graph
+streams
 perf_dispatch
 init_options
 ```
````
**`docs/source/user_guide/streams.md`** (new file, `@@ -0,0 +1,138 @@`):

# Streams

Streams allow concurrent execution of GPU operations. By default, all Quadrants kernels launch on the default stream, which serializes everything. By creating explicit streams, you can run independent kernels concurrently and control synchronization with events.

## Supported platforms

| Backend | Streams | Events | Notes |
|---------|---------|--------|-------|
| CUDA | Yes | Yes | Full concurrent execution |
| CPU | No-op | No-op | `qd_stream` is silently ignored; kernels run serially |
| Metal | No-op | No-op | `qd_stream` is silently ignored; kernels run serially |
| Vulkan | No-op | No-op | `qd_stream` is silently ignored; kernels run serially |

On backends without native stream support, `create_stream()` and `create_event()` return objects with handle `0`. All stream/event operations become no-ops and kernels run serially. Stream-based code is therefore portable across all backends: it runs unmodified everywhere, just without concurrency on backends that lack native support.
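The no-op behavior on unsupported backends can be modeled with a small stand-in class (hypothetical, for illustration only; the real wrapper lives in `quadrants.lang.stream`): a handle of `0` turns every operation into a silent no-op, so stream-based code degrades gracefully to serial execution.

```python
class NoopStream:
    """Stand-in modeling what create_stream() returns on CPU/Metal/Vulkan.

    A handle of 0 means "no native stream": synchronize() and destroy()
    do nothing, and kernel launches fall back to the default stream.
    """

    def __init__(self, handle: int = 0):
        self.handle = handle

    def synchronize(self) -> None:
        if self.handle == 0:
            return  # no-op: all work already ran serially on the default stream
        # (a native backend would block here until the stream drains)

    def destroy(self) -> None:
        if self.handle == 0:
            return  # nothing to release
        self.handle = 0


s = NoopStream()   # handle 0, as on a backend without stream support
s.synchronize()    # silently does nothing
s.destroy()        # also a no-op
```

This is why the same program runs on every backend: no branch on the architecture is needed in user code.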
## Creating and using streams

```python
import quadrants as qd

qd.init(arch=qd.cuda)

N = 1024
a = qd.field(qd.f32, shape=(N,))
b = qd.field(qd.f32, shape=(N,))

@qd.kernel
def fill_a():
    for i in range(N):
        a[i] = 1.0

@qd.kernel
def fill_b():
    for i in range(N):
        b[i] = 2.0

s1 = qd.create_stream()
s2 = qd.create_stream()

fill_a(qd_stream=s1)
fill_b(qd_stream=s2)

s1.synchronize()
s2.synchronize()

s1.destroy()
s2.destroy()
```

Pass `qd_stream=` to any kernel call to launch it on that stream. Kernels on different streams may execute concurrently. Call `synchronize()` to block until all work on a stream completes.

## Events

Events let you express dependencies between streams without full synchronization.

```python
s1 = qd.create_stream()
s2 = qd.create_stream()

@qd.kernel
def produce():
    for i in range(N):
        a[i] = 10.0

@qd.kernel
def consume():
    for i in range(N):
        b[i] = a[i]

produce(qd_stream=s1)

e = qd.create_event()
e.record(s1)          # record when s1 finishes produce()
e.wait(qd_stream=s2)  # s2 waits for that event before proceeding

consume(qd_stream=s2)  # safe to read a[]: produce() is guaranteed complete
s2.synchronize()

e.destroy()
s1.destroy()
s2.destroy()
```

`e.record(stream)` captures the current point in `stream`'s execution. `e.wait(qd_stream=stream)` makes `stream` wait until the recorded point is reached. If `qd_stream` is omitted, the default stream waits.

## Context managers

Streams and events support `with` blocks for automatic cleanup:

```python
with qd.create_stream() as s:
    fill_a(qd_stream=s)
    s.synchronize()
# s.destroy() called automatically
```
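The cleanup contract such a context manager implies can be sketched with a stand-in class (hypothetical, not the real `quadrants` implementation): `__exit__` calls `destroy()` unconditionally, so the handle is released even when the body raises.

```python
class ManagedEvent:
    """Stand-in sketching destroy-on-exit semantics of `with qd.create_event()`."""

    def __init__(self):
        self.destroyed = False

    def record(self, stream):
        pass  # a native backend would capture the stream's current position

    def destroy(self):
        self.destroyed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.destroy()  # cleanup runs even if the body raised
        return False    # never swallow exceptions


with ManagedEvent() as e:
    pass
# e.destroyed is now True
```

Because `__exit__` returns `False`, exceptions raised inside the block still propagate after the event is destroyed.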
## PyTorch interop (CUDA)

When mixing Quadrants kernels with PyTorch operations on CUDA, both frameworks must use the same stream to avoid race conditions. Without explicit stream management, Quadrants and PyTorch may launch work on different streams with no ordering guarantees, leading to intermittent data corruption.

### Running Quadrants kernels on PyTorch's stream

```python
import torch
from quadrants.lang.stream import Stream

torch_stream_ptr = torch.cuda.current_stream().cuda_stream
stream = Stream(torch_stream_ptr)

physics_kernel(qd_stream=stream)
observations = compute_obs_tensor()  # PyTorch op on the same stream
apply_actions_kernel(qd_stream=stream)
```

> **Review comment on lines +105 to +108:** 🟡 The PyTorch interop example in streams.md imports `Stream` directly from `quadrants.lang.stream`:
>
> ```python
> import torch
> from quadrants.lang.stream import Stream
>
> torch_stream_ptr = torch.cuda.current_stream().cuda_stream
> stream = Stream(torch_stream_ptr)
> ```
Wrap PyTorch's raw `CUstream` pointer in a Quadrants `Stream` object. Do **not** call `destroy()` on this wrapper; PyTorch owns the underlying stream.

### Running PyTorch operations on a Quadrants stream

```python
qd_stream = qd.create_stream()
torch_stream = torch.cuda.ExternalStream(qd_stream.handle)

with torch.cuda.stream(torch_stream):
    physics_kernel(qd_stream=qd_stream)
    observations = compute_obs_tensor()
    apply_actions_kernel(qd_stream=qd_stream)

qd_stream.destroy()
```

`Stream.handle` is the raw `CUstream` pointer, which `torch.cuda.ExternalStream` accepts directly.

## Limitations

- **Not compatible with graphs.** Do not pass `qd_stream` to a kernel decorated with `graph=True`.
- **Not compatible with autodiff.** Do not pass `qd_stream` to a kernel that uses reverse-mode or forward-mode differentiation, or inside a `qd.ad.Tape` context.
- **`qd.sync()` only waits on the default stream.** It does not drain explicit streams. Call `stream.synchronize()` on each stream you need to wait for.
- **No automatic synchronization.** You are responsible for inserting events or `synchronize()` calls when one stream's output is another stream's input.
**`python/quadrants/lang/kernel.py`**:

```diff
@@ -453,7 +453,9 @@ def materialize(self, key: "CompiledKernelKeyType | None", py_args: tuple[Any, ...
         ]
         runtime._current_global_context = None

-    def launch_kernel(self, key, t_kernel: KernelCxx, compiled_kernel_data: CompiledKernelData | None, *args) -> Any:
+    def launch_kernel(
+        self, key, t_kernel: KernelCxx, compiled_kernel_data: CompiledKernelData | None, *args, qd_stream=None
+    ) -> Any:
         assert len(args) == len(self.arg_metas), f"{len(self.arg_metas)} arguments needed but {len(args)} provided"

         callbacks: list[Callable[[], None]] = []

@@ -567,9 +569,21 @@ def launch_kernel(self, key, t_kernel: KernelCxx, compiled_kernel_data: Compiled
             self.src_ll_cache_observations.cache_stored = True
             self._last_compiled_kernel_data = compiled_kernel_data
             launch_ctx.use_graph = self.use_graph and _GRAPH_ENABLED
+            if self.use_graph and qd_stream is not None:
+                raise RuntimeError(
+                    "qd_stream is not compatible with graph=True kernels. "
+                    "See docs/source/user_guide/streams.md for details."
+                )
             if self.graph_do_while_arg is not None and hasattr(self, "_graph_do_while_cpp_arg_id"):
                 launch_ctx.graph_do_while_arg_id = self._graph_do_while_cpp_arg_id
-            prog.launch_kernel(compiled_kernel_data, launch_ctx)
+            stream_handle = qd_stream.handle if qd_stream is not None else 0
+            if stream_handle:
+                prog.set_current_cuda_stream(stream_handle)
+            try:
+                prog.launch_kernel(compiled_kernel_data, launch_ctx)
+            finally:
+                if stream_handle:
+                    prog.set_current_cuda_stream(0)
         except Exception as e:
             e = handle_exception_from_cpp(e)
             if impl.get_runtime().print_full_traceback:
```

*claude[bot] marked this conversation as resolved.*
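The set-then-restore dance above (route launches to the stream, then always reset to the default stream `0`, even on error) generalizes naturally to a context manager. A sketch with a stub program object; the names `current_stream` and `StubProg` are illustrative, while the real code calls `prog.set_current_cuda_stream` on the C++ `Program`:

```python
from contextlib import contextmanager


@contextmanager
def current_stream(prog, handle: int):
    """Route kernel launches to `handle`; restore the default stream (0) on exit."""
    if handle:
        prog.set_current_cuda_stream(handle)
    try:
        yield
    finally:
        if handle:
            prog.set_current_cuda_stream(0)  # always restore, even if the launch raised


class StubProg:
    """Stand-in recording the stream handles it was asked to switch to."""

    def __init__(self):
        self.calls = []

    def set_current_cuda_stream(self, h):
        self.calls.append(h)


prog = StubProg()
with current_stream(prog, 42):
    pass  # a kernel launch would go here
# prog.calls == [42, 0]: stream set, then restored
```

The `finally` clause is what keeps a raising `prog.launch_kernel` from leaving the program pinned to a dead stream, mirroring the `try`/`finally` in the hunk above. A handle of `0` skips both calls, matching the no-op path on non-CUDA backends.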
```diff
@@ -581,6 +595,8 @@ def launch_kernel(self, key, t_kernel: KernelCxx, compiled_kernel_data: Compiled

         return_type = self.return_type
         if return_type or self.has_print:
+            if qd_stream is not None and self.has_print and not return_type:
+                qd_stream.synchronize()
             runtime_ops.sync()

         if not return_type:
```
```diff
@@ -647,6 +663,17 @@ def ensure_compiled(self, *py_args: tuple[Any, ...]) -> tuple[Callable, int, Aut
     # Thus this part needs to be fast. (i.e. < 3us on a 4 GHz x64 CPU)
     @_shell_pop_print
     def __call__(self, *py_args, **kwargs) -> Any:
+        qd_stream = kwargs.pop("qd_stream", None)
+        if qd_stream is not None and self.autodiff_mode != _NONE:
+            raise RuntimeError(
+                "qd_stream is not compatible with autodiff kernels. Streams cannot be used with "
+                "reverse-mode or forward-mode differentiation."
+            )
+        if qd_stream is not None and self.runtime.target_tape:
+            raise RuntimeError(
+                "qd_stream is not compatible with autograd Tape. Launch the kernel outside the Tape "
+                "context, or omit qd_stream."
+            )
```
> **Review comment on lines +666 to +676:** 🔴 FwdMode + `qd_stream` is not rejected: the new `__call__` checks cover `autodiff_mode` and an active `Tape`, but not `qd.ad.FwdMode`.
>
> **Step-by-step proof**
>
> ```python
> import quadrants as qd
>
> qd.init(arch=qd.cuda)
> x = qd.field(qd.f32, shape=(), needs_grad=True)
> loss = qd.field(qd.f32, shape=(), needs_grad=True)
>
> @qd.kernel
> def f():
>     loss[None] = x[None] * x[None]
>
> x[None] = 3.0
> s = qd.create_stream()  # CU_STREAM_NON_BLOCKING
> with qd.ad.FwdMode(loss=loss, param=x, seed=[1.0]):
>     # __enter__ ran:
>     #  - clear_all_gradients(DUAL) -> x.dual[None] = 0.0 written on NULL stream
>     #  - then x.dual[None] = 1.0 written on NULL stream via write_float
>     #  - write_float pre-syncs but does NOT post-sync; writer is in flight
>     f(qd_stream=s)
>     # FORWARD-transformed kernel queues on s (non-blocking).
>     # No ordering with NULL stream: forward reads x.dual before writer retires.
>     # Reads stale 0.0 instead of 1.0 -> JVP computed against zero seed.
> s.synchronize()
> print(loss.grad[None])  # Expected ~6.0 (= 2*x*1.0). Race -> may be 0.0.
> ```
>
> **Why pre-PR was correct:** Before this PR, all kernel launches went on the NULL stream, so the seed writer (on NULL) and the forward kernel (also on NULL) were sequentially ordered through the same stream. No race was possible. This PR introduces `qd_stream`, which breaks that ordering.
>
> **How to fix:** Mirror the Tape rejection in `__call__`:
>
> ```python
> if qd_stream is not None and (self.runtime.target_tape or self.runtime.fwd_mode_manager):
>     raise RuntimeError(
>         "qd_stream is not compatible with autograd Tape / FwdMode. ..."
>     )
> ```
>
> Alternatively, synchronize the seed write inside FwdMode's `__enter__`.
The hunk continues:

```diff
         if impl.get_runtime()._arch == _ARCH_PYTHON:
             return self.func(*py_args, **kwargs)
         config = impl.current_cfg()
```

*claude[bot] marked this conversation as resolved.*
```diff
@@ -709,7 +736,7 @@ def __call__(self, *py_args, **kwargs) -> Any:
         kernel_cpp = self.materialized_kernels[key]
         compiled_kernel_data = self.compiled_kernel_data_by_key.get(key, None)
         self.launch_observations.found_kernel_in_materialize_cache = compiled_kernel_data is not None
-        ret = self.launch_kernel(key, kernel_cpp, compiled_kernel_data, *py_args)
+        ret = self.launch_kernel(key, kernel_cpp, compiled_kernel_data, *py_args, qd_stream=qd_stream)
         if compiled_kernel_data is None:
             assert self._last_compiled_kernel_data is not None
             self.compiled_kernel_data_by_key[key] = self._last_compiled_kernel_data
```
> **Review comment:** 🟡 The streams.md compatibility table at lines 7-12 lists CUDA, CPU, Metal, and Vulkan but omits AMDGPU, even though it is a public arch (`qd.amdgpu`, exported via `python/quadrants/lang/misc.py` and asserted in `tests/python/test_api.py`). The new `Program::stream_create`/`event_create` methods at `program.cpp:497-541` only special-case `Arch::cuda`, so AMDGPU users fall through to the `return 0` no-op path just like CPU/Metal/Vulkan, without any documentation telling them `qd_stream` is silently a no-op on their backend. The fix is to add an AMDGPU row mirroring the existing no-op rows.
>
> **Why it manifests as a no-op for AMDGPU users:** Each of the new `Program` stream/event methods at `program.cpp:497-541` (`stream_create`, `stream_destroy`, `stream_synchronize`, `set_current_cuda_stream`, `event_create`, `event_destroy`, `event_record`, `event_synchronize`, `stream_wait_event`) gates its body on `compile_config().arch == Arch::cuda`. For `Arch::amdgpu` the function falls through to `return 0`. The Python wrapper in `stream.py` then sees `handle == 0` and treats the stream as a no-op: `destroy()` skips the C++ call (`if self._handle != 0 and self._prog_ref is not None`), `synchronize()` calls `prog.stream_synchronize(0)` which is also a no-op under the same gate, and the kernel-launch path in `kernel.py` skips `set_current_cuda_stream` because `stream_handle` is `0`.
>
> This is the same behavior as CPU/Metal/Vulkan, but only those three are documented in the table. AMDGPU is a documented, public-facing arch: `python/quadrants/lang/misc.py:121` exposes it as `qd.amdgpu`, line 139 includes it in `qd.gpu = [cuda, metal, vulkan, amdgpu]`, `tests/python/test_api.py` asserts `'amdgpu'` is a public symbol, and `Arch::amdgpu` is referenced across many C++ files (e.g. `program.cpp:394` and `program.cpp:431` for ndarray creation/data-ptr handling). The PR description even says "Non-CUDA backends return no-op handles (0)", confirming this is intended; AMDGPU just got left out of the table.
>
> **Step-by-step proof:** A user runs `qd.init(arch=qd.amdgpu)`, reads streams.md, sees no AMDGPU row, and reasonably assumes either that streams are supported (AMDGPU is a GPU backend) or that they might error; neither is true. `s = qd.create_stream(); kernel(qd_stream=s)` then behaves as follows: `create_stream()` → `Program::stream_create()` → `compile_config().arch == Arch::cuda` is false → `return 0`; `kernel(qd_stream=s)` enters `launch_kernel` with `stream_handle = s.handle = 0`, the `if stream_handle:` guard at `kernel.py:572` is false, and `set_current_cuda_stream` is never called.
>
> **Impact and fix:** Documentation-only completeness gap with no runtime correctness impact; severity is nit because the runtime behavior matches the other no-op backends — users just have no signpost telling them so for AMDGPU. The fix is one line in the table, mirroring the existing no-op rows.
> **Follow-up:** Still unaddressed in the latest commit. The compatibility table at streams.md:7-12 lists only CUDA, CPU, Metal, and Vulkan; no AMDGPU row has been added. AMDGPU users still have no docs signpost telling them `qd_stream` is silently a no-op on their backend.

> **Follow-up:** Still unaddressed in the latest commit (c549e07). The streams.md compatibility table (lines 7-12) still lists only CUDA, CPU, Metal, and Vulkan; no AMDGPU row has been added, so AMDGPU users still have no docs signpost telling them `qd_stream` is silently a no-op on their backend.

> **Follow-up:** Still unaddressed in the latest commit (ff8056d). The compatibility table at streams.md:7-12 lists only CUDA, CPU, Metal, and Vulkan; no AMDGPU row has been added, so AMDGPU users still have no docs signpost telling them `qd_stream` is silently a no-op on their backend.