Commits
47 commits
ab15b1b
Add CUDA stream and event API for concurrent kernel execution
hughperkins Mar 11, 2026
b856b33
Address review feedback for CUDA streams PR
hughperkins Mar 12, 2026
a3c682b
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins Apr 19, 2026
9be110d
Apply clang-format
hughperkins Apr 20, 2026
14c3c22
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins Apr 24, 2026
d3cae3c
[Test] Exclude flaky test_perf_dispatch_python from Vulkan
hughperkins Apr 24, 2026
cd5b486
[Doc] Add user guide for streams API
hughperkins Apr 28, 2026
f036b46
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins Apr 28, 2026
59c2627
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins Apr 28, 2026
f2a2596
Reflow stream.py docstrings to 120c line width
hughperkins Apr 28, 2026
de99f3e
Unwrap prose lines in streams.md to match repo doc style
hughperkins Apr 28, 2026
d6876da
Merge branch 'main' into hp/streams-quadrantsic-1-cuda-streams
hughperkins May 1, 2026
401d6f8
Use CU_STREAM_NON_BLOCKING for user-created streams
hughperkins May 1, 2026
a3c98f8
Use async DtoH memcpy on active_stream for external array readback
hughperkins May 1, 2026
ca14f67
Guard destroy()/__exit__ against destroying externally-owned handles
hughperkins May 1, 2026
b46de06
Fix clang-format indentation for memcpy_device_to_host_async
hughperkins May 1, 2026
b9eef6e
Use async DtoH on active_stream for do-while loop counter readback
hughperkins May 1, 2026
f0dd7d6
Use active_stream for sizer device context staging
hughperkins May 1, 2026
8b3d4ed
Add make_current() to stream/event Program methods
hughperkins May 1, 2026
aa4a70f
Use async DtoH on active_stream for resolve_num_threads readback
hughperkins May 1, 2026
5901a7f
Sync active_stream at end of launch_llvm_kernel unconditionally
hughperkins May 1, 2026
8550aa0
Fix end-of-launcher sync: conditional + dealloc race
hughperkins May 1, 2026
6374cf3
Reject qd_stream inside autograd Tape context
hughperkins May 1, 2026
ca8ace3
Fix linter formatting; guard graph+stream; sync has_print on stream
hughperkins May 1, 2026
b1c6eea
Sync active_stream before adstack sizer stride readback
hughperkins May 1, 2026
3c6b24e
Add tests for stream/event context managers, event.synchronize, error…
hughperkins May 1, 2026
c549e07
Fix graph+stream error guard and test
hughperkins May 1, 2026
5d284ac
Update qd.sync() docstring and streams doc to reflect default-stream-…
hughperkins May 1, 2026
ff8056d
Reflow sync() docstring to 120-char line width
hughperkins May 1, 2026
c9c75bd
Merge remote-tracking branch 'origin/main' into hp/streams-quadrantsi…
hughperkins May 2, 2026
360adc8
Reject qd_stream on autodiff kernels
hughperkins May 2, 2026
e20fe99
Revert adstack sizer stream_synchronize
hughperkins May 2, 2026
e3c5f6f
Reset llvm_runtime_executor.cpp to upstream
hughperkins May 2, 2026
f6fee4f
Add test for qd_stream + autodiff kernel error guard
hughperkins May 2, 2026
de4d99d
Merge branch 'main' into hp/streams-quadrantsic-1-cuda-streams
hughperkins May 2, 2026
9fd8b7b
Extract stream/event methods from program.cpp into program_stream.cpp
hughperkins May 2, 2026
9e6f865
Introduce StreamManager delegate class for stream/event ops
hughperkins May 2, 2026
84ba5b0
Fix clang-format in program_stream.h
hughperkins May 2, 2026
b1b4ee6
Remove Program wrapper methods, bind StreamManager directly via pybind
hughperkins May 2, 2026
7e10267
Reflow comment in program_stream.h to 120-char width
hughperkins May 2, 2026
614c742
Use captured prog_ref for all Stream/Event operations
hughperkins May 2, 2026
3dad35a
Fix stale handle safety in Stream/Event after qd.reset()
hughperkins May 2, 2026
bebc904
Extract stream/event pybind bindings into export_stream.cpp
hughperkins May 2, 2026
3b09331
Fix clang-format in export_stream.cpp
hughperkins May 2, 2026
48c3922
Fall back to current runtime for Stream/Event destroy after reset
hughperkins May 2, 2026
44ee707
Reflow _destroy_prog docstrings to 120-char width
hughperkins May 2, 2026
c6278ff
Merge branch 'main' into hp/streams-quadrantsic-1-cuda-streams
hughperkins May 3, 2026
1 change: 1 addition & 0 deletions docs/source/user_guide/index.md
@@ -57,6 +57,7 @@ tile16

fastcache
graph
streams
perf_dispatch
init_options
```
138 changes: 138 additions & 0 deletions docs/source/user_guide/streams.md
@@ -0,0 +1,138 @@
# Streams

Streams allow concurrent execution of GPU operations. By default, all Quadrants kernels launch on the default stream, which serializes everything. By creating explicit streams, you can run independent kernels concurrently and control synchronization with events.

## Supported platforms

| Backend | Streams | Events | Notes |
|---------|---------|--------|-------|
| CUDA | Yes | Yes | Full concurrent execution |
| CPU | No-op | No-op | `qd_stream` is silently ignored, kernels run serially |
| Metal | No-op | No-op | `qd_stream` is silently ignored, kernels run serially |
| Vulkan | No-op | No-op | `qd_stream` is silently ignored, kernels run serially |

Comment on lines +5 to +13

🟡 The streams.md compatibility table at lines 7-12 lists CUDA, CPU, Metal, and Vulkan but omits AMDGPU, even though it is a public arch (qd.amdgpu, exported via python/quadrants/lang/misc.py and asserted in tests/python/test_api.py). The new Program::stream_create / event_create methods at program.cpp:497-541 only special-case Arch::cuda, so AMDGPU users fall through to the return 0 no-op path just like CPU/Metal/Vulkan — without any documentation telling them qd_stream is silently a no-op on their backend. Fix is to add an AMDGPU row mirroring the existing no-op rows.

Extended reasoning...

What the bug is

The new docs page docs/source/user_guide/streams.md introduces a backend compatibility table:

| Backend | Streams | Events | Notes |
|---------|---------|--------|-------|
| CUDA    | Yes     | Yes    | Full concurrent execution |
| CPU     | No-op   | No-op  | qd_stream is silently ignored, kernels run serially |
| Metal   | No-op   | No-op  | qd_stream is silently ignored, kernels run serially |
| Vulkan  | No-op   | No-op  | qd_stream is silently ignored, kernels run serially |

AMDGPU is missing entirely. It is a documented, public-facing arch in quadrants — python/quadrants/lang/misc.py:121 exposes it as qd.amdgpu, line 139 includes it in qd.gpu = [cuda, metal, vulkan, amdgpu], tests/python/test_api.py asserts 'amdgpu' is a public symbol, and Arch::amdgpu is referenced across many C++ files (e.g. program.cpp:394 and program.cpp:431 for ndarray creation/data-ptr handling).

Why it manifests as a no-op for AMDGPU users

Each of the new Program stream/event methods at program.cpp:497-541 (stream_create, stream_destroy, stream_synchronize, set_current_cuda_stream, event_create, event_destroy, event_record, event_synchronize, stream_wait_event) gates its body on compile_config().arch == Arch::cuda:

uint64 Program::stream_create() {
#ifdef QD_WITH_CUDA
  if (compile_config().arch == Arch::cuda) {
    CUDAContext::get_instance().make_current();
    void *stream = nullptr;
    CUDADriver::get_instance().stream_create(&stream, 0x1 /*CU_STREAM_NON_BLOCKING*/);
    return reinterpret_cast<uint64>(stream);
  }
#endif
  return 0;
}

For Arch::amdgpu the function falls through to return 0. The Python wrapper in stream.py then sees handle == 0 and treats the stream as a no-op: destroy() skips the C++ call (if self._handle != 0 and self._prog_ref is not None), synchronize() calls prog.stream_synchronize(0) which is also a no-op under the same gate, and the kernel-launch path in kernel.py skips set_current_cuda_stream because stream_handle is 0.

This is the same behavior as CPU/Metal/Vulkan — but only those three are documented in the table. The PR description even says "Non-CUDA backends return no-op handles (0)" confirming this is intended. AMDGPU just got left out of the table.

Step-by-step proof

  1. User runs on AMDGPU: qd.init(arch=qd.amdgpu).
  2. User reads streams.md, sees no AMDGPU row, and (reasonably) assumes either streams are supported (since AMDGPU is a GPU backend) or that it might error — neither is true.
  3. User writes s = qd.create_stream(); kernel(qd_stream=s) and expects either GPU concurrency or an error.
  4. `create_stream()` → `Program::stream_create()` → `compile_config().arch == Arch::cuda` is false → `return 0`.
  5. kernel(qd_stream=s) enters launch_kernel with stream_handle = s.handle = 0. The if stream_handle: guard at kernel.py:572 is false, so set_current_cuda_stream is never called.
  6. The kernel runs on the AMDGPU default queue, exactly as if no stream had been requested. Two kernels on different qd-streams are still serialized.
  7. The user has no way to discover this from the docs short of reading the C++ source.

Impact and fix

Documentation-only completeness gap with no runtime correctness impact. Severity is nit because the runtime behavior matches the other no-op backends — users just have no signpost telling them so for AMDGPU. The fix is one line in the table:

| AMDGPU  | No-op   | No-op  | qd_stream is silently ignored, kernels run serially |


Still unaddressed in the latest commit. The compatibility table at streams.md:7-12 lists only CUDA, CPU, Metal, and Vulkan; no AMDGPU row has been added. AMDGPU users still have no docs signpost telling them qd_stream is silently a no-op on their backend.


Still unaddressed in the latest commit (c549e07). The streams.md compatibility table (lines 7-12) still lists only CUDA, CPU, Metal, and Vulkan; no AMDGPU row has been added, so AMDGPU users still have no docs signpost telling them qd_stream is silently a no-op on their backend.


Still unaddressed in the latest commit (ff8056d). The compatibility table at streams.md:7-12 lists only CUDA, CPU, Metal, and Vulkan; no AMDGPU row has been added, so AMDGPU users still have no docs signpost telling them qd_stream is silently a no-op on their backend.

On backends without native stream support, `create_stream()` and `create_event()` return objects with handle `0`. All stream/event operations become no-ops and kernels run serially. Stream-based code is therefore portable across all backends: it runs unmodified everywhere, just without concurrency.
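
This fallback can be sketched as a minimal plain-Python model (a hypothetical stand-in, not the actual Quadrants class) to make the zero-handle semantics concrete:

```python
class NoopStream:
    """Models the documented fallback: handle 0 means every operation is a no-op."""

    def __init__(self, handle=0):
        self.handle = handle

    def synchronize(self):
        if self.handle == 0:
            return  # no native stream to wait on
        # a real backend would call stream_synchronize(self.handle) here

    def destroy(self):
        if self.handle == 0:
            return  # nothing to release
        self.handle = 0

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.destroy()


s = NoopStream()   # what create_stream() returns on CPU/Metal/Vulkan
s.synchronize()    # returns immediately
s.destroy()        # safe, repeatable
```

The same calling code thus runs on every backend; only the concurrency disappears.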

## Creating and using streams

```python
import quadrants as qd

qd.init(arch=qd.cuda)

N = 1024
a = qd.field(qd.f32, shape=(N,))
b = qd.field(qd.f32, shape=(N,))

@qd.kernel
def fill_a():
    for i in range(N):
        a[i] = 1.0

@qd.kernel
def fill_b():
    for i in range(N):
        b[i] = 2.0

s1 = qd.create_stream()
s2 = qd.create_stream()

fill_a(qd_stream=s1)
fill_b(qd_stream=s2)

s1.synchronize()
s2.synchronize()

s1.destroy()
s2.destroy()
```

Pass `qd_stream=` to any kernel call to launch it on that stream. Kernels on different streams may execute concurrently. Call `synchronize()` to block until all work on a stream completes.

## Events

Events let you express dependencies between streams without full synchronization.

```python
s1 = qd.create_stream()
s2 = qd.create_stream()

@qd.kernel
def produce():
    for i in range(N):
        a[i] = 10.0

@qd.kernel
def consume():
    for i in range(N):
        b[i] = a[i]

produce(qd_stream=s1)

e = qd.create_event()
e.record(s1) # record when s1 finishes produce()
e.wait(qd_stream=s2) # s2 waits for that event before proceeding

consume(qd_stream=s2) # safe to read a[] — produce() is guaranteed complete
s2.synchronize()

e.destroy()
s1.destroy()
s2.destroy()
```

`e.record(stream)` captures the point in `stream`'s execution. `e.wait(qd_stream=stream)` makes `stream` wait until the recorded point is reached. If `qd_stream` is omitted, the default stream waits.
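
The record/wait ordering is the same dependency pattern as a `threading.Event` in ordinary Python. The sketch below is only an analogy for the semantics, not the GPU mechanism:

```python
import threading

produced = threading.Event()
log = []

def produce():
    log.append("a filled")
    produced.set()    # ~ e.record(s1): mark this point in the producer's work

def consume():
    produced.wait()   # ~ e.wait(qd_stream=s2): block until the recorded point
    log.append("b copied from a")

t_consume = threading.Thread(target=consume)
t_produce = threading.Thread(target=produce)
t_consume.start()
t_produce.start()
t_produce.join()
t_consume.join()
print(log)  # ['a filled', 'b copied from a'] (ordering guaranteed by the event)
```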

## Context managers

Streams and events support `with` blocks for automatic cleanup:

```python
with qd.create_stream() as s:
    fill_a(qd_stream=s)
    s.synchronize()
# s.destroy() called automatically
```

## PyTorch interop (CUDA)

When mixing Quadrants kernels with PyTorch operations on CUDA, both frameworks must use the same stream to avoid race conditions. Without explicit stream management, Quadrants and PyTorch may launch work on different streams with no ordering guarantees, leading to intermittent data corruption.

### Running Quadrants kernels on PyTorch's stream

```python
import torch
from quadrants.lang.stream import Stream

torch_stream_ptr = torch.cuda.current_stream().cuda_stream
stream = Stream(torch_stream_ptr)
Comment on lines +105 to +108

🟡 The PyTorch interop example in streams.md uses from quadrants.lang.stream import Stream (private implementation path) instead of the public API qd.Stream. Since Stream is exported via qd.Stream (per __all__ in stream.py and re-exported by quadrants.lang.__init__), the example should be stream = qd.Stream(torch_stream_ptr). Reaching into the impl module is brittle to future internal refactors that would silently break the docs without breaking the public API.

Extended reasoning...

What the bug is

In docs/source/user_guide/streams.md:103-113, the PyTorch interop section shows:

import torch
from quadrants.lang.stream import Stream

torch_stream_ptr = torch.cuda.current_stream().cuda_stream
stream = Stream(torch_stream_ptr)

This imports Stream directly from the implementation module quadrants.lang.stream rather than using the public qd.Stream namespace.

Why qd.Stream is the right reference

Stream is genuinely part of the public API:

  • python/quadrants/lang/stream.py:130 declares __all__ = ["Stream", "Event", "create_stream", "create_event"].
  • python/quadrants/lang/__init__.py:19 does from quadrants.lang.stream import *, picking up that __all__.
  • quadrants/__init__.py further re-exports via from quadrants.lang import *, so qd.Stream resolves.
  • This PR explicitly adds Stream to the asserted public-API surface in tests/python/test_api.py:65 (and Event, create_stream, create_event alongside).

The rest of the same docs file consistently uses the qd.-prefixed public API (qd.create_stream(), qd.create_event(), qd.field, qd.kernel, qd.sync(), etc.). This single PyTorch-interop example is the only place that reaches into the impl module path.

Why this matters

It is functional today, so this is a documentation nit rather than a runtime bug. The cost is forward-compatibility: any future refactor of python/quadrants/lang/stream.py (e.g. moving Stream to a different submodule, splitting the file, or renaming it) would silently break this docs example without breaking the public API contract that the PR establishes via test_api.py. Consumers who copy-paste the docs example would get an ImportError at a path the project never promised to keep stable, and the test suite would not catch it.

How to fix

Drop the import line and use the qd-prefixed name in the assignment, matching every other example on this page:

import torch

torch_stream_ptr = torch.cuda.current_stream().cuda_stream
stream = qd.Stream(torch_stream_ptr)

physics_kernel(qd_stream=stream)
observations = compute_obs_tensor()
apply_actions_kernel(qd_stream=stream)

This is a one-line edit in docs/source/user_guide/streams.md.


physics_kernel(qd_stream=stream)
observations = compute_obs_tensor() # PyTorch op on the same stream
apply_actions_kernel(qd_stream=stream)
```

Wrap PyTorch's raw `CUstream` pointer in a Quadrants `Stream` object. Do **not** call `destroy()` on this wrapper — PyTorch owns the underlying stream.
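
The ownership rule can be modeled in plain Python (a hypothetical stand-in for the wrapper's guard, not the actual Quadrants implementation): a stream constructed from an existing pointer marks itself externally owned, so `destroy()` degrades to a no-op.

```python
class WrappedStream:
    """Models the guard: externally-owned handles are never destroyed by us."""

    def __init__(self, handle, external=True):
        self.handle = handle
        self._external = external  # True when wrapping e.g. PyTorch's stream

    def destroy(self):
        if self._external:
            return  # the real owner (PyTorch) manages the lifetime
        # a real backend would release the native stream here
        self.handle = 0


s = WrappedStream(0xDEADBEEF)  # wrapping a pointer we do not own
s.destroy()                    # no-op: the handle survives
```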

### Running PyTorch operations on a Quadrants stream

```python
qd_stream = qd.create_stream()
torch_stream = torch.cuda.ExternalStream(qd_stream.handle)

with torch.cuda.stream(torch_stream):
    physics_kernel(qd_stream=qd_stream)
    observations = compute_obs_tensor()
    apply_actions_kernel(qd_stream=qd_stream)

qd_stream.destroy()
```

`Stream.handle` is the raw `CUstream` pointer, which `torch.cuda.ExternalStream` accepts directly.

## Limitations

- **Not compatible with graphs.** Do not pass `qd_stream` to a kernel decorated with `graph=True`.
- **Not compatible with autodiff.** Do not pass `qd_stream` to a kernel that uses reverse-mode or forward-mode differentiation, or inside a `qd.ad.Tape` context.
- **`qd.sync()` only waits on the default stream.** It does not drain explicit streams. Call `stream.synchronize()` on each stream you need to wait for.
- **No automatic synchronization.** You are responsible for inserting events or `synchronize()` calls when one stream's output is another stream's input.
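
Because `qd.sync()` does not drain explicit streams, one workable pattern is to keep the streams you create in a list and synchronize them all before reading results. The helper below is hypothetical, sketched against a stand-in stream object rather than the real API:

```python
class RecordingStream:
    """Stand-in for a created stream; records that synchronize() was called."""

    def __init__(self):
        self.synced = False

    def synchronize(self):
        self.synced = True


def sync_all(streams):
    # qd.sync() would only cover the default stream; explicit streams
    # must each be drained individually.
    for s in streams:
        s.synchronize()


streams = [RecordingStream() for _ in range(3)]
sync_all(streams)
print(all(s.synced for s in streams))  # True
```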
2 changes: 2 additions & 0 deletions python/quadrants/lang/__init__.py
@@ -16,6 +16,7 @@
from quadrants.lang.runtime_ops import *
from quadrants.lang.snode import *
from quadrants.lang.source_builder import *
from quadrants.lang.stream import *
from quadrants.lang.struct import *
from quadrants.types.enums import DeviceCapability, Format, Layout # noqa: F401

@@ -47,6 +48,7 @@
    "shell",
    "snode",
    "source_builder",
    "stream",
    "struct",
    "util",
]
33 changes: 30 additions & 3 deletions python/quadrants/lang/kernel.py
@@ -453,7 +453,9 @@ def materialize(self, key: "CompiledKernelKeyType | None", py_args: tuple[Any, ...
        ]
        runtime._current_global_context = None

    def launch_kernel(self, key, t_kernel: KernelCxx, compiled_kernel_data: CompiledKernelData | None, *args) -> Any:
    def launch_kernel(
        self, key, t_kernel: KernelCxx, compiled_kernel_data: CompiledKernelData | None, *args, qd_stream=None
    ) -> Any:
        assert len(args) == len(self.arg_metas), f"{len(self.arg_metas)} arguments needed but {len(args)} provided"

        callbacks: list[Callable[[], None]] = []
@@ -567,9 +569,21 @@ def launch_kernel(self, key, t_kernel: KernelCxx, compiled_kernel_data: Compiled
            self.src_ll_cache_observations.cache_stored = True
            self._last_compiled_kernel_data = compiled_kernel_data
            launch_ctx.use_graph = self.use_graph and _GRAPH_ENABLED
            if self.use_graph and qd_stream is not None:
                raise RuntimeError(
                    "qd_stream is not compatible with graph=True kernels. "
                    "See docs/source/user_guide/streams.md for details."
                )
claude[bot] marked this conversation as resolved.
            if self.graph_do_while_arg is not None and hasattr(self, "_graph_do_while_cpp_arg_id"):
                launch_ctx.graph_do_while_arg_id = self._graph_do_while_cpp_arg_id
            prog.launch_kernel(compiled_kernel_data, launch_ctx)
            stream_handle = qd_stream.handle if qd_stream is not None else 0
            if stream_handle:
                prog.set_current_cuda_stream(stream_handle)
            try:
                prog.launch_kernel(compiled_kernel_data, launch_ctx)
            finally:
                if stream_handle:
                    prog.set_current_cuda_stream(0)
        except Exception as e:
            e = handle_exception_from_cpp(e)
            if impl.get_runtime().print_full_traceback:
@@ -581,6 +595,8 @@ def launch_kernel(self, key, t_kernel: KernelCxx, compiled_kernel_data: Compiled

        return_type = self.return_type
        if return_type or self.has_print:
            if qd_stream is not None and self.has_print and not return_type:
                qd_stream.synchronize()
            runtime_ops.sync()

        if not return_type:
@@ -647,6 +663,17 @@ def ensure_compiled(self, *py_args: tuple[Any, ...]) -> tuple[Callable, int, Aut
    # Thus this part needs to be fast. (i.e. < 3us on a 4 GHz x64 CPU)
    @_shell_pop_print
    def __call__(self, *py_args, **kwargs) -> Any:
        qd_stream = kwargs.pop("qd_stream", None)
        if qd_stream is not None and self.autodiff_mode != _NONE:
            raise RuntimeError(
                "qd_stream is not compatible with autodiff kernels. Streams cannot be used with "
                "reverse-mode or forward-mode differentiation."
            )
        if qd_stream is not None and self.runtime.target_tape:
            raise RuntimeError(
                "qd_stream is not compatible with autograd Tape. Launch the kernel outside the Tape "
                "context, or omit qd_stream."
            )
Comment on lines +666 to +676

🔴 FwdMode + qd_stream is not rejected at kernel.py:658-663 (only target_tape is checked), but FwdMode.__enter__ writes the autodiff seed via param.dual[None] = ... on the NULL stream with no post-sync. The user's subsequent forward kernel queues on a CU_STREAM_NON_BLOCKING qd_stream with no implicit ordering against NULL — the forward reads param.dual before the seed write retires, producing silently-wrong dual outputs. Fix is the obvious symmetric extension of the Tape rejection: also check self.runtime.fwd_mode_manager at line 659, and update streams.md Limitations to mention FwdMode.

Extended reasoning...

What the bug is

Kernel.__call__ at python/quadrants/lang/kernel.py:658-663 rejects qd_stream when self.runtime.target_tape is set, but does not perform an analogous check for self.runtime.fwd_mode_manager. FwdMode exhibits the same kind of seed-write/user-kernel race that the Tape rejection was designed to prevent.

The exact code path that triggers it

  1. FwdMode.__enter__ (quadrants/ad/_ad.py:478-486) writes the autodiff seed via self.param.dual[None] = 1.0 * self.seed[0] (or self.param.dual[0] = ... / self.param.dual.from_numpy(...) for non-scalar seeds). The single-element write goes through ScalarField.__setitem__SNodeHostAccessor.settersnode.write_float, pybind11-bound to SNodeRwAccessorsBank::Accessors::write_float (quadrants/program/snode_rw_accessors_bank.cpp:31-39).

  2. write_float calls prog_->synchronize() BEFORE the writer kernel launch, then calls prog_->launch_kernel(...) which descends through KernelLauncher::launch_llvm_kernel (this PR's file). The writer kernel is a single-element field write: transfers.size() == 0, result_buffer_size == 0 (no insert_ret in Program::get_snode_writer), and has_print is false. Neither end-of-launcher branch at kernel_launcher.cpp:255-268 fires, so the writer kernel is left in flight on the NULL stream (active_stream is still nullptr at this point — no qd_stream is active during __enter__).

  3. After __enter__ returns, self.runtime.fwd_mode_manager is set (_ad.py:489) but target_tape is not.

  4. The user calls f(qd_stream=s). kernel.py:658-663 only checks target_tapefwd_mode_manager is missed. FwdModeManager.insert at line 709 transforms the kernel to autodiff_mode=FORWARD; the forward-transformed kernel is launched at line 726 with qd_stream=s.

  5. s is a CU_STREAM_NON_BLOCKING stream (program.cpp:500). Per CUDA semantics, non-blocking streams have no implicit ordering with the legacy NULL stream — that is the entire point of the non-blocking flag this PR introduces. The forward kernel can start reading param.dual before the writer kernel on NULL retires.

Step-by-step proof

import quadrants as qd
qd.init(arch=qd.cuda)

x = qd.field(qd.f32, shape=(), needs_grad=True)
loss = qd.field(qd.f32, shape=(), needs_grad=True)

@qd.kernel
def f():
    loss[None] = x[None] * x[None]

x[None] = 3.0
s = qd.create_stream()  # CU_STREAM_NON_BLOCKING
with qd.ad.FwdMode(loss=loss, param=x, seed=[1.0]):
    # __enter__ ran:
    #   - clear_all_gradients(DUAL) → x.dual[None] = 0.0 written on NULL stream
    #   - then x.dual[None] = 1.0 written on NULL stream via write_float
    #   - write_float pre-syncs but does NOT post-sync; writer is in flight
    f(qd_stream=s)
    # FORWARD-transformed kernel queues on s (non-blocking).
    # No ordering with NULL — forward reads x.dual before writer retires.
    # Reads stale 0.0 instead of 1.0 → JVP computed against zero seed.
s.synchronize()
print(loss.grad[None])  # Expected ~6.0 (= 2*x*1.0). Race → may be 0.0.

The trailing kernel_launcher sync at lines 266-268 is the user's forward kernel's own sync (it drains s, not the unrelated NULL-stream writer that ran during __enter__). The qd.sync() path (which lowers to cuStreamSynchronize(NULL)) is also not invoked between __enter__ and f(qd_stream=s).

Mirror race in __exit__'s clear_seed: writes dual=0 on NULL stream while the user's forward kernel may still be in flight on s; can overwrite the seed before the user-launched forward kernel has read it.

Why existing safeguards do not cover it

  • docs/source/user_guide/streams.md Limitations section lists Tape as incompatible (added by commit 6374cf3) but says nothing about FwdMode. A user reading the docs reasonably assumes FwdMode + qd_stream is supported, especially since FwdMode is exported in qd.ad and exercised by multiple tests (test_ad_basics_fwd, test_ad_atomic_fwd, test_ad_for_fwd, test_ad_if_fwd, test_adstack).
  • The "No automatic synchronization" doc note covers user-authored writes (field[i] = v from user code), but the seed write here is an INTERNAL operation done by FwdMode.__enter__ that the user did not author and cannot easily sync against.
  • No test exercises FwdMode + qd_stream.
  • The qd.sync() races bug is related but distinct: even after fixing qd.sync() to use cuCtxSynchronize, this race still occurs because there is no qd.sync() between __enter__ and the user's kernel call.

Why pre-PR was correct

Before this PR, all kernel launches went on the NULL stream, so the seed writer (on NULL) and the forward kernel (also on NULL) were sequentially ordered through the same stream. No race was possible. The PR introduces CU_STREAM_NON_BLOCKING user streams and the qd_stream API — for Tape it explicitly rejects the combination at kernel.py:659-663 (commit 6374cf3), but the analogous protection for FwdMode was missed.

How to fix

Mirror the Tape rejection at kernel.py:659:

if qd_stream is not None and (self.runtime.target_tape or self.runtime.fwd_mode_manager):
    raise RuntimeError(
        "qd_stream is not compatible with autograd Tape / FwdMode. ..."
    )

And add FwdMode to the streams.md Limitations section (which currently only mentions Tape).

Alternatively, synchronize the seed write inside FwdMode.__enter__ before returning, but that's more invasive and still leaves the user responsible for syncing dual fields after the forward kernel — option (1) is consistent with what was done for Tape and avoids any surprising behavior.

        if impl.get_runtime()._arch == _ARCH_PYTHON:
            return self.func(*py_args, **kwargs)
        config = impl.current_cfg()
claude[bot] marked this conversation as resolved.
@@ -709,7 +736,7 @@ def __call__(self, *py_args, **kwargs) -> Any:
        kernel_cpp = self.materialized_kernels[key]
        compiled_kernel_data = self.compiled_kernel_data_by_key.get(key, None)
        self.launch_observations.found_kernel_in_materialize_cache = compiled_kernel_data is not None
        ret = self.launch_kernel(key, kernel_cpp, compiled_kernel_data, *py_args)
        ret = self.launch_kernel(key, kernel_cpp, compiled_kernel_data, *py_args, qd_stream=qd_stream)
        if compiled_kernel_data is None:
            assert self._last_compiled_kernel_data is not None
            self.compiled_kernel_data_by_key[key] = self._last_compiled_kernel_data
6 changes: 4 additions & 2 deletions python/quadrants/lang/runtime_ops.py
@@ -4,8 +4,10 @@


def sync():
    """Blocks the calling thread until all the previously
    launched Quadrants kernels have completed.
    """Synchronizes the default stream.

    Blocks the calling thread until all work on the default GPU stream has completed. Kernels launched on explicit
    streams created via :func:`quadrants.create_stream` are **not** waited on — call ``stream.synchronize()`` for those.
    """
    impl.get_runtime().sync()
