
[Perf] CUDA Graph 1: unconditional graphs#405

Merged
hughperkins merged 52 commits into main from hp/cuda-graph-mvp-1-graph-build
Mar 16, 2026

Conversation

@hughperkins
Collaborator

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

When QD_CUDA_GRAPH=1, kernels with 2+ top-level for loops (offloaded
tasks) are captured into a CUDA graph on first launch and replayed on
subsequent launches, eliminating per-kernel launch overhead.

Uses the explicit graph node API (cuGraphAddKernelNode) with persistent
device arg/result buffers. Assumes stable ndarray device pointers.

Made-with: Cursor
Replace the global QD_CUDA_GRAPH=1 env var with a per-kernel opt-in.
The flag flows from the Python decorator through LaunchContextBuilder
to the CUDA kernel launcher, avoiding interference with internal
kernels like ndarray_to_ext_arr.

Made-with: Cursor
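A minimal sketch of how a per-kernel flag like this can flow from the decorator to the launcher (illustrative Python only — `kernel` and `launch` below are stand-ins invented for this sketch, not the actual Quadrants API):

```python
# Toy model of the per-kernel opt-in. Internal kernels are never decorated
# with cuda_graph=True, so they are untouched by the graph path.
def kernel(cuda_graph=False):
    def wrap(fn):
        fn.cuda_graph = cuda_graph  # carried to the launcher at call time
        return fn
    return wrap

def launch(fn):
    # The launcher attempts graph capture only when the kernel opted in.
    return "graph path" if getattr(fn, "cuda_graph", False) else "normal path"

@kernel(cuda_graph=True)
def user_kernel():
    pass

@kernel()
def ndarray_to_ext_arr():  # stands in for an internal kernel
    pass

assert launch(user_kernel) == "graph path"
assert launch(ndarray_to_ext_arr) == "normal path"
```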
Verify that cuda_graph=True is a harmless no-op on non-CUDA backends
(tested on x64/CPU). Passes on both x64 and CUDA.

Made-with: Cursor
On each graph replay, re-resolve ndarray device pointers and re-upload
the arg buffer to the persistent device buffer. This ensures correct
results when the kernel is called with different ndarrays after the
graph was first captured.

Refactored ndarray pointer resolution into resolve_ctx_ndarray_ptrs().

Made-with: Cursor
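A toy illustration of the replay-time fix described above (the dict and helpers below are invented for the sketch; the real code re-resolves CUDA device pointers and re-uploads the argument buffer):

```python
# Before each replay, re-resolve the current ndarrays' pointers and
# re-upload them into the persistent buffer the graph was captured against.
persistent_arg_buffer = {}  # stands in for the persistent device arg buffer

def resolve_ctx_ndarray_ptrs(ndarrays):
    # The real helper returns each ndarray's device pointer; id() is a stand-in.
    return [id(a) for a in ndarrays]

def replay(ndarrays):
    # The re-upload happens on EVERY replay, not only at capture time,
    # so calling with different ndarrays after capture stays correct.
    persistent_arg_buffer["ptrs"] = resolve_ctx_ndarray_ptrs(ndarrays)
    return persistent_arg_buffer["ptrs"]

a, b = [1.0, 2.0], [3.0, 4.0]
replay([a])                     # "capture" call
assert replay([b]) == [id(b)]   # later call with a different ndarray
```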
Apply lint formatting fixes (clang-format, ruff) and remove
cuda_graph flag from autodiff adjoint kernel until the interaction
with reverse-mode AD is validated.
@hughperkins
Collaborator Author

Opus 4.6 review:

Review: hp/cuda-graph-mvp-1-graph-build

Overview

The branch adds CUDA graph capture/replay for multi-task kernels. Instead of launching each offloaded task (top-level for loop) as a separate kernel, it builds a CUDA graph using the explicit node API (cuGraphAddKernelNode) on first invocation and replays it on subsequent calls. This eliminates per-kernel launch overhead for multi-task kernels.
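The build-once/replay-afterwards contract can be modeled in a few lines (a pure-Python toy; `GraphCache` and friends are invented for illustration, the real logic lives in the C++ launcher):

```python
# Toy model: the first launch pays the capture cost, later launches replay.
class GraphCache:
    def __init__(self):
        self._graphs = {}  # launch_id -> captured task list ("graph")
        self.builds = 0
        self.replays = 0

    def launch(self, launch_id, tasks):
        if launch_id not in self._graphs:
            self._graphs[launch_id] = list(tasks)  # capture once
            self.builds += 1
        else:
            self.replays += 1
        for task in self._graphs[launch_id]:  # one launch runs all tasks
            task()

results = []
cache = GraphCache()
tasks = [lambda: results.append("loop0"), lambda: results.append("loop1")]
for _ in range(3):
    cache.launch(launch_id=7, tasks=tasks)
assert cache.builds == 1 and cache.replays == 2 and len(results) == 6
```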

Commit progression

The commit history tells a clean story:

  1. Initial MVP gated behind QD_CUDA_GRAPH=1 env var
  2. Bug fixes
  3. Replaced env var with per-kernel @qd.kernel(cuda_graph=True) API — much better design
  4. Cross-platform test (no-op on non-CUDA)
  5. Handle argument changes on replay by re-uploading the arg buffer
  6. Formatting + disable on adjoint kernels (sensible until AD interaction is validated)

What looks good

  • Clean RAII: CachedCudaGraph has proper move semantics, deleted copy, and a destructor that frees the graph exec and device buffers.
  • Graceful fallback: If < 2 tasks, launch_llvm_kernel_graph returns false and the normal path runs. Same if ndarray pointers aren't on-device.
  • Per-kernel opt-in: The cuda_graph=True flag flows cleanly from the Python decorator through kernel_impl.py → LaunchContextBuilder → C++ launcher. No global state.
  • Good test coverage: 6 test cases covering 2-loop, 3-loop, single-loop (no graph), no annotation, changed args, and cross-platform no-op.
  • Adjoint excluded: Only the primal kernel gets cuda_graph; the adjoint doesn't, which avoids surprises with reverse-mode AD.

Concerns

  1. Result buffer never copied back on replay. The non-graph path allocates a fresh device_result_buffer, copies it back to the host after execution, and frees it. The graph path has a persistent_device_result_buffer but never copies results back — not on capture, not on replay. If any kernel writes to result_buffer and the host reads struct return values, those reads will silently get stale/zero data. This works today only because the test kernels write to ndarrays, not return values.

  2. No graph invalidation. The graph is cached by launch_id forever. If grid/block dims could ever change between invocations (e.g., different-sized ndarrays triggering different grid dims), the cached graph would use stale launch configs. This assumption should at least be documented, or ideally validated at replay time.
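One shape the suggested replay-time validation could take (a sketch of the proposal, not the current code; `CachedGraph` here is hypothetical):

```python
# Record the launch config at capture time and check it on every replay,
# instead of silently replaying with stale grid/block dims.
class CachedGraph:
    def __init__(self, grid_dim, block_dim):
        self.grid_dim, self.block_dim = grid_dim, block_dim  # capture-time config

    def replay(self, grid_dim, block_dim):
        if (grid_dim, block_dim) != (self.grid_dim, self.block_dim):
            raise RuntimeError(
                f"cached graph built for grid={self.grid_dim}, "
                f"block={self.block_dim}; got grid={grid_dim}, block={block_dim}"
            )
        return "replayed"

g = CachedGraph(grid_dim=32, block_dim=128)
assert g.replay(32, 128) == "replayed"
try:
    g.replay(64, 128)  # e.g. a larger ndarray changed the grid dims
    raise AssertionError("expected RuntimeError")
except RuntimeError:
    pass
```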

  3. Linear dependency chain limits parallelism. Each graph node depends on the previous one (prev_node). If two tasks are truly independent (e.g., x[i] += 1 and y[i] += 2 on disjoint arrays), CUDA could run them in parallel within the graph, but the linear chain prevents that. Not a correctness issue, but it leaves performance on the table.
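The cost of the linear chain can be quantified by the critical path through the node dependency lists (a toy model of the dependency arrays passed to cuGraphAddKernelNode):

```python
# deps[i] lists the nodes that node i waits on. The linear chain passes
# [prev_node] for every node; independent tasks could pass [] instead.
def critical_path_len(deps):
    depth = {}
    for i, d in enumerate(deps):  # nodes listed in topological order
        depth[i] = 1 + max((depth[j] for j in d), default=0)
    return max(depth.values())

linear = [[], [0], [1]]    # current scheme: every node waits on the previous
parallel = [[], [], []]    # fully independent tasks: no dependencies
assert critical_path_len(linear) == 3    # tasks serialize
assert critical_path_len(parallel) == 1  # tasks could overlap
```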

  4. External array host→device transfers not supported in graph path. resolve_ctx_ndarray_ptrs returns false if an external array isn't already on device, which falls back to the non-graph path. That's fine as a fallback, but worth noting as a scope limitation.

  5. Comment removal. Several useful comments were stripped from the non-graph launch_llvm_kernel path (explaining transfers, device_ptrs, the zero-sized array edge case, etc.). These seem like collateral damage from the diff rather than intentional, and they lose context for future readers.

  6. persistent_ctx.runtime could go stale. The LLVM runtime pointer is captured at graph creation time. If the runtime were ever reallocated, this would dangle. Probably fine in practice since the runtime is long-lived, but it's a hidden assumption.

Summary

Solid MVP. The architecture is clean, the API is well-designed, and the test coverage is good. The main thing to address before merging is concern #1 (result buffer not copied back) — it's a silent-corruption bug waiting to happen when someone uses cuda_graph=True on a kernel with struct return values. The rest are either acceptable scope limitations or minor polish.

The graph path doesn't copy the result buffer back to the host,
so struct returns would silently return stale data. Error early
instead of producing wrong results.
Verifies that calling a cuda_graph=True kernel first with small
arrays then with larger ones produces correct results for all
elements — catches stale grid dims if the graph were incorrectly
replayed from the first capture.
Re-add documentation comments for |transfers|, |device_ptrs|,
zero-sized array handling, external array logic, and the
host copy-back section in the non-graph launch path.
Verify that a cuda_graph=True kernel works correctly after a
reset/reinit cycle — exercises the full teardown and rebuild
of the KernelLauncher and its graph cache.
@hughperkins
Collaborator Author

For opus review:

  1. added exception for this case
  2. added test for this case
  3. out of scope
  4. Doc seems to be in the next PR, so ignoring this for now
  5. comments put back
  6. test added

@hughperkins hughperkins changed the title [Perf] Add CUDA Graph, part 1 [Perf] CUDA Graph, part 1 Mar 12, 2026
@hughperkins hughperkins changed the title [Perf] CUDA Graph, part 1 [Perf] CUDA Graph part 1: unconditional graphs Mar 12, 2026
@hughperkins hughperkins changed the title [Perf] CUDA Graph part 1: unconditional graphs [Perf] CUDA Graph 1: unconditional graphs Mar 12, 2026
@hughperkins hughperkins marked this pull request as draft March 12, 2026 04:59
@hughperkins
Collaborator Author

Missing some doc.

Comment thread docs/source/user_guide/cuda_graph.md Outdated
@@ -0,0 +1,109 @@
# CUDA Graph

CUDA graphs reduce kernel launch overhead by capturing a sequence of GPU operations into a graph, then replaying it in a single launch. This is most beneficial for kernels that compile into multiple GPU tasks (e.g. kernels with multiple top-level `for` loops), where the per-task launch overhead would otherwise dominate.
Collaborator Author


This is confusing the implementation with the benefit. Let's reword.

Comment thread docs/source/user_guide/cuda_graph.md Outdated

Use `cuda_graph=True` on kernels that:

- Run on CUDA (`arch=qd.cuda`)
Collaborator Author


This makes it appear that you can only run these kernels on CUDA. You can run them elsewhere, but they fall back to the non-graph path on other platforms.

Collaborator Author


I cut this entire section

Comment thread docs/source/user_guide/cuda_graph.md Outdated
Use `cuda_graph=True` on kernels that:

- Run on CUDA (`arch=qd.cuda`)
- Contain **two or more** top-level `for` loops (i.e. compile into multiple offloaded tasks)
Collaborator Author


Again, mixing the implementation with the use case.

Comment thread docs/source/user_guide/cuda_graph.md Outdated

- **No struct return values.** Kernels that return values (e.g. `-> qd.i32`) cannot use CUDA graphs. An error is raised if `cuda_graph=True` is set on such a kernel.
- **Primal kernels only.** The `cuda_graph=True` flag is applied to the primal (forward) kernel only, not its adjoint. Autodiff kernels use the normal launch path.
- **Non-CUDA backends.** On non-CUDA backends (CPU, Vulkan, Metal), `cuda_graph=True` is silently ignored. This means you can annotate a kernel unconditionally and it will work on all platforms.
Collaborator Author


This isn't really a 'restriction', I feel.

Comment thread docs/source/user_guide/cuda_graph.md Outdated

### Fields as arguments

Fields (SNode-backed data created with `qd.field`) are accessed through the global runtime pointer, not through the kernel argument buffer. The graph captures this pointer at build time, so fields work transparently with CUDA graphs.
Collaborator Author


remove this paragraph. too technical

Comment thread docs/source/user_guide/cuda_graph.md Outdated

---

## Advanced: Implementation Details
Collaborator Author


Let's remove the advanced implementation details

};

struct CachedCudaGraph {
void *graph_exec{nullptr};
Collaborator Author


Let's add a comment saying what graph_exec is

const std::vector<std::pair<int, Callable::Parameter>> &parameters);
bool launch_llvm_kernel_graph(Handle handle, LaunchContextBuilder &ctx);
std::vector<Context> contexts_;
std::unordered_map<int, CachedCudaGraph> cuda_graph_cache_;
Collaborator Author


Let's add a comment saying what the key of this map is

@hughperkins
Collaborator Author

Refactored the CUDA graph logic into a separate class, to keep things cleaner.

Made-with: Cursor
cuda_graph requires all ndarrays to be device-resident. Previously
this silently fell back to the non-graph path; now it throws a clear
error message.

Made-with: Cursor
return false;
}

QD_ERROR_IF(ctx.result_buffer_size > 0,
Collaborator


Just curious why this is not supported?

Collaborator Author


What opus says:

" CUDA graphs capture a fixed sequence of GPU operations and replay them. When a kernel returns a struct value, the result needs to be copied from a device result buffer back to
the host after execution. The problem is that during graph replay:

  1. The graph writes results to the persistent result buffer (baked into the graph at capture time)
  2. But the host-side code that reads the return value expects it at the original buffer location from the launch context

"The result buffer address mismatch means the host would read stale/wrong data after replay. Fixing this would require updating the result buffer pointer on the host side after
each replay, or adding a D2H copy node into the graph — neither of which is implemented.
That said, kernels with struct return values are uncommon in practice (most kernels write to ndarrays instead), so it hasn't been a priority."

Note that adding a D2H copy would incur latency I think?
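The mismatch can be shown in miniature (pure-Python illustration; the two dicts stand in for the persistent device buffer baked into the graph and the per-launch host-side buffer):

```python
# The graph writes into the persistent buffer captured at build time...
persistent_result_buffer = {"value": 0}

def graph_replay():
    persistent_result_buffer["value"] += 1  # kernel's struct-return write

# ...but the host reads the per-launch context buffer, which nothing updates.
launch_ctx_result_buffer = {"value": 0}

graph_replay()
assert persistent_result_buffer["value"] == 1  # result landed here
assert launch_ctx_result_buffer["value"] == 0  # host sees stale data
```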

Collaborator Author


Actually, why would we want to return a struct from a GPU kernel? Wouldn't this always incur latency, and cause a GPU pipeline stall?

@@ -0,0 +1,304 @@
import numpy as np
Collaborator


It seems that we only have test cases for one kernel (though two loops are decomposed into two offloaded tasks). Shall we add test cases for multiple kernels?

Collaborator Author


What do you mean by 'multiple kernels'? I'm only intending to support CUDA graphs for a single Quadrants kernel containing multiple top-level loops. This is an intentional design decision.

Collaborator


I see. I was thinking of a large graph which can record all kernel launches together. But if it is designed for a single kernel, then the tests look good to me.

Collaborator Author


Yeah, so what I'm imagining in my head is that instead of, e.g.,

@qd.kernel
def linesearch_kernel(...):
    for i_b in range(B):
        ...

@qd.kernel
def hessian_kernel(...):
    for i_b in range(B):
        ...

@qd.kernel
def check_not_done_kernel(...):
    for i_b in range(B):
        ...

def python_solver_loop(...):
      while not_done:
           linesearch_kernel(...)
           hessian_kernel(...)
           not_done = check_not_done_kernel(...)

We'd have:

@qd.func
def linesearch_func(...):
    for i_b in range(B):
        ...

@qd.func
def hessian_func(...):
    for i_b in range(B):
        ...

@qd.func
def check_not_done_func(...):
    for i_b in range(B):
        ...

@qd.kernel(graph_while=not_done)
def graph_solver_loop(not_done, ...):
     linesearch_func(...)
     hessian_func(...)
     check_not_done_func(...)

Thoughts? (Note that changing how this looks might not be tons of work, depending on the desired approach, so feel free to propose how you are imagining this might look)

Collaborator Author


What I can do, if you want, is add some tests that call @qd.funcs, rather than having 'naked' for loops inside the cuda graph function?

Collaborator Author


see test_cuda_graph_multi_func

Expose the number of offloaded tasks (parallel-for loops) as a
compile-time property on CompiledKernelData, captured per launch in
Program and accessible via pybind11. Assert expected task counts in
cuda graph tests.

Made-with: Cursor
Track the number of kernel nodes in each cached CUDA graph and expose
it via get_cuda_graph_num_nodes_on_last_call() through the full
CudaGraphManager → KernelLauncher → Program → pybind11 chain. Assert
node counts match offloaded task counts in all cuda graph tests.

Made-with: Cursor
Test a kernel that calls three @qd.func with 2, 4, and 3 top-level
for loops respectively, asserting 9 offloaded tasks and 9 graph nodes.

Made-with: Cursor
@hughperkins hughperkins force-pushed the hp/cuda-graph-mvp-1-graph-build branch from 32b8d65 to 470b073 Compare March 14, 2026 20:28
The LLVM x64 backend generates extra tasks per ndarray argument for
serialization/setup, so exact equality checks fail. Use >= instead.

Made-with: Cursor
Ndarray kernels can produce additional serial tasks beyond the
user-visible loops, so hardcoding expected node counts breaks.
Use the actual num_offloaded_tasks instead.
@hughperkins hughperkins merged commit aada310 into main Mar 16, 2026
47 checks passed
@hughperkins hughperkins deleted the hp/cuda-graph-mvp-1-graph-build branch March 16, 2026 16:14
hughperkins added a commit that referenced this pull request Mar 16, 2026
Resolve conflicts from squash-merged MVP-1 PR (#405) vs branch's
pre-existing MVP-1 merge commits. Keep all graph_do_while (MVP-2)
additions. Incorporate grad_ptr local variable cleanup from main.