[Perf] CUDA Graph 1: unconditional graphs #405
Conversation
When QD_CUDA_GRAPH=1, kernels with 2+ top-level for loops (offloaded tasks) are captured into a CUDA graph on first launch and replayed on subsequent launches, eliminating per-kernel launch overhead. Uses the explicit graph node API (cuGraphAddKernelNode) with persistent device arg/result buffers. Assumes stable ndarray device pointers. Made-with: Cursor
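The capture-once / replay-many pattern described above can be sketched in plain Python. All names here (`GraphCache`, `launch`, the handle) are illustrative stand-ins, not the actual Quadrants or CUDA driver API:

```python
class GraphCache:
    def __init__(self):
        self.graphs = {}    # kernel handle -> list of captured task nodes
        self.captures = 0   # how many first-launch captures happened
        self.replays = 0    # how many launches hit the cached graph

    def launch(self, handle, tasks):
        if handle not in self.graphs:
            # First launch: record every offloaded task as a graph node.
            self.graphs[handle] = list(tasks)
            self.captures += 1
        else:
            # Subsequent launches: replay the captured graph in one shot,
            # skipping the per-task launch overhead.
            self.replays += 1
        return [task() for task in self.graphs[handle]]

cache = GraphCache()
two_loop_kernel = [lambda: "loop_0", lambda: "loop_1"]  # 2+ top-level loops
out1 = cache.launch("k", two_loop_kernel)  # first call: captures
out2 = cache.launch("k", two_loop_kernel)  # later calls: replay
```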
Replace the global QD_CUDA_GRAPH=1 env var with a per-kernel opt-in. The flag flows from the Python decorator through LaunchContextBuilder to the CUDA kernel launcher, avoiding interference with internal kernels like ndarray_to_ext_arr.
Verify that cuda_graph=True is a harmless no-op on non-CUDA backends (tested on x64/CPU). Passes on both x64 and CUDA.
On each graph replay, re-resolve ndarray device pointers and re-upload the arg buffer to the persistent device buffer. This ensures correct results when the kernel is called with different ndarrays after the graph was first captured. Refactored ndarray pointer resolution into resolve_ctx_ndarray_ptrs().
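An illustrative sketch (hypothetical names, not the real implementation) of why the arg buffer must be refreshed before every replay: the captured graph reads a persistent device buffer, so the caller's current ndarray pointers have to be written into that buffer first.

```python
persistent_arg_buffer = {}  # stands in for the persistent device buffer

def resolve_ctx_ndarray_ptrs(ndarrays):
    # Mirrors the refactored helper named above: look up the *current*
    # device pointer of each ndarray argument (id() stands in for the
    # device address in this sketch).
    return {name: id(arr) for name, arr in ndarrays.items()}

def replay(ndarrays):
    # Re-resolve pointers and re-upload them before every replay so the
    # captured kernel nodes see the caller's current arrays.
    persistent_arg_buffer.update(resolve_ctx_ndarray_ptrs(ndarrays))
    return dict(persistent_arg_buffer)

a, b = [1, 2], [3, 4]
replay({"x": a})              # first call, as at capture time
args_seen = replay({"x": b})  # different ndarray after capture
```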
Apply lint formatting fixes (clang-format, ruff) and remove cuda_graph flag from autodiff adjoint kernel until the interaction with reverse-mode AD is validated.
Opus 4.6 review:
- The graph path doesn't copy the result buffer back to the host, so struct returns would silently return stale data. Error early instead of producing wrong results.
- Verifies that calling a cuda_graph=True kernel first with small arrays then with larger ones produces correct results for all elements — catches stale grid dims if the graph were incorrectly replayed from the first capture.
- Re-add documentation comments for |transfers|, |device_ptrs|, zero-sized array handling, external array logic, and the host copy-back section in the non-graph launch path.
- Verify that a cuda_graph=True kernel works correctly after a reset/reinit cycle — exercises the full teardown and rebuild of the KernelLauncher and its graph cache.
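A plain-Python model of the stale-grid-dims failure mode one of the review items above guards against: if a replay reused the launch grid captured on the first (small) call, elements past the old size would never be written. Purely illustrative, not the actual test code.

```python
def launch(out, grid):
    # `grid` plays the role of the kernel's launch dimensions.
    for i in range(grid):
        out[i] = 1

big = [0] * 8
launch(big, len(big))   # correctly re-resolved grid: every element written

stale = [0] * 8
launch(stale, 4)        # grid stale from a 4-element first capture:
                        # the tail of the larger array is never touched
```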
For opus review:
Missing some doc.
> # CUDA Graph
>
> CUDA graphs reduce kernel launch overhead by capturing a sequence of GPU operations into a graph, then replaying it in a single launch. This is most beneficial for kernels that compile into multiple GPU tasks (e.g. kernels with multiple top-level `for` loops), where the per-task launch overhead would otherwise dominate.
This is confusing the implementation with the benefit. Let's reword.
> Use `cuda_graph=True` on kernels that:
>
> - Run on CUDA (`arch=qd.cuda`)
This makes it appear that you can only run these kernels on CUDA. You can run them elsewhere, but it falls back to the non-graph path on other platforms.
I cut this entire section.
> Use `cuda_graph=True` on kernels that:
>
> - Run on CUDA (`arch=qd.cuda`)
> - Contain **two or more** top-level `for` loops (i.e. compile into multiple offloaded tasks)
Again, mixing the implementation with the use case.
> - **No struct return values.** Kernels that return values (e.g. `-> qd.i32`) cannot use CUDA graphs. An error is raised if `cuda_graph=True` is set on such a kernel.
> - **Primal kernels only.** The `cuda_graph=True` flag is applied to the primal (forward) kernel only, not its adjoint. Autodiff kernels use the normal launch path.
> - **Non-CUDA backends.** On non-CUDA backends (CPU, Vulkan, Metal), `cuda_graph=True` is silently ignored. This means you can annotate a kernel unconditionally and it will work on all platforms.
This isn't really a 'restriction', I feel.
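The "silently ignored off CUDA" behavior discussed above can be sketched in plain Python: the per-kernel flag flows from the decorator down to the launcher, which only acts on it when the active backend is CUDA. Names are illustrative, not the Quadrants internals.

```python
def make_launcher(backend):
    # `backend` stands in for the runtime's active arch.
    def launch(cuda_graph=False):
        # The flag only takes effect on CUDA; elsewhere it is a no-op,
        # so kernels can be annotated unconditionally.
        use_graph = cuda_graph and backend == "cuda"
        return "graph" if use_graph else "normal"
    return launch

cuda_launch = make_launcher("cuda")
cpu_launch = make_launcher("x64")
```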
> ### Fields as arguments
>
> Fields (SNode-backed data created with `qd.field`) are accessed through the global runtime pointer, not through the kernel argument buffer. The graph captures this pointer at build time, so fields work transparently with CUDA graphs.
Remove this paragraph; too technical.
> ---
>
> ## Advanced: Implementation Details
Let's remove the advanced implementation details.
> };
>
> struct CachedCudaGraph {
>   void *graph_exec{nullptr};
Let's add a comment saying what `graph_exec` is.
> const std::vector<std::pair<int, Callable::Parameter>> &parameters);
> bool launch_llvm_kernel_graph(Handle handle, LaunchContextBuilder &ctx);
> std::vector<Context> contexts_;
> std::unordered_map<int, CachedCudaGraph> cuda_graph_cache_;
Let's add a comment saying what the index to this map is.
The < 2 tasks guard was rejecting single-loop kernels from the graph path even when the user explicitly requested cuda_graph=True. Relax the check to only skip empty task lists, respecting the user's intent.
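The relaxed guard can be sketched in plain Python (illustrative only; the real check lives in the C++ launcher): only an empty task list skips the graph path now, so a single-loop kernel with `cuda_graph=True` is still captured.

```python
def should_use_graph(num_tasks, cuda_graph_requested):
    if not cuda_graph_requested:
        return False
    # Old check `num_tasks < 2` rejected single-loop kernels even when
    # the user explicitly opted in; the new check only skips kernels
    # with nothing to capture.
    return num_tasks > 0
```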
…on unwrapping
…Allocation unwrapping" This reverts commit c94b41c.
…olving
Refactored CUDA graph logic into a separate class, to keep things cleaner.
cuda_graph requires all ndarrays to be device-resident. Previously this silently fell back to the non-graph path; now it throws a clear error message.
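A hedged sketch of the behavior change just described: instead of a silent fallback to the non-graph path, a host-resident ndarray now raises a clear error. The helper name and message are hypothetical, not the actual Quadrants code.

```python
def check_ndarrays_device_resident(ndarrays):
    # `ndarrays` maps argument name -> whether it lives on the device.
    for name, on_device in ndarrays.items():
        if not on_device:
            raise RuntimeError(
                f"cuda_graph=True requires ndarray '{name}' to be "
                "device-resident; move it to the GPU first")

check_ndarrays_device_resident({"x": True})  # device-resident: passes
try:
    check_ndarrays_device_resident({"y": False})
    raised = False
except RuntimeError as e:
    raised = True
    message = str(e)
```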
>   return false;
> }
>
> QD_ERROR_IF(ctx.result_buffer_size > 0,
Just curious why this is not supported?
What Opus says:

> CUDA graphs capture a fixed sequence of GPU operations and replay them. When a kernel returns a struct value, the result needs to be copied from a device result buffer back to the host after execution. The problem is that during graph replay:
>
> - The graph writes results to the persistent result buffer (baked into the graph at capture time)
> - But the host-side code that reads the return value expects it at the original buffer location from the launch context
>
> The result buffer address mismatch means the host would read stale/wrong data after replay. Fixing this would require updating the result buffer pointer on the host side after each replay, or adding a D2H copy node into the graph — neither of which is implemented.
>
> That said, kernels with struct return values are uncommon in practice (most kernels write to ndarrays instead), so it hasn't been a priority.

Note that adding a D2H copy would incur latency, I think?
Actually, why would we want to return a struct from a GPU kernel? Wouldn't this always incur latency and cause a GPU pipeline stall?
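A toy model of the address mismatch discussed in this thread: the captured graph writes the return value into the persistent result buffer baked in at capture time, while the host reads the original launch-context buffer, so it would see stale data. Purely illustrative.

```python
persistent_result_buffer = {"ret": 0}  # address baked into the graph
launch_ctx_buffer = {"ret": 0}         # where the host expects the result

def graph_replay():
    # The kernel node writes its return value to the captured address.
    persistent_result_buffer["ret"] = 42

graph_replay()
host_read = launch_ctx_buffer["ret"]  # stale: the 42 never arrived here
```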
> import numpy as np
It seems that we only have test cases for one kernel (though two loops are decomposed into two offloaded tasks). Shall we add test cases for multiple kernels?
What do you mean by 'multiple kernels'? I'm only intending to support cuda graph for a single Quadrants kernel, containing multiple top-level loops. This is an intentional design decision.
I see. I was thinking of a large graph which can record all kernel launches together. But if it is designed for a single kernel, then the tests look good to me.
Yeah, so what I'm imagining in my head is that instead of e.g.
```python
@qd.kernel
def linesearch_kernel(...):
    for i_b in range(B):
        ...

@qd.kernel
def hessian_kernel(...):
    for i_b in range(B):
        ...

@qd.kernel
def check_not_done_kernel(...):
    for i_b in range(B):
        ...

def python_solver_loop(...):
    while not_done:
        linesearch_kernel(...)
        hessian_kernel(...)
        not_done = check_not_done_kernel(...)
```

We'd have:

```python
@qd.func
def linesearch_func(...):
    for i_b in range(B):
        ...

@qd.func
def hessian_func(...):
    for i_b in range(B):
        ...

@qd.func
def check_not_done_func(...):
    for i_b in range(B):
        ...

@qd.kernel(graph_while=not_done)
def graph_solver_loop(not_done, ...):
    linesearch_func(...)
    hessian_func(...)
    check_not_done_func(...)
```

Thoughts? (Note that changing how this looks might not be tons of work, depending on the desired approach, so feel free to propose how you are imagining this might look.)
What I can do, if you want, is add some tests that call `@qd.func`s rather than having 'naked' for loops inside the cuda graph function?
See `test_cuda_graph_multi_func`.
Expose the number of offloaded tasks (parallel-for loops) as a compile-time property on CompiledKernelData, captured per launch in Program and accessible via pybind11. Assert expected task counts in cuda graph tests.
Track the number of kernel nodes in each cached CUDA graph and expose it via get_cuda_graph_num_nodes_on_last_call() through the full CudaGraphManager → KernelLauncher → Program → pybind11 chain. Assert node counts match offloaded task counts in all cuda graph tests.
Test a kernel that calls three @qd.func with 2, 4, and 3 top-level for loops respectively, asserting 9 offloaded tasks and 9 graph nodes.
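The expected counts on the CUDA graph path reduce to simple arithmetic: with three `@qd.func` bodies of 2, 4, and 3 top-level loops inlined into one kernel, one graph node is expected per offloaded task. (As noted further down, other backends can add extra serial tasks, so exact equality only holds for the CUDA path.)

```python
loops_per_func = [2, 4, 3]
num_offloaded_tasks = sum(loops_per_func)   # each top-level loop is a task
num_graph_nodes = num_offloaded_tasks       # one kernel node per task
```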
32b8d65 to 470b073
The LLVM x64 backend generates extra tasks per ndarray argument for serialization/setup, so exact equality checks fail. Use >= instead.
Ndarray kernels can produce additional serial tasks beyond the user-visible loops, so hardcoding expected node counts breaks. Use the actual num_offloaded_tasks instead.
Resolve conflicts from squash-merged MVP-1 PR (#405) vs branch's pre-existing MVP-1 merge commits. Keep all graph_do_while (MVP-2) additions. Incorporate grad_ptr local variable cleanup from main.