
[Perf] CUDA Graph 1: unconditional graphs#405

Merged
hughperkins merged 52 commits into main from hp/cuda-graph-mvp-1-graph-build
Mar 16, 2026

Conversation

@hughperkins
Collaborator

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

When QD_CUDA_GRAPH=1, kernels with 2+ top-level for loops (offloaded
tasks) are captured into a CUDA graph on first launch and replayed on
subsequent launches, eliminating per-kernel launch overhead.

Uses the explicit graph node API (cuGraphAddKernelNode) with persistent
device arg/result buffers. Assumes stable ndarray device pointers.

Made-with: Cursor
Replace the global QD_CUDA_GRAPH=1 env var with a per-kernel opt-in.
The flag flows from the Python decorator through LaunchContextBuilder
to the CUDA kernel launcher, avoiding interference with internal
kernels like ndarray_to_ext_arr.

Made-with: Cursor
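A minimal sketch of how a per-kernel flag like this can flow from the decorator to the launcher (illustrative Python only — `kernel` and `launch` below are stand-ins invented for this sketch, not the actual Quadrants API):

```python
# Toy model of the per-kernel opt-in. Internal kernels are never decorated
# with cuda_graph=True, so they are untouched by the graph path.
def kernel(cuda_graph=False):
    def wrap(fn):
        fn.cuda_graph = cuda_graph  # carried to the launcher at call time
        return fn
    return wrap

def launch(fn):
    # The launcher attempts graph capture only when the kernel opted in.
    return "graph path" if getattr(fn, "cuda_graph", False) else "normal path"

@kernel(cuda_graph=True)
def user_kernel():
    pass

@kernel()
def ndarray_to_ext_arr():  # stands in for an internal kernel
    pass

assert launch(user_kernel) == "graph path"
assert launch(ndarray_to_ext_arr) == "normal path"
```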
Verify that cuda_graph=True is a harmless no-op on non-CUDA backends
(tested on x64/CPU). Passes on both x64 and CUDA.

Made-with: Cursor
On each graph replay, re-resolve ndarray device pointers and re-upload
the arg buffer to the persistent device buffer. This ensures correct
results when the kernel is called with different ndarrays after the
graph was first captured.

Refactored ndarray pointer resolution into resolve_ctx_ndarray_ptrs().

Made-with: Cursor
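A toy illustration of the replay-time fix described above (the dict and helpers below are invented for the sketch; the real code re-resolves CUDA device pointers and re-uploads the argument buffer):

```python
# Before each replay, re-resolve the current ndarrays' pointers and
# re-upload them into the persistent buffer the graph was captured against.
persistent_arg_buffer = {}  # stands in for the persistent device arg buffer

def resolve_ctx_ndarray_ptrs(ndarrays):
    # The real helper returns each ndarray's device pointer; id() is a stand-in.
    return [id(a) for a in ndarrays]

def replay(ndarrays):
    # The re-upload happens on EVERY replay, not only at capture time,
    # so calling with different ndarrays after capture stays correct.
    persistent_arg_buffer["ptrs"] = resolve_ctx_ndarray_ptrs(ndarrays)
    return persistent_arg_buffer["ptrs"]

a, b = [1.0, 2.0], [3.0, 4.0]
replay([a])                     # "capture" call
assert replay([b]) == [id(b)]   # later call with a different ndarray
```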
Apply lint formatting fixes (clang-format, ruff) and remove
cuda_graph flag from autodiff adjoint kernel until the interaction
with reverse-mode AD is validated.
@hughperkins
Collaborator Author

Opus 4.6 review:

Review: hp/cuda-graph-mvp-1-graph-build

Overview

The branch adds CUDA graph capture/replay for multi-task kernels. Instead of launching each offloaded task (top-level for loop) as a separate kernel, it builds a CUDA graph using the explicit node API (cuGraphAddKernelNode) on first invocation and replays it on subsequent calls. This eliminates per-kernel launch overhead for multi-task kernels.
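The build-once/replay-afterwards contract can be modeled in a few lines (a pure-Python toy; `GraphCache` and friends are invented for illustration, the real logic lives in the C++ launcher):

```python
# Toy model: the first launch pays the capture cost, later launches replay.
class GraphCache:
    def __init__(self):
        self._graphs = {}  # launch_id -> captured task list ("graph")
        self.builds = 0
        self.replays = 0

    def launch(self, launch_id, tasks):
        if launch_id not in self._graphs:
            self._graphs[launch_id] = list(tasks)  # capture once
            self.builds += 1
        else:
            self.replays += 1
        for task in self._graphs[launch_id]:  # one launch runs all tasks
            task()

results = []
cache = GraphCache()
tasks = [lambda: results.append("loop0"), lambda: results.append("loop1")]
for _ in range(3):
    cache.launch(launch_id=7, tasks=tasks)
assert cache.builds == 1 and cache.replays == 2 and len(results) == 6
```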

Commit progression

The commit history tells a clean story:

  1. Initial MVP gated behind QD_CUDA_GRAPH=1 env var
  2. Bug fixes
  3. Replaced env var with per-kernel @qd.kernel(cuda_graph=True) API — much better design
  4. Cross-platform test (no-op on non-CUDA)
  5. Handle argument changes on replay by re-uploading the arg buffer
  6. Formatting + disable on adjoint kernels (sensible until AD interaction is validated)

What looks good

  • Clean RAII: CachedCudaGraph has proper move semantics, deleted copy, and a destructor that frees the graph exec and device buffers.
  • Graceful fallback: If < 2 tasks, launch_llvm_kernel_graph returns false and the normal path runs. Same if ndarray pointers aren't on-device.
  • Per-kernel opt-in: The cuda_graph=True flag flows cleanly from the Python decorator through kernel_impl.py → LaunchContextBuilder → C++ launcher. No global state.
  • Good test coverage: 6 test cases covering 2-loop, 3-loop, single-loop (no graph), no annotation, changed args, and cross-platform no-op.
  • Adjoint excluded: Only the primal kernel gets cuda_graph; the adjoint doesn't, which avoids surprises with reverse-mode AD.

Concerns

  1. Result buffer never copied back on replay. The non-graph path allocates a fresh device_result_buffer, copies it back to the host after execution, and frees it. The graph path has a persistent_device_result_buffer but never copies results back — not on capture, not on replay. If any kernel writes to result_buffer and the host reads struct return values, those reads will silently get stale/zero data. This works today only because the test kernels write to ndarrays, not return values.

  2. No graph invalidation. The graph is cached by launch_id forever. If grid/block dims could ever change between invocations (e.g., different-sized ndarrays triggering different grid dims), the cached graph would use stale launch configs. This assumption should at least be documented, or ideally validated at replay time.
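One shape the suggested replay-time validation could take (a sketch of the proposal, not the current code; `CachedGraph` here is hypothetical):

```python
# Record the launch config at capture time and check it on every replay,
# instead of silently replaying with stale grid/block dims.
class CachedGraph:
    def __init__(self, grid_dim, block_dim):
        self.grid_dim, self.block_dim = grid_dim, block_dim  # capture-time config

    def replay(self, grid_dim, block_dim):
        if (grid_dim, block_dim) != (self.grid_dim, self.block_dim):
            raise RuntimeError(
                f"cached graph built for grid={self.grid_dim}, "
                f"block={self.block_dim}; got grid={grid_dim}, block={block_dim}"
            )
        return "replayed"

g = CachedGraph(grid_dim=32, block_dim=128)
assert g.replay(32, 128) == "replayed"
try:
    g.replay(64, 128)  # e.g. a larger ndarray changed the grid dims
    raise AssertionError("expected RuntimeError")
except RuntimeError:
    pass
```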

  3. Linear dependency chain limits parallelism. Each graph node depends on the previous one (prev_node). If two tasks are truly independent (e.g., x[i] += 1 and y[i] += 2 on disjoint arrays), CUDA could run them in parallel within the graph, but the linear chain prevents that. Not a correctness issue, but it leaves performance on the table.
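The cost of the linear chain can be quantified by the critical path through the node dependency lists (a toy model of the dependency arrays passed to cuGraphAddKernelNode):

```python
# deps[i] lists the nodes that node i waits on. The linear chain passes
# [prev_node] for every node; independent tasks could pass [] instead.
def critical_path_len(deps):
    depth = {}
    for i, d in enumerate(deps):  # nodes listed in topological order
        depth[i] = 1 + max((depth[j] for j in d), default=0)
    return max(depth.values())

linear = [[], [0], [1]]    # current scheme: every node waits on the previous
parallel = [[], [], []]    # fully independent tasks: no dependencies
assert critical_path_len(linear) == 3    # tasks serialize
assert critical_path_len(parallel) == 1  # tasks could overlap
```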

  4. External array host→device transfers not supported in graph path. resolve_ctx_ndarray_ptrs returns false if an external array isn't already on device, which falls back to the non-graph path. That's fine as a fallback, but worth noting as a scope limitation.

  5. Comment removal. Several useful comments were stripped from the non-graph launch_llvm_kernel path (explaining transfers, device_ptrs, the zero-sized array edge case, etc.). These seem like collateral damage from the diff rather than intentional, and they lose context for future readers.

  6. persistent_ctx.runtime could go stale. The LLVM runtime pointer is captured at graph creation time. If the runtime were ever reallocated, this would dangle. Probably fine in practice since the runtime is long-lived, but it's a hidden assumption.

Summary

Solid MVP. The architecture is clean, the API is well-designed, and the test coverage is good. The main thing to address before merging is concern #1 (result buffer not copied back) — it's a silent-corruption bug waiting to happen when someone uses cuda_graph=True on a kernel with struct return values. The rest are either acceptable scope limitations or minor polish.

The graph path doesn't copy the result buffer back to the host,
so struct returns would silently return stale data. Error early
instead of producing wrong results.
Verifies that calling a cuda_graph=True kernel first with small
arrays then with larger ones produces correct results for all
elements — catches stale grid dims if the graph were incorrectly
replayed from the first capture.
Re-add documentation comments for |transfers|, |device_ptrs|,
zero-sized array handling, external array logic, and the
host copy-back section in the non-graph launch path.
Verify that a cuda_graph=True kernel works correctly after a
reset/reinit cycle — exercises the full teardown and rebuild
of the KernelLauncher and its graph cache.
@hughperkins
Collaborator Author

For opus review:

  1. added exception for this case
  2. added test for this case
  3. out of scope
  4. Doc seems to be in the next PR, so ignoring this for now
  5. comments put back
  6. test added

@hughperkins hughperkins changed the title [Perf] Add CUDA Graph, part 1 [Perf] CUDA Graph, part 1 Mar 12, 2026
@hughperkins hughperkins changed the title [Perf] CUDA Graph, part 1 [Perf] CUDA Graph part 1: unconditional graphs Mar 12, 2026
@hughperkins hughperkins changed the title [Perf] CUDA Graph part 1: unconditional graphs [Perf] CUDA Graph 1: unconditional graphs Mar 12, 2026
@hughperkins hughperkins marked this pull request as draft March 12, 2026 04:59
@hughperkins
Collaborator Author

Missing some doc.

Comment thread docs/source/user_guide/cuda_graph.md Outdated
@@ -0,0 +1,109 @@
# CUDA Graph

CUDA graphs reduce kernel launch overhead by capturing a sequence of GPU operations into a graph, then replaying it in a single launch. This is most beneficial for kernels that compile into multiple GPU tasks (e.g. kernels with multiple top-level `for` loops), where the per-task launch overhead would otherwise dominate.
Collaborator Author


This is confusing the implementation with the benefit. Let's reword.

Comment thread docs/source/user_guide/cuda_graph.md Outdated

Use `cuda_graph=True` on kernels that:

- Run on CUDA (`arch=qd.cuda`)
Collaborator Author


This makes it appear that you can only run these kernels on CUDA. You can run them elsewhere, but they fall back to the non-graph path on other platforms.

Collaborator Author


I cut this entire section

Comment thread docs/source/user_guide/cuda_graph.md Outdated
Use `cuda_graph=True` on kernels that:

- Run on CUDA (`arch=qd.cuda`)
- Contain **two or more** top-level `for` loops (i.e. compile into multiple offloaded tasks)
Collaborator Author


Again, mixing the implementation with the use case.

Comment thread docs/source/user_guide/cuda_graph.md Outdated

- **No struct return values.** Kernels that return values (e.g. `-> qd.i32`) cannot use CUDA graphs. An error is raised if `cuda_graph=True` is set on such a kernel.
- **Primal kernels only.** The `cuda_graph=True` flag is applied to the primal (forward) kernel only, not its adjoint. Autodiff kernels use the normal launch path.
- **Non-CUDA backends.** On non-CUDA backends (CPU, Vulkan, Metal), `cuda_graph=True` is silently ignored. This means you can annotate a kernel unconditionally and it will work on all platforms.
Collaborator Author


This isn't really a 'restriction', I feel.

Comment thread docs/source/user_guide/cuda_graph.md Outdated

### Fields as arguments

Fields (SNode-backed data created with `qd.field`) are accessed through the global runtime pointer, not through the kernel argument buffer. The graph captures this pointer at build time, so fields work transparently with CUDA graphs.
Collaborator Author


remove this paragraph. too technical

Comment thread docs/source/user_guide/cuda_graph.md Outdated

---

## Advanced: Implementation Details
Collaborator Author


Let's remove the advanced implementation details

};

struct CachedCudaGraph {
void *graph_exec{nullptr};
Collaborator Author


Let's add a comment saying what graph_exec is

const std::vector<std::pair<int, Callable::Parameter>> &parameters);
bool launch_llvm_kernel_graph(Handle handle, LaunchContextBuilder &ctx);
std::vector<Context> contexts_;
std::unordered_map<int, CachedCudaGraph> cuda_graph_cache_;
Collaborator Author


Let's add a comment saying what the key of this map is

@hughperkins
Collaborator Author

Refactored the CUDA graph logic into a separate class, to keep things cleaner.

Made-with: Cursor
cuda_graph requires all ndarrays to be device-resident. Previously
this silently fell back to the non-graph path; now it throws a clear
error message.

Made-with: Cursor
return false;
}

QD_ERROR_IF(ctx.result_buffer_size > 0,
Collaborator


Just curious why this is not supported?

Collaborator Author


What opus says:

" CUDA graphs capture a fixed sequence of GPU operations and replay them. When a kernel returns a struct value, the result needs to be copied from a device result buffer back to
the host after execution. The problem is that during graph replay:

  1. The graph writes results to the persistent result buffer (baked into the graph at capture time)
  2. But the host-side code that reads the return value expects it at the original buffer location from the launch context

"The result buffer address mismatch means the host would read stale/wrong data after replay. Fixing this would require updating the result buffer pointer on the host side after
each replay, or adding a D2H copy node into the graph — neither of which is implemented.
That said, kernels with struct return values are uncommon in practice (most kernels write to ndarrays instead), so it hasn't been a priority."

Note that adding a D2H copy would incur latency I think?
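The mismatch can be shown in miniature (pure-Python illustration; the two dicts stand in for the persistent device buffer baked into the graph and the per-launch host-side buffer):

```python
# The graph writes into the persistent buffer captured at build time...
persistent_result_buffer = {"value": 0}

def graph_replay():
    persistent_result_buffer["value"] += 1  # kernel's struct-return write

# ...but the host reads the per-launch context buffer, which nothing updates.
launch_ctx_result_buffer = {"value": 0}

graph_replay()
assert persistent_result_buffer["value"] == 1  # result landed here
assert launch_ctx_result_buffer["value"] == 0  # host sees stale data
```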

Collaborator Author


Actually, why would we want to return a struct from a GPU kernel? Wouldn't this always incur latency, and cause a GPU pipeline stall?

@@ -0,0 +1,304 @@
import numpy as np
Collaborator


It seems that we only have test cases for one kernel (though two loops are decomposed into two offloaded tasks). Shall we add test cases for multiple kernels?

Collaborator Author


What do you mean by 'multiple kernels'? I'm only intending to support CUDA graphs for a single Quadrants kernel containing multiple top-level loops. This is an intentional design decision.

Collaborator


I see. I was thinking of a large graph which can record all kernel launches together. But if it is designed for a single kernel, then the tests look good to me.

Collaborator Author


Yeah, so what I'm imagining in my head is that instead of, e.g.,

@qd.kernel
def linesearch_kernel(...):
    for i_b in range(B):
        ...

@qd.kernel
def hessian_kernel(...):
    for i_b in range(B):
        ...

@qd.kernel
def check_not_done_kernel(...):
    for i_b in range(B):
        ...

def python_solver_loop(...):
      while not_done:
           linesearch_kernel(...)
           hessian_kernel(...)
           not_done = check_not_done_kernel(...)

We'd have:

@qd.func
def linesearch_func(...):
    for i_b in range(B):
        ...

@qd.func
def hessian_func(...):
    for i_b in range(B):
        ...

@qd.func
def check_not_done_func(...):
    for i_b in range(B):
        ...

@qd.kernel(graph_while=not_done)
def graph_solver_loop(not_done, ...):
     linesearch_func(...)
     hessian_func(...)
     check_not_done_func(...)

Thoughts? (Note that changing how this looks might not be tons of work, depending on the desired approach, so feel free to propose how you are imagining this might look)

Collaborator Author


What I can do, if you want, is add some tests that call @qd.funcs, rather than having 'naked' for loops inside the cuda graph function?

Collaborator Author


see test_cuda_graph_multi_func

Expose the number of offloaded tasks (parallel-for loops) as a
compile-time property on CompiledKernelData, captured per launch in
Program and accessible via pybind11. Assert expected task counts in
cuda graph tests.

Made-with: Cursor
Track the number of kernel nodes in each cached CUDA graph and expose
it via get_cuda_graph_num_nodes_on_last_call() through the full
CudaGraphManager → KernelLauncher → Program → pybind11 chain. Assert
node counts match offloaded task counts in all cuda graph tests.

Made-with: Cursor
Test a kernel that calls three @qd.func with 2, 4, and 3 top-level
for loops respectively, asserting 9 offloaded tasks and 9 graph nodes.

Made-with: Cursor
@hughperkins hughperkins force-pushed the hp/cuda-graph-mvp-1-graph-build branch from 32b8d65 to 470b073 Compare March 14, 2026 20:28
The LLVM x64 backend generates extra tasks per ndarray argument for
serialization/setup, so exact equality checks fail. Use >= instead.

Made-with: Cursor
Ndarray kernels can produce additional serial tasks beyond the
user-visible loops, so hardcoding expected node counts breaks.
Use the actual num_offloaded_tasks instead.
@hughperkins hughperkins merged commit aada310 into main Mar 16, 2026
47 checks passed
@hughperkins hughperkins deleted the hp/cuda-graph-mvp-1-graph-build branch March 16, 2026 16:14
hughperkins added a commit that referenced this pull request Mar 16, 2026
Resolve conflicts from squash-merged MVP-1 PR (#405) vs branch's
pre-existing MVP-1 merge commits. Keep all graph_do_while (MVP-2)
additions. Incorporate grad_ptr local variable cleanup from main.