
[MISC] Add GPU graph to decomposed solver to reduce kernel launch latency.#2621

Merged
duburcqa merged 27 commits into Genesis-Embodied-AI:main from hughperkins:hp/cuda-graph
Mar 31, 2026

Conversation


@hughperkins commented Mar 29, 2026

Description

Add GPU graph to decomposed solver to reduce kernel launch latency.

  • on SM90+, the entire loop, including the check for whether to continue, runs on the GPU
  • on older CUDA GPUs, the body of the loop is launched by a single CUDA call, and the entire body runs on the GPU with no further interaction with the host
    • checking for loop termination happens on the host side, in C++, and requires copying the loop variable from GPU to CPU
  • on other GPUs, the body of the loop runs like any other Quadrants kernel with multiple top-level loops:
    • that is, each top-level loop is launched from the C++ side as a separate GPU kernel
    • as with pre-SM90 CUDA GPUs, checking for loop termination happens on the C++ side and incurs a GPU-to-CPU copy of the integer counter each iteration
  • one caveat:
    • while the Quadrants GPU graph itself can run on all backends,
    • the current decomposed Hessian kernel only works with shared memory, and so doesn't work on CPU
    • in addition, the decomposed solver on main is set to only run on CUDA
      • we'll preserve this behavior on this branch for now

After this PR, the decomposed solver always uses the GPU graph, with one exception: when autodiff is needed.
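To make the description above concrete, here is a minimal sketch of the structure this PR moves to. The `qd.*` names are taken from the commit messages below; the helper names, argument types, and the exact `graph_do_while` call shape are assumptions for illustration, not the actual code.

```python
import quadrants as qd  # import name assumed

@qd.kernel(cuda_graph=True)  # capture the whole solver body as one GPU graph
def _kernel_solve_decomposed(graph_counter):  # scalar i32 ndarray: iteration counter
    # graph_do_while re-launches the captured body while the counter is
    # non-zero (do-while semantics); the call shape here is an assumption.
    # On SM90+, the continue check itself runs on the GPU; on older CUDA
    # GPUs, the host re-reads the counter each iteration (one i32 GPU->CPU
    # copy); on other backends, each top-level loop inside the body is
    # launched as a separate kernel.
    with qd.graph_do_while(graph_counter):
        _func_solve_step()                     # each top-level for-loop = one graph node
        _func_check_early_exit(graph_counter)  # decrements / zeroes the counter
```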

Related Issue

Resolves Genesis-Embodied-AI/Genesis#

Motivation and Context

How Has This Been / Can This Be Tested?

Screenshots (if appropriate):

Checklist:

  • I read the CONTRIBUTING document.
  • I followed the Submitting Code Changes section of CONTRIBUTING document.
  • I tagged the title correctly (including BUG FIX/FEATURE/MISC/BREAKING)
  • I updated the documentation accordingly or no change is needed.
  • I tested my changes and added instructions on how to test it for reviewers.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Aligns with the convention used by all other files in the constraint solver package.
Scalar i32 ndarray used by graph_do_while for GPU-side iteration control in the decomposed constraint solver. graph_do_while requires the same physical ndarray on every call, so this must not use V_ANNOTATION, which could resolve to qd.template.
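In sketch form, the constraint described in that commit amounts to the following (the allocation call and attribute name are assumptions):

```python
# graph_do_while needs the same physical buffer on every call, so the
# counter is allocated once as a concrete scalar i32 ndarray; declaring it
# via V_ANNOTATION could resolve to qd.template and break graph capture.
self.graph_counter = qd.ndarray(dtype=qd.i32, shape=(1,))  # call shape assumed
```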
Replace Python iteration loop + separate @qd.kernel calls with a single
@qd.kernel(cuda_graph=True) that uses graph_do_while for GPU-side iteration.

Each solver step is now a @qd.func whose top-level for-loops become separate
nodes in the CUDA graph. A new _func_check_early_exit decrements the counter
and sets it to 0 when all batch elements have converged, matching the monolith's
early-exit behavior.
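A sketch of that early-exit step as described (serial form; everything beyond the `_func_check_early_exit` name, the `improved` array, and `@qd.func` is an assumption):

```python
@qd.func
def _func_check_early_exit(graph_counter, improved, n_batch: int):
    # Decrement the remaining-iteration counter, and force it to 0 once no
    # batch element improved this iteration, matching the monolith solver's
    # early-exit behavior. A zero counter terminates graph_do_while.
    graph_counter[0] -= 1  # scalar-ndarray indexing convention assumed
    any_improved = 0
    for i_b in range(n_batch):  # serial scan (later commits parallelize, then revert)
        if improved[i_b] != 0:
            any_improved = 1
    if any_improved == 0:
        graph_counter[0] = 0
```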
Better reflects the do-while condition semantics of graph_do_while.
CUDA graph does not support autograd (ndarray args with non-null gradient
pointers are rejected). Fall back to monolith solver in differentiable mode.
The is_compatible condition had != instead of ==, causing the
CUDA-graph decomposed solver to be selected on CPU/Metal where
qd.simt.block.sync() is unsupported.
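The corrected gate, reconstructed from the review diff further down (the enclosing function shape is assumed; the condition itself appears verbatim in the diff):

```python
import genesis as gs  # gs.cuda backend constant, per the diff context below

def is_compatible(static_rigid_sim_config) -> bool:
    # CUDA graph rejects ndarray args with non-null gradient pointers, so
    # differentiable mode falls back to the monolith solver; and the kernel
    # uses qd.simt.block.sync(), which is unsupported on CPU/Metal, so only
    # the CUDA backend qualifies. The fix is `==` here; `!=` wrongly
    # selected this solver on CPU/Metal.
    return (
        not static_rigid_sim_config.requires_grad
        and static_rigid_sim_config.backend == gs.cuda
    )
```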
@github-actions

⚠️ Abnormal Benchmark Result Detected ➡️ Report

Replace serial loop over _B batch elements (single thread reading
4096 booleans sequentially) with parallel threads that each check
one improved[i_b] and store 1 into a shared flag.

Adds early_exit_flag as a V(dtype=qd.i32) field (not ndarray) to
ConstraintState so it can be accessed in Quadrants scope via the
struct directly, unlike graph_counter which must be ndarray for
graph_do_while.

dex_hand bs=4096 decomposed: 8,399 → 13,458 FPS (+60%)
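A sketch of that parallel variant (loop and field-access syntax assumed; note the plain store, which the next commits first make atomic and then revert entirely):

```python
@qd.func
def _func_check_early_exit(constraint_state, graph_counter, n_batch: int):
    constraint_state.early_exit_flag = 0          # V(dtype=qd.i32) field, not ndarray
    for i_b in range(n_batch):                    # parallel: one thread per batch element
        if constraint_state.improved[i_b] != 0:
            constraint_state.early_exit_flag = 1  # racy plain store (see next commit)
    graph_counter[0] -= 1
    if constraint_state.early_exit_flag == 0:     # no element improved: stop iterating
        graph_counter[0] = 0
```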
@hughperkins marked this pull request as ready for review March 30, 2026 02:33
pyproject.toml pins quadrants==0.5.0b1, which uv refuses to install without --prerelease=allow.
The --prerelease=allow flag was pulling in dev/rc versions of pyglet
(3.0.dev2), numba (0.65.0rc1), llvmlite (0.47.0rc1), pydantic
(2.13.0b2), and others. pyglet 3.0.dev2 is the most likely cause of
the EGL_BAD_ALLOC crashes during test setup.

The flag is unnecessary: uv already installs exact-pinned pre-releases
like quadrants==0.5.0b1 without it.
The parallel for loop in _func_check_early_exit writes to the same
scalar from multiple threads. Use qd.atomic_max instead of plain
assignment to avoid a data race.
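In sketch form, the one-line fix (same assumed names as above):

```python
# Before (racy): many threads plain-store to the same scalar
constraint_state.early_exit_flag = 1
# After: atomic read-modify-write gives the concurrent writes defined behavior
qd.atomic_max(constraint_state.early_exit_flag, 1)
```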
The parallel early exit check introduced a separate kernel and
early_exit_flag field. Revert to the original serial loop inside
_func_check_early_exit that checks improved[i_b] sequentially.

Made-with: Cursor
@duburcqa (Collaborator)

current decomposed Hessian kernel only works with shared memory, and so decomp on main is set to only run for CUDA

This is not a good reason. Shared memory is already perfectly supported on Apple Metal, Vulkan, and AMDGPU. What is not supported is atomics on shared memory, and non-32-bit dtypes.

Comment thread on pyproject.toml (outdated)

dependencies = [
    "psutil",
-   "quadrants==0.4.5",
+   "quadrants==0.5.0b1",
Collaborator

Beta.

duburcqa previously approved these changes Mar 30, 2026
constraint_state,
rigid_global_info,
static_rigid_sim_config,
_n_iterations: not static_rigid_sim_config.requires_grad and static_rigid_sim_config.backend == gs.cuda,
Collaborator

Let's see #2623

Collaborator

Merged.

Collaborator Author

Got it. Will take a look.

@hughperkins (Collaborator Author)

current decomposed Hessian kernel only works with shared memory, and so decomp on main is set to only run for CUDA

This is not a good reason. Shared memory is already perfectly supported on Apple Metal, Vulkan, and AMDGPU. What is not supported is atomics on shared memory, and non-32-bit dtypes.

Either way:

  • the tiled Hessian won't run on Mac or CPU
  • the goal of this PR is not to turn on decomp on new platforms, but to migrate decomp to use the GPU graph

@duburcqa (Collaborator)

the tiled Hessian won't run on Mac or CPU

Why? It was always supported on Mac.

@duburcqa (Collaborator)

the goal of this PR is not to turn on decomp on new platforms, but to migrate decomp to use the GPU graph

Sure. But then update the documentation accordingly, instead of mentioning shared-memory-related issues.

@hughperkins (Collaborator Author)

Note: the failures are pending a new pre-release of quadrants that addresses https://github.com/Genesis-Embodied-AI/quadrants/pulls

Resolve conflict in solver_breakdown.py: keep cuda-graph branch's
GPU graph functions and cuda-specific is_compatible check.
@github-actions

⚠️ Abnormal Benchmark Result Detected ➡️ Report

@hughperkins (Collaborator Author)

Woo, +50% FPS on dex_hand 🙌

@duburcqa merged commit 788bfd5 into Genesis-Embodied-AI:main Mar 31, 2026
22 of 23 checks passed
@github-actions

⚠️ Abnormal Benchmark Result Detected ➡️ Report

@hughperkins deleted the hp/cuda-graph branch May 3, 2026 16:22