
[MISC] Add GPU graph to decomposed solver to reduce kernel launch latency.#2621

Merged
duburcqa merged 27 commits into Genesis-Embodied-AI:main from hughperkins:hp/cuda-graph
Mar 31, 2026

Conversation


@hughperkins commented Mar 29, 2026

Description

Add GPU graph to decomposed solver to reduce kernel launch latency.

  • on SM90+, the entire loop, including the check for whether to continue, runs on the GPU
  • on older CUDA GPUs, the body of the loop is launched by a single CUDA call, and the entire body runs on the GPU with no further interaction with the host
    • checking for loop termination happens on the host side, in C++, and requires copying the loop variable from GPU to CPU
  • on other GPUs, the body of the loop runs like any other Quadrants kernel with multiple top-level loops:
    • that is, each top-level loop is launched from the C++ side as a separate GPU kernel
    • as with pre-SM90 CUDA GPUs, checking for loop termination happens on the C++ side and incurs a GPU-to-CPU copy of the integer counter each iteration
  • one caveat:
    • while the Quadrants GPU graph itself can run on all backends,
    • the current decomposed Hessian kernel only works with shared memory, and so doesn't work on CPU
    • in addition, the decomposed solver on main is set to only run on CUDA
      • we'll preserve this behavior on this branch for now

After this PR, the decomposed solver always uses the GPU graph, with one exception: when autodiff is needed.
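To make the description above concrete, here is a minimal sketch of the structure this PR moves to. The `qd.*` names are taken from the commit messages below; the helper names, argument types, and the exact `graph_do_while` call shape are assumptions for illustration, not the actual code.

```python
import quadrants as qd  # import name assumed

@qd.kernel(cuda_graph=True)  # capture the whole solver body as one GPU graph
def _kernel_solve_decomposed(graph_counter):  # scalar i32 ndarray: iteration counter
    # graph_do_while re-launches the captured body while the counter is
    # non-zero (do-while semantics); the call shape here is an assumption.
    # On SM90+, the continue check itself runs on the GPU; on older CUDA
    # GPUs, the host re-reads the counter each iteration (one i32 GPU->CPU
    # copy); on other backends, each top-level loop inside the body is
    # launched as a separate kernel.
    with qd.graph_do_while(graph_counter):
        _func_solve_step()                     # each top-level for-loop = one graph node
        _func_check_early_exit(graph_counter)  # decrements / zeroes the counter
```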

Related Issue

Resolves Genesis-Embodied-AI/Genesis#

Motivation and Context

How Has This Been / Can This Be Tested?

Screenshots (if appropriate):

Checklist:

  • I read the CONTRIBUTING document.
  • I followed the Submitting Code Changes section of CONTRIBUTING document.
  • I tagged the title correctly (including BUG FIX/FEATURE/MISC/BREAKING)
  • I updated the documentation accordingly or no change is needed.
  • I tested my changes and added instructions on how to test it for reviewers.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Aligns with the convention used by all other files in the constraint solver package.
Scalar i32 ndarray used by graph_do_while for GPU-side iteration control in the decomposed constraint solver. graph_do_while requires the same physical ndarray on every call, so this must not use V_ANNOTATION, which could resolve to qd.template.
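In sketch form, the constraint described in that commit amounts to the following (the allocation call and attribute name are assumptions):

```python
# graph_do_while needs the same physical buffer on every call, so the
# counter is allocated once as a concrete scalar i32 ndarray; declaring it
# via V_ANNOTATION could resolve to qd.template and break graph capture.
self.graph_counter = qd.ndarray(dtype=qd.i32, shape=(1,))  # call shape assumed
```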
Replace Python iteration loop + separate @qd.kernel calls with a single
@qd.kernel(cuda_graph=True) that uses graph_do_while for GPU-side iteration.

Each solver step is now a @qd.func whose top-level for-loops become separate
nodes in the CUDA graph. A new _func_check_early_exit decrements the counter
and sets it to 0 when all batch elements have converged, matching the monolith's
early-exit behavior.
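A sketch of that early-exit step as described (serial form; everything beyond the `_func_check_early_exit` name, the `improved` array, and `@qd.func` is an assumption):

```python
@qd.func
def _func_check_early_exit(graph_counter, improved, n_batch: int):
    # Decrement the remaining-iteration counter, and force it to 0 once no
    # batch element improved this iteration, matching the monolith solver's
    # early-exit behavior. A zero counter terminates graph_do_while.
    graph_counter[0] -= 1  # scalar-ndarray indexing convention assumed
    any_improved = 0
    for i_b in range(n_batch):  # serial scan (later commits parallelize, then revert)
        if improved[i_b] != 0:
            any_improved = 1
    if any_improved == 0:
        graph_counter[0] = 0
```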
Better reflects the do-while condition semantics of graph_do_while.
CUDA graph does not support autograd (ndarray args with non-null gradient
pointers are rejected). Fall back to monolith solver in differentiable mode.
The is_compatible condition had != instead of ==, causing the
CUDA-graph decomposed solver to be selected on CPU/Metal where
qd.simt.block.sync() is unsupported.
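The corrected gate, reconstructed from the review diff further down (the enclosing function shape is assumed; the condition itself appears verbatim in the diff):

```python
import genesis as gs  # gs.cuda backend constant, per the diff context below

def is_compatible(static_rigid_sim_config) -> bool:
    # CUDA graph rejects ndarray args with non-null gradient pointers, so
    # differentiable mode falls back to the monolith solver; and the kernel
    # uses qd.simt.block.sync(), which is unsupported on CPU/Metal, so only
    # the CUDA backend qualifies. The fix is `==` here; `!=` wrongly
    # selected this solver on CPU/Metal.
    return (
        not static_rigid_sim_config.requires_grad
        and static_rigid_sim_config.backend == gs.cuda
    )
```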
@github-actions

⚠️ Abnormal Benchmark Result Detected ➡️ Report

Replace serial loop over _B batch elements (single thread reading
4096 booleans sequentially) with parallel threads that each check
one improved[i_b] and store 1 into a shared flag.

Adds early_exit_flag as a V(dtype=qd.i32) field (not ndarray) to
ConstraintState so it can be accessed in Quadrants scope via the
struct directly, unlike graph_counter which must be ndarray for
graph_do_while.

dex_hand bs=4096 decomposed: 8,399 → 13,458 FPS (+60%)
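A sketch of that parallel variant (loop and field-access syntax assumed; note the plain store, which the next commits first make atomic and then revert entirely):

```python
@qd.func
def _func_check_early_exit(constraint_state, graph_counter, n_batch: int):
    constraint_state.early_exit_flag = 0          # V(dtype=qd.i32) field, not ndarray
    for i_b in range(n_batch):                    # parallel: one thread per batch element
        if constraint_state.improved[i_b] != 0:
            constraint_state.early_exit_flag = 1  # racy plain store (see next commit)
    graph_counter[0] -= 1
    if constraint_state.early_exit_flag == 0:     # no element improved: stop iterating
        graph_counter[0] = 0
```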
@hughperkins marked this pull request as ready for review March 30, 2026 02:33
pyproject.toml pins quadrants==0.5.0b1, which uv refuses to install without --prerelease=allow.
The --prerelease=allow flag was pulling in dev/rc versions of pyglet
(3.0.dev2), numba (0.65.0rc1), llvmlite (0.47.0rc1), pydantic
(2.13.0b2), and others. pyglet 3.0.dev2 is the most likely cause of
the EGL_BAD_ALLOC crashes during test setup.

The flag is unnecessary: uv already installs exact-pinned pre-releases
like quadrants==0.5.0b1 without it.
The parallel for loop in _func_check_early_exit writes to the same
scalar from multiple threads. Use qd.atomic_max instead of plain
assignment to avoid a data race.
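In sketch form, the one-line fix (same assumed names as above):

```python
# Before (racy): many threads plain-store to the same scalar
constraint_state.early_exit_flag = 1
# After: atomic read-modify-write gives the concurrent writes defined behavior
qd.atomic_max(constraint_state.early_exit_flag, 1)
```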
The parallel early exit check introduced a separate kernel and
early_exit_flag field. Revert to the original serial loop inside
_func_check_early_exit that checks improved[i_b] sequentially.

Made-with: Cursor
@duburcqa (Collaborator)

current decomposed Hessian kernel only works with shared memory, and so decomp on main is set to only run for CUDA

This is not a good reason. Shared memory is already perfectly supported on Apple Metal, Vulkan, and AMDGPU. What is not supported is atomics on shared memory, and non-32-bit dtypes.

Comment thread on pyproject.toml (outdated)

dependencies = [
    "psutil",
-   "quadrants==0.4.5",
+   "quadrants==0.5.0b1",
Collaborator

Beta.

duburcqa previously approved these changes Mar 30, 2026
constraint_state,
rigid_global_info,
static_rigid_sim_config,
_n_iterations: not static_rigid_sim_config.requires_grad and static_rigid_sim_config.backend == gs.cuda,
Collaborator

Let's see #2623

Collaborator

Merged.

Collaborator Author

Got it. Will take a look.

@hughperkins (Collaborator Author)

current decomposed Hessian kernel only works with shared memory, and so decomp on main is set to only run for CUDA

This is not a good reason. Shared memory is already perfectly supported on Apple Metal, Vulkan, and AMDGPU. What is not supported is atomics on shared memory, and non-32-bit dtypes.

Either way:

  • the tiled Hessian won't run on Mac or CPU
  • the goal of this PR is not to turn on decomp on new platforms, but to migrate decomp to use the GPU graph

@duburcqa (Collaborator)

the tiled Hessian won't run on Mac or CPU

Why? It was always supported on Mac.

@duburcqa (Collaborator)

the goal of this PR is not to turn on decomp on new platforms, but to migrate decomp to use the GPU graph

Sure. But then update the documentation accordingly, instead of mentioning shared-memory-related issues.

@hughperkins (Collaborator Author)

Note: the failures are pending a new pre-release of quadrants that addresses https://github.com/Genesis-Embodied-AI/quadrants/pulls

Resolve conflict in solver_breakdown.py: keep cuda-graph branch's
GPU graph functions and cuda-specific is_compatible check.
@github-actions

⚠️ Abnormal Benchmark Result Detected ➡️ Report

@hughperkins (Collaborator Author)

Woo, +50% FPS on dex_hand 🙌

@duburcqa merged commit 788bfd5 into Genesis-Embodied-AI:main Mar 31, 2026
22 of 23 checks passed
@github-actions

⚠️ Abnormal Benchmark Result Detected ➡️ Report

@hughperkins deleted the hp/cuda-graph branch May 3, 2026 16:22