[MISC] Add GPU graph to decomposed solver to reduce kernel launch latency. #2621
duburcqa merged 27 commits into Genesis-Embodied-AI:main
Conversation
Aligns with the convention used by all other files in the constraint solver package.
Scalar i32 ndarray used by graph_do_while for GPU-side iteration control in the decomposed constraint solver.
graph_do_while requires the same physical ndarray on every call, so this must not use V_ANNOTATION, which could resolve to qd.template.
Replace the Python iteration loop and separate @qd.kernel calls with a single @qd.kernel(cuda_graph=True) that uses graph_do_while for GPU-side iteration. Each solver step is now a @qd.func whose top-level for-loops become separate nodes in the CUDA graph. A new _func_check_early_exit decrements the counter and sets it to 0 when all batch elements have converged, matching the monolith's early-exit behavior.
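The loop restructuring can be modeled in plain Python. This is a minimal sketch of the control flow only: `run_graph_do_while`, `step_fns`, and `all_converged` are hypothetical names standing in for the cuda_graph kernel, the per-step @qd.func nodes, and the convergence check.

```python
def run_graph_do_while(max_iterations, step_fns, all_converged):
    """Do-while semantics of graph_do_while: the body always runs at
    least once; the counter is decremented after each pass and forced
    to 0 once every batch element has converged (early exit)."""
    counter = max_iterations   # scalar i32 ndarray in the real kernel
    body_runs = 0
    while True:
        for step in step_fns:  # each step: one node in the CUDA graph
            step()
        body_runs += 1
        counter -= 1
        if all_converged():
            counter = 0        # models _func_check_early_exit
        if counter <= 0:       # condition checked after the body
            break
    return body_runs

# Toy solver that converges after 3 passes despite a 10-pass budget.
calls = {"n": 0}
def step():
    calls["n"] += 1
def converged():
    return calls["n"] >= 3

assert run_graph_do_while(10, [step], converged) == 3
```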
Better reflects the do-while condition semantics of graph_do_while.
CUDA graph does not support autograd (ndarray args with non-null gradient pointers are rejected). Fall back to the monolith solver in differentiable mode.
The is_compatible condition had != instead of ==, causing the CUDA-graph decomposed solver to be selected on CPU/Metal, where qd.simt.block.sync() is unsupported.
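A standalone model of the fix (the condition follows the comment above; `is_compatible` here is an illustrative sketch, not the actual method signature):

```python
CUDA, CPU, METAL = "cuda", "cpu", "metal"

def is_compatible(requires_grad, backend):
    # The CUDA-graph decomposed solver is valid only when autograd is
    # off (CUDA graph rejects ndarray args with gradient pointers) and
    # the backend is CUDA (qd.simt.block.sync() is CUDA-only).
    # The bug was `backend != CUDA`, selecting it on CPU/Metal instead.
    return (not requires_grad) and backend == CUDA

assert is_compatible(False, CUDA)
assert not is_compatible(True, CUDA)    # differentiable: monolith fallback
assert not is_compatible(False, CPU)
assert not is_compatible(False, METAL)
```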
Replace the serial loop over _B batch elements (a single thread reading 4096 booleans sequentially) with parallel threads that each check one improved[i_b] and store 1 into a shared flag. Adds early_exit_flag as a V(dtype=qd.i32) field (not an ndarray) to ConstraintState so it can be accessed in Quadrants scope via the struct directly, unlike graph_counter, which must be an ndarray for graph_do_while.

dex_hand bs=4096 decomposed: 8,399 → 13,458 FPS (+60%)
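The two schemes compute the same flag; a NumPy sketch of the equivalence (the GPU details — shared flag, one thread per batch element — are modeled here as a vectorized reduction, and the function names are illustrative):

```python
import numpy as np

def any_improved_serial(improved):
    # Original: a single thread walks all _B booleans sequentially.
    for i_b in range(improved.shape[0]):
        if improved[i_b]:
            return 1
    return 0

def any_improved_parallel(improved):
    # New scheme: each thread checks one improved[i_b] and stores 1
    # into a shared flag; modeled here as a single reduction.
    return int(np.any(improved))

improved = np.zeros(4096, dtype=bool)
improved[1234] = True
assert any_improved_serial(improved) == any_improved_parallel(improved) == 1
improved[:] = False
assert any_improved_serial(improved) == any_improved_parallel(improved) == 0
```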
pyproject.toml pins quadrants==0.5.0b1, which uv refuses to install without --prerelease=allow.
Made-with: Cursor
The --prerelease=allow flag was pulling in dev/rc versions of pyglet (3.0.dev2), numba (0.65.0rc1), llvmlite (0.47.0rc1), pydantic (2.13.0b2), and others. pyglet 3.0.dev2 is the most likely cause of the EGL_BAD_ALLOC crashes during test setup. The flag is unnecessary: uv already installs exact-pinned pre-releases like quadrants==0.5.0b1 without it.
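A minimal sketch of the resulting dependency pin (fragment only; surrounding pyproject.toml content is illustrative):

```toml
[project]
dependencies = [
    "psutil",
    # Exact pre-release pin: uv installs this without --prerelease=allow,
    # so the global flag (which also pulled in dev/rc versions of pyglet,
    # numba, llvmlite, pydantic) can be dropped.
    "quadrants==0.5.0b1",
]
```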
The parallel for loop in _func_check_early_exit writes to the same scalar from multiple threads. Use qd.atomic_max instead of plain assignment to avoid a data race.
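Why atomic_max is safe here: max is commutative, associative, and idempotent, so racing updates commute. A small Python model (illustrative names; it enumerates all thread interleavings of the shared-flag update):

```python
import itertools

def final_flag(values, order):
    # Each "thread" i applies atomic_max(flag, values[i]). Because max
    # is commutative, associative, and idempotent, every interleaving
    # of the racing updates produces the same final flag value,
    # unlike last-writer-wins plain stores.
    flag = 0
    for i in order:
        flag = max(flag, values[i])
    return flag

values = [1, 0, 1]  # per-thread contributions to the shared flag
results = {final_flag(values, order)
           for order in itertools.permutations(range(len(values)))}
assert results == {1}  # identical outcome for all thread orderings
```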
The parallel early exit check introduced a separate kernel and early_exit_flag field. Revert to the original serial loop inside _func_check_early_exit that checks improved[i_b] sequentially. Made-with: Cursor
This is not a good reason. Shared memory is already perfectly supported on Apple Metal, Vulkan, and AMDGPU. What is not supported is atomics on shared memory, and non-32-bit dtypes.
  dependencies = [
      "psutil",
-     "quadrants==0.4.5",
+     "quadrants==0.5.0b1",
  constraint_state,
  rigid_global_info,
  static_rigid_sim_config,
  _n_iterations: not static_rigid_sim_config.requires_grad and static_rigid_sim_config.backend == gs.cuda,
Got it. Will take a look.
Either way:
Why? It was always supported on Mac.
Sure. But then update the documentation accordingly, instead of mentioning shared-memory-related issues.
Note: the failures are pending a new pre-release of quadrants that addresses https://github.com/Genesis-Embodied-AI/quadrants/pulls |
This reverts commit 430140d.
Resolve conflict in solver_breakdown.py: keep cuda-graph branch's GPU graph functions and cuda-specific is_compatible check.
Woo, +50% FPS on dex_hand 🙌
Description
Add GPU graph to decomposed solver to reduce kernel launch latency.
After this PR, the decomposed solver always uses the GPU graph, with one exception: when autodiff is needed.
Related Issue
Resolves Genesis-Embodied-AI/Genesis#
Motivation and Context
How Has This Been / Can This Be Tested?
Screenshots (if appropriate):
Checklist:
Submitting Code Changes section of the CONTRIBUTING document.