
[PERF IMPROVEMENT] enable decomposed solver on AMDGPU + route gradient update through tiled dispatcher (stacks on #15)#16

Closed
lohiaj wants to merge 2 commits into ROCm:amd-integration from lohiaj:perf/jlohia/decomp-on-amd

Conversation


@lohiaj lohiaj commented Apr 23, 2026

Summary

Stacks on top of Genesis-Embodied-AI#15. Two small changes to the decomposed solver path:

  1. Enable the decomposed variant for AMDGPU (mirrors upstream Genesis-Embodied-AI/Genesis#2623, which was cherry-picked only to main, never to amd-integration). `perf_dispatch` will now time both the monolith and the decomposed variant at warmup and pick whichever is faster for each env-count geometry.

  2. Replace the decomposed `_kernel_update_gradient`'s per-env serial loop (`for i_b in range(_B): func_update_gradient_batch(i_b, ...)`) with a single call to the `func_update_gradient` dispatcher. On AMDGPU, that dispatcher routes to `func_update_gradient_tiled`, which (after Genesis-Embodied-AI#15) uses the wave-cooperative `func_solve_mass_tiled` for the LDL^T back-solve.

Without Genesis-Embodied-AI#15 merged first, the second change is a no-op perf-wise (the dispatcher falls through to the same batched code). Please merge Genesis-Embodied-AI#15 first.
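The warmup-time selection described in point 1 can be sketched as follows. This is an illustrative model only; `pick_faster` and its signature are hypothetical stand-ins, not the actual `perf_dispatch` API:

```python
import time

def pick_faster(variants, run_args, warmup_iters=3):
    """Return the fastest callable from `variants` (name -> callable),
    e.g. {"monolith": ..., "decomposed": ...}, timed at warmup.
    Hypothetical sketch; the real perf_dispatch is more involved."""
    best_name, best_time = None, float("inf")
    for name, fn in variants.items():
        fn(*run_args)                      # one untimed call (JIT warmup, caches)
        start = time.perf_counter()
        for _ in range(warmup_iters):
            fn(*run_args)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return variants[best_name]
```

Because the loser is only ever run during warmup, a variant that regresses on some env-count geometry costs a few warmup iterations, not steady-state throughput.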

lohiaj added 2 commits April 23, 2026 09:15
…ss-matrix back-solve

Adds `func_solve_mass_tiled` in abd/forward_dynamics.py: a wave-cooperative
LDL^T back-solve for the CG solver's preconditioner step (M @ Mgrad = grad).

Structure mirrors the existing `func_cholesky_solve_tiled` (Newton path,
LL^T). One wavefront (BLOCK_DIM=64) per (entity, env). LDS caches the
entity-local lower-triangle of `mass_mat_L`, `mass_mat_D_inv`, and the
working vector. Three phases (L^T solve, D^-1 scale, L solve); inner
dot products reduced across lanes via warp shuffle on CUDA or an LDS
partial-sum buffer on AMDGPU (same pattern as the Newton variant).
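As a serial reference for what the three phases compute, here is a minimal sketch of an LDL^T back-solve over plain Python lists, using the standard M = L·D·L^T convention (unit-lower `L`, `D_inv` = 1/diag(D)). This is a reference model only; the tiled kernel's storage layout and phase order follow its own convention:

```python
def ldlt_solve(L, D_inv, b):
    """Solve (L @ D @ L^T) x = b serially, as a reference for the
    wave-cooperative kernel. L is unit-lower-triangular (n x n lists),
    D_inv is the elementwise inverse of the diagonal D."""
    n = len(b)
    y = list(b)
    for i in range(n):                       # phase: forward solve L y = b
        for j in range(i):
            y[i] -= L[i][j] * y[j]
    z = [D_inv[i] * y[i] for i in range(n)]  # phase: diagonal scale z = D^-1 y
    x = list(z)
    for i in range(n - 1, -1, -1):           # phase: backward solve L^T x = z
        for j in range(i + 1, n):
            x[i] -= L[j][i] * x[j]
    return x
```

The inner dot products in the two triangular phases are exactly the reductions the kernel spreads across lanes (warp shuffle on CUDA, LDS partial sums on AMDGPU).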

Wired into `func_update_gradient_tiled`'s CG branch, replacing the
per-env serial `for i_b in range(_B): func_solve_mass_batch(...)` loop.
Gated on `static_rigid_sim_config.enable_tiled_cholesky_hessian` (already
auto-enabled for n_dofs >= 16, lds budget OK, n_envs <= 16384).

The Newton branch is unchanged.

Correctness (256 envs x 50 steps FP32, differential test vs baseline):
  q_mean/q_std/q_min/q_max: identical
  v_mean/v_std/v_min/v_max: 1-2e-6 drift (FP32 noise floor)
  n_contacts_sum: 2048 == 2048 (identical)
  500-step run completes without NaN at 8192 envs

Performance (MI300X, 8192 envs x 500 steps x FP32, quadrants a732b474,
5-trial interleaved stash/pop):
  baseline: 535,261 env*steps/s  (median of 534k-547k)
  this PR:  571,722 env*steps/s  (median of 570k-577k)
  delta:    +6.81% median (every paired trial positive)

Scaling sweep at 100 steps (n_envs in 256/1024/4096/8192): all clean,
no regressions.

Scope: 1 new @qd.func (~150 LoC) + 10 lines replaced in
func_update_gradient_tiled CG branch. No changes to the monolith,
sparse_solve path, or Newton path.

Projected pipeline impact: at pipeline build ROCm#54 baseline of 562,670
(70.84% H100), +6.81% relative would land at ~600k = 75.6% H100.
…e through tiled dispatcher

Two small changes to the decomposed solver path:

1. Enable the decomposed variant for the AMDGPU backend (mirrors upstream
   PR Genesis-Embodied-AI#2623, which was cherry-picked only to main).
   `perf_dispatch` will now time both monolith and decomposed at warmup
   and pick whichever is faster per env-count.

2. Replace the decomposed `_kernel_update_gradient`'s per-env serial loop
   (`for i_b in range(_B): func_update_gradient_batch(i_b, ...)`) with a
   single call to the `func_update_gradient` dispatcher, which routes to
   `func_update_gradient_tiled` on AMDGPU. That in turn uses the
   wave-cooperative `func_solve_mass_tiled` (from the preceding PR),
   replacing the serial LDL^T back-solve with a BLOCK_DIM=64 cooperative
   one.

This change depends on the preceding PR ("wave-cooperative LDL^T
back-solve") to be meaningful; without it the dispatcher just falls
through to the same batched code.

Correctness (256 envs x 50 steps FP32, differential vs pristine):
  q_mean/q_std/q_min/q_max: identical
  v_mean/v_std/v_min/v_max: 1-2e-6 drift (FP32 noise floor)
  n_contacts_sum: 2048 == 2048 (identical)

Performance (MI300X, 8192 envs x 500 steps x FP32, 3-trial interleaved
on top of PRs Genesis-Embodied-AI#14 and Genesis-Embodied-AI#15 stacked):
  baseline (those PRs only): 560-577k env*steps/s
  this PR:                   576-579k env*steps/s
  median delta:              +3.0%

Behavior at smaller n_envs is governed by `perf_dispatch` timing trials
and is workload-dependent; the change is safe-by-default because
`perf_dispatch` picks the faster of monolith/decomposed.

Caveats:
- The rewritten `_kernel_update_gradient` no longer short-circuits on
  `improved[i_b]` (the dispatcher processes all envs). For G1-on-plane
  most envs stay `improved=True` for most of the solve, so the
  saved-compute benefit was small. If a workload has many envs
  converging early, this could be a small regression; `perf_dispatch`
  will then prefer the monolith.
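The lost short-circuit can be modeled with a toy contrast (the function names and bodies below are hypothetical stand-ins for the real kernels, counting work items instead of doing gradient math):

```python
def update_gradient_serial(B, improved):
    """Old shape: per-env loop that skips envs whose line search
    already failed (improved[i_b] == False)."""
    work = 0
    for i_b in range(B):
        if not improved[i_b]:
            continue              # short-circuit: no work for this env
        work += 1                 # stand-in for func_update_gradient_batch(i_b, ...)
    return work

def update_gradient_dispatched(B, improved):
    """New shape: the tiled dispatcher processes every env."""
    return B                      # stand-in for one func_update_gradient call
```

For G1-on-plane, `improved` stays mostly True, so the extra work in the dispatched shape is small; a workload where most envs converge early widens the gap, and `perf_dispatch` falls back to the monolith.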
@lohiaj lohiaj closed this Apr 23, 2026
@lohiaj lohiaj deleted the perf/jlohia/decomp-on-amd branch April 23, 2026 11:47
@ROCm ROCm deleted a comment from lohiaj Apr 23, 2026
@yaoliu13
Collaborator

Closing per our own measurement. The initial +3% claim was a methodology error: `git stash push` on a committed file is a no-op, so my A/B runs were comparing identical code. Re-measured cleanly with commit-toggle (6 trials each): median Δ = -0.38%, well within noise. No net benefit at the customer target.
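For reference, the no-op is easy to reproduce in a scratch repo (file name and commit message here are illustrative):

```shell
git init -q stash-demo && cd stash-demo
echo "variant A" > solver.py
git add solver.py
git -c user.email=a@b.c -c user.name=a commit -qm "commit under test"
# solver.py now matches HEAD, so there is nothing to stash:
git stash push -- solver.py    # "No local changes to save"
git stash list                 # empty: both "A" and "B" runs saw identical code
```

`git stash` only saves uncommitted modifications; toggling between committed revisions needs `git checkout <rev>` (or commit-toggle as used in the re-measurement).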

The change might still help at 4K envs via perf_dispatch, but that needs re-verification with the same corrected methodology. Keeping the idea parked in case quadrants ROCm/quadrants#9 lands, at which point decomposed becomes genuinely faster across the board and this wiring is back in the money.

