
[PERF IMPROVEMENT] enable decomposed solver on AMDGPU + route gradient update through tiled dispatcher (stacks on #15)#16

Closed
lohiaj wants to merge 2 commits into ROCm:amd-integration from lohiaj:perf/jlohia/decomp-on-amd

Conversation


@lohiaj lohiaj commented Apr 23, 2026

Summary

Stacks on top of Genesis-Embodied-AI#15. Two small changes to the decomposed solver path:

  1. Enable the decomposed variant for AMDGPU (mirrors upstream Genesis-Embodied-AI/Genesis#2623, which was cherry-picked only to main, never to amd-integration). `perf_dispatch` will now time both the monolith and the decomposed variant at warmup and pick whichever is faster for each env-count geometry.

  2. Replace the decomposed `_kernel_update_gradient`'s per-env serial loop (`for i_b in range(_B): func_update_gradient_batch(i_b, ...)`) with a single call to the `func_update_gradient` dispatcher. On AMDGPU, that dispatcher routes to `func_update_gradient_tiled`, which (after Genesis-Embodied-AI#15) uses the wave-cooperative `func_solve_mass_tiled` for the LDL^T back-solve.

Without Genesis-Embodied-AI#15 merged first, the second change is a no-op perf-wise (the dispatcher falls through to the same batched code). Please merge Genesis-Embodied-AI#15 first.
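The warmup-time selection described in point 1 can be sketched as follows. This is an illustrative model only; `pick_faster` and its signature are hypothetical stand-ins, not the actual `perf_dispatch` API:

```python
import time

def pick_faster(variants, run_args, warmup_iters=3):
    """Return the fastest callable from `variants` (name -> callable),
    e.g. {"monolith": ..., "decomposed": ...}, timed at warmup.
    Hypothetical sketch; the real perf_dispatch is more involved."""
    best_name, best_time = None, float("inf")
    for name, fn in variants.items():
        fn(*run_args)                      # one untimed call (JIT warmup, caches)
        start = time.perf_counter()
        for _ in range(warmup_iters):
            fn(*run_args)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return variants[best_name]
```

Because the loser is only ever run during warmup, a variant that regresses on some env-count geometry costs a few warmup iterations, not steady-state throughput.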

lohiaj added 2 commits April 23, 2026 09:15
…ss-matrix back-solve

Adds `func_solve_mass_tiled` in abd/forward_dynamics.py: a wave-cooperative
LDL^T back-solve for the CG solver's preconditioner step (M @ Mgrad = grad).

Structure mirrors the existing `func_cholesky_solve_tiled` (Newton path,
LL^T). One wavefront (BLOCK_DIM=64) per (entity, env). LDS caches the
entity-local lower-triangle of `mass_mat_L`, `mass_mat_D_inv`, and the
working vector. Three phases (L^T solve, D^-1 scale, L solve); inner
dot products reduced across lanes via warp shuffle on CUDA or an LDS
partial-sum buffer on AMDGPU (same pattern as the Newton variant).
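As a serial reference for what the three phases compute, here is a minimal sketch of an LDL^T back-solve over plain Python lists, using the standard M = L·D·L^T convention (unit-lower `L`, `D_inv` = 1/diag(D)). This is a reference model only; the tiled kernel's storage layout and phase order follow its own convention:

```python
def ldlt_solve(L, D_inv, b):
    """Solve (L @ D @ L^T) x = b serially, as a reference for the
    wave-cooperative kernel. L is unit-lower-triangular (n x n lists),
    D_inv is the elementwise inverse of the diagonal D."""
    n = len(b)
    y = list(b)
    for i in range(n):                       # phase: forward solve L y = b
        for j in range(i):
            y[i] -= L[i][j] * y[j]
    z = [D_inv[i] * y[i] for i in range(n)]  # phase: diagonal scale z = D^-1 y
    x = list(z)
    for i in range(n - 1, -1, -1):           # phase: backward solve L^T x = z
        for j in range(i + 1, n):
            x[i] -= L[j][i] * x[j]
    return x
```

The inner dot products in the two triangular phases are exactly the reductions the kernel spreads across lanes (warp shuffle on CUDA, LDS partial sums on AMDGPU).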

Wired into `func_update_gradient_tiled`'s CG branch, replacing the
per-env serial `for i_b in range(_B): func_solve_mass_batch(...)` loop.
Gated on `static_rigid_sim_config.enable_tiled_cholesky_hessian` (already
auto-enabled for n_dofs >= 16, lds budget OK, n_envs <= 16384).

The Newton branch is unchanged.

Correctness (256 envs x 50 steps FP32, differential test vs baseline):
  q_mean/q_std/q_min/q_max: identical
  v_mean/v_std/v_min/v_max: 1-2e-6 drift (FP32 noise floor)
  n_contacts_sum: 2048 == 2048 (identical)
  500-step run completes without NaN at 8192 envs

Performance (MI300X, 8192 envs x 500 steps x FP32, quadrants a732b474,
5-trial interleaved stash/pop):
  baseline: 535,261 env*steps/s  (median of 534k-547k)
  this PR:  571,722 env*steps/s  (median of 570k-577k)
  delta:    +6.81% median (every paired trial positive)

Scaling sweep at 100 steps (n_envs in 256/1024/4096/8192): all clean,
no regressions.

Scope: 1 new @qd.func (~150 LoC) + 10 lines replaced in
func_update_gradient_tiled CG branch. No changes to the monolith,
sparse_solve path, or Newton path.

Projected pipeline impact: at pipeline build ROCm#54 baseline of 562,670
(70.84% H100), +6.81% relative would land at ~600k = 75.6% H100.
…e through tiled dispatcher

Two small changes to the decomposed solver path:

1. Enable the decomposed variant for the AMDGPU backend (mirrors upstream
   PR Genesis-Embodied-AI#2623, which was cherry-picked only to main).
   `perf_dispatch` will now time both monolith and decomposed at warmup
   and pick whichever is faster per env-count.

2. Replace the decomposed `_kernel_update_gradient`'s per-env serial loop
   (`for i_b in range(_B): func_update_gradient_batch(i_b, ...)`) with a
   single call to the `func_update_gradient` dispatcher, which routes to
   `func_update_gradient_tiled` on AMDGPU. That in turn uses the
   wave-cooperative `func_solve_mass_tiled` (from the preceding PR),
   replacing the serial LDL^T back-solve with a BLOCK_DIM=64 cooperative
   one.

This change depends on the preceding PR ("wave-cooperative LDL^T
back-solve") to be meaningful; without it the dispatcher just falls
through to the same batched code.

Correctness (256 envs x 50 steps FP32, differential vs pristine):
  q_mean/q_std/q_min/q_max: identical
  v_mean/v_std/v_min/v_max: 1-2e-6 drift (FP32 noise floor)
  n_contacts_sum: 2048 == 2048 (identical)

Performance (MI300X, 8192 envs x 500 steps x FP32, 3-trial interleaved
on top of PRs Genesis-Embodied-AI#14 and Genesis-Embodied-AI#15 stacked):
  baseline (those PRs only): 560-577k env*steps/s
  this PR:                   576-579k env*steps/s
  median delta:              +3.0%

Behavior at smaller n_envs is governed by `perf_dispatch` timing trials
and is workload-dependent; the change is safe-by-default because
`perf_dispatch` picks the faster of monolith/decomposed.

Caveats:
- The rewritten `_kernel_update_gradient` no longer short-circuits on
  `improved[i_b]` (the dispatcher processes all envs). For G1-on-plane
  most envs stay `improved=True` for most of the solve, so the
  saved-compute benefit was small. If a workload has many envs
  converging early, this could be a small regression; `perf_dispatch`
  will then prefer the monolith.
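The lost short-circuit can be modeled with a toy contrast (the function names and bodies below are hypothetical stand-ins for the real kernels, counting work items instead of doing gradient math):

```python
def update_gradient_serial(B, improved):
    """Old shape: per-env loop that skips envs whose line search
    already failed (improved[i_b] == False)."""
    work = 0
    for i_b in range(B):
        if not improved[i_b]:
            continue              # short-circuit: no work for this env
        work += 1                 # stand-in for func_update_gradient_batch(i_b, ...)
    return work

def update_gradient_dispatched(B, improved):
    """New shape: the tiled dispatcher processes every env."""
    return B                      # stand-in for one func_update_gradient call
```

For G1-on-plane, `improved` stays mostly True, so the extra work in the dispatched shape is small; a workload where most envs converge early widens the gap, and `perf_dispatch` falls back to the monolith.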
@lohiaj lohiaj closed this Apr 23, 2026
@lohiaj lohiaj deleted the perf/jlohia/decomp-on-amd branch April 23, 2026 11:47
@ROCm ROCm deleted a comment from lohiaj Apr 23, 2026
@yaoliu13
Collaborator

Closing per our own measurement. The initial +3% claim was a methodology error: `git stash push` on a committed file is a no-op, so my A/B runs were comparing identical code. Re-measured cleanly with commit-toggle (6 trials each): median Δ = -0.38%, well within noise. No net benefit at the customer target.
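For reference, the no-op is easy to reproduce in a scratch repo (file name and commit message here are illustrative):

```shell
git init -q stash-demo && cd stash-demo
echo "variant A" > solver.py
git add solver.py
git -c user.email=a@b.c -c user.name=a commit -qm "commit under test"
# solver.py now matches HEAD, so there is nothing to stash:
git stash push -- solver.py    # "No local changes to save"
git stash list                 # empty: both "A" and "B" runs saw identical code
```

`git stash` only saves uncommitted modifications; toggling between committed revisions needs `git checkout <rev>` (or commit-toggle as used in the re-measurement).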

The change might still help at 4K envs via perf_dispatch, but that needs re-verification with the same corrected methodology. Keeping the idea parked in case quadrants ROCm/quadrants#9 lands, at which point decomposed becomes genuinely faster across the board and this wiring is back in the money.

