[PERF IMPROVEMENT] enable decomposed solver on AMDGPU + route gradient update through tiled dispatcher (stacks on #15)#16
Closed
lohiaj wants to merge 2 commits into ROCm:amd-integration from
Conversation
…ss-matrix back-solve

Adds `func_solve_mass_tiled` in abd/forward_dynamics.py: a wave-cooperative LDL^T back-solve for the CG solver's preconditioner step (M @ Mgrad = grad). The structure mirrors the existing `func_cholesky_solve_tiled` (Newton path, LL^T): one wavefront (BLOCK_DIM=64) per (entity, env). LDS caches the entity-local lower triangle of `mass_mat_L`, `mass_mat_D_inv`, and the working vector. The solve runs in three phases (L^T solve, D^-1 scale, L solve); inner dot products are reduced across lanes via warp shuffle on CUDA, or an LDS partial-sum buffer on AMDGPU (the same pattern as the Newton variant).

Wired into `func_update_gradient_tiled`'s CG branch, replacing the per-env serial `for i_b in range(_B): func_solve_mass_batch(...)` loop. Gated on `static_rigid_sim_config.enable_tiled_cholesky_hessian` (already auto-enabled for n_dofs >= 16, LDS budget OK, n_envs <= 16384). The Newton branch is unchanged.

Correctness (256 envs x 50 steps FP32, differential test vs. baseline):
- q_mean/q_std/q_min/q_max: identical
- v_mean/v_std/v_min/v_max: 1-2e-6 drift (FP32 noise floor)
- n_contacts_sum: 2048 == 2048 (identical)
- 500-step run completes without NaN at 8192 envs

Performance (MI300X, 8192 envs x 500 steps x FP32, quadrants a732b474, 5-trial interleaved stash/pop):
- baseline: 535,261 env*steps/s (median of 534k-547k)
- this PR: 571,722 env*steps/s (median of 570k-577k)
- delta: +6.81% median (every paired trial positive)

Scaling sweep at 100 steps (n_envs in 256/1024/4096/8192): all clean, no regressions.

Scope: 1 new @qd.func (~150 LoC) + 10 lines replaced in `func_update_gradient_tiled`'s CG branch. No changes to the monolith, sparse_solve path, or Newton path.

Projected pipeline impact: at the pipeline build ROCm#54 baseline of 562,670 env*steps/s (70.84% of H100), +6.81% relative would land at ~600k = 75.6% of H100.
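For reference, the three-phase structure of an LDL^T back-solve can be sketched serially in NumPy. This is not the kernel from the PR (which is wave-cooperative with LDS caching); it is a minimal serial reference, assuming unit-lower-triangular `L` and the standard phase order forward-solve, diagonal scale, backward-solve (the kernel's own storage layout and phase ordering may differ):

```python
import numpy as np

def solve_mass_ldlt_ref(L, D_inv, b):
    """Serial reference for M @ x = b with M = L @ D @ L.T,
    L unit-lower-triangular, D_inv = 1/diag(D). Hypothetical
    stand-in for the cooperative func_solve_mass_tiled."""
    n = b.shape[0]
    # Phase 1: forward substitution, solve L @ y = b
    y = b.astype(float).copy()
    for i in range(n):
        y[i] -= L[i, :i] @ y[:i]
    # Phase 2: diagonal scale, z = D^-1 @ y
    z = y * D_inv
    # Phase 3: backward substitution, solve L.T @ x = z
    x = z.copy()
    for i in range(n - 1, -1, -1):
        x[i] -= L[i + 1:, i] @ x[i + 1:]
    return x
```

In the tiled kernel each inner dot product above is what gets reduced across lanes (warp shuffle on CUDA, LDS partial sums on AMDGPU).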
…e through tiled dispatcher

Two small changes to the decomposed solver path:

1. Enable the decomposed variant for the AMDGPU backend (mirrors upstream PR Genesis-Embodied-AI#2623, which was cherry-picked only to main). `perf_dispatch` will now time both the monolith and the decomposed variant at warmup and pick whichever is faster per env-count.
2. Replace the decomposed `_kernel_update_gradient`'s per-env serial loop (`for i_b in range(_B): func_update_gradient_batch(i_b, ...)`) with a single call to the `func_update_gradient` dispatcher, which routes to `func_update_gradient_tiled` on AMDGPU. That in turn uses the wave-cooperative `func_solve_mass_tiled` (from the preceding PR), replacing the serial LDL^T back-solve with a BLOCK_DIM=64 cooperative one.

This depends on the preceding PR ("wave-cooperative LDL^T back-solve") to be meaningful; without it the dispatcher just falls through to the same batched code.

Correctness (256 envs x 50 steps FP32, differential vs. pristine):
- q_mean/q_std/q_min/q_max: identical
- v_mean/v_std/v_min/v_max: 1-2e-6 drift (FP32 noise floor)
- n_contacts_sum: 2048 == 2048 (identical)

Performance (MI300X, 8192 envs x 500 steps x FP32, 3-trial interleaved on top of PRs Genesis-Embodied-AI#14 and Genesis-Embodied-AI#15 stacked):
- baseline (those PRs only): 560-577k env*steps/s
- this PR: 576-579k env*steps/s
- median delta: +3.0%

Behavior at smaller n_envs is governed by `perf_dispatch` timing trials and is workload-dependent; the change is safe by default because `perf_dispatch` picks the faster of monolith/decomposed.

Caveats:
- The rewritten `_kernel_update_gradient` no longer short-circuits on `improved[i_b]` (the dispatcher processes all envs). For G1-on-plane most envs stay `improved=True` for most of the solve, so the saved-compute benefit was small. If a workload has many envs converging early, this could be a small regression; `perf_dispatch` will then prefer the monolith.
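The warmup-time selection described above can be sketched as a timing trial over candidate kernels. The name `perf_dispatch` comes from the PR text, but this signature and logic are assumptions, not the repo's implementation:

```python
import time

def perf_dispatch(variants, warmup_args, n_trials=3):
    """Hypothetical sketch: time each candidate kernel variant at warmup
    and return the fastest. `variants` maps a name to a callable; the
    real dispatcher also keys its choice on env-count."""
    best_name, best_t = None, float("inf")
    for name, fn in variants.items():
        fn(*warmup_args)  # one untimed call to absorb compilation cost
        t0 = time.perf_counter()
        for _ in range(n_trials):
            fn(*warmup_args)
        t = (time.perf_counter() - t0) / n_trials
        if t < best_t:
            best_name, best_t = name, t
    return variants[best_name]
```

Under this scheme the wiring is safe by default: if the decomposed path is slower for a given workload, the monolith wins the trial and nothing changes.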
Collaborator
Closing per our own measurement. The initial +3% claim at 8192 envs was a methodology error: `git stash push` on a committed file is a no-op, so my A/B runs were comparing identical code. Re-measured cleanly with a commit-toggle (6 trials each): median Δ = -0.38%, well within noise. No net benefit at the customer target.

The change might still help at 4K envs via `perf_dispatch`, but that needs re-verification with the same corrected methodology. Keeping the idea parked in case quadrants ROCm/quadrants#9 lands, at which point decomposed becomes genuinely faster across the board and this wiring is back in the money.
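The corrected methodology reduces to a paired-trial median delta over commit-toggled runs. A minimal sketch of that statistic (the throughput values below are illustrative, not the measured data):

```python
from statistics import median

def median_pct_delta(baseline, candidate):
    """Median percent delta over paired A/B trials of a throughput
    metric (env*steps/s). Positive means the candidate is faster."""
    deltas = [100.0 * (c - b) / b for b, c in zip(baseline, candidate)]
    return median(deltas)
```

Pairing trials (toggle, run A, run B, repeat) rather than running all of A then all of B keeps slow clock/thermal drift out of the delta, which is why a within-noise result like -0.38% can be trusted.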
Summary
Stacks on top of Genesis-Embodied-AI#15. Two small changes to the decomposed solver path:
1. Enable the decomposed variant for AMDGPU (mirrors upstream Genesis-Embodied-AI/Genesis#2623, which was cherry-picked only to main, never to amd-integration). `perf_dispatch` will now time both monolith and decomposed at warmup and pick whichever is faster for each env-count geometry.
2. Replace the decomposed `_kernel_update_gradient`'s per-env serial loop (`for i_b in range(_B): func_update_gradient_batch(i_b, ...)`) with a single call to the `func_update_gradient` dispatcher. On AMDGPU, that dispatcher routes to `func_update_gradient_tiled`, which (after Genesis-Embodied-AI#15) uses the wave-cooperative `func_solve_mass_tiled` for the LDL^T back-solve.

Without Genesis-Embodied-AI#15 merged first, the second change is a no-op perf-wise (the dispatcher falls through to the same batched code). Please merge Genesis-Embodied-AI#15 first.
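The routing change above can be sketched in plain Python. The function names come from the PR text, but the signatures, the `backend` flag, and the one-line bodies are illustrative stand-ins for the real kernels:

```python
def func_update_gradient_batch(i_b, grads):
    grads[i_b] += 1.0  # stand-in for one env's serial gradient update

def func_update_gradient_tiled(n_envs, grads):
    # stand-in for the wave-cooperative tiled kernel (one wavefront per env)
    for i_b in range(n_envs):
        grads[i_b] += 1.0

def func_update_gradient(n_envs, grads, backend="amdgpu"):
    """Dispatcher sketch: tiled path on AMDGPU, the old per-env serial
    loop as the fallback elsewhere."""
    if backend == "amdgpu":
        func_update_gradient_tiled(n_envs, grads)
    else:
        # old _kernel_update_gradient shape: per-env serial loop
        for i_b in range(n_envs):
            func_update_gradient_batch(i_b, grads)
```

Both routes must produce the same result; only the execution shape differs, which is why the differential test against pristine shows identical q statistics.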