[PERF IMPROVEMENT] func_broad_phase: hoist loop-invariants, cache i_pair, swap-and-pop, gate dead writes by lohiaj · Pull Request #32 · ROCm/Genesis

lohiaj · 2026-04-25T15:10:36Z

Summary

Four orthogonal cleanups in genesis/engine/solvers/rigid/collider/broadphase.py, all in the hot func_broad_phase_*_kernel_2_range_for. They preserve the SAP set-of-overlapping-pairs invariant.

(1) Hoist loop-invariant sort_buffer reads out of the inner per-active sweep loop. The outer i and batch i_b are constant across the inner for j in range(n_active), but the DSL/compiler can't safely CSE through the calls to func_check_collision_valid and func_is_geom_aabbs_overlap, so sort_buffer.is_max[i, i_b] and sort_buffer.i_g[i, i_b] were re-loaded once per inner iteration. Now hoisted into is_max_i and i_g_i once per outer step. Same fix in both the no-hibernation and hibernation branches.
(2) Cache collision_pair_idx[i_ga, i_gb] once per pair. It was loaded once inside func_check_collision_valid (validity check) and again, separately, on the no-overlap fall-through that clears contact_cache.normal: two indirect loads of the same index with a function call in between. Now loaded once before the validity call and threaded through as i_pair; func_check_collision_valid's signature gains the i_pair parameter and drops the now-unused collider_info. All three call sites updated.
(3) Swap-and-pop active-buffer removal. The original linear search + linear shift to preserve insertion order cost O(n_active) per removed geom. SAP only reads active_buffer as a set in the inner pair check, so swap-and-pop preserves correctness and reduces each removal to O(1). The inner for k in range(j, n_active - 1) shift is gone. Applied to all three removal sites (active_buffer, active_buffer_hib, active_buffer_awake).
(4) Gate dead min_buffer_idx / max_buffer_idx writes in the first-time init path on qd.static(use_hibernation). Those two arrays are only read inside hibernation-gated branches, so without hibernation the writes are dead stores; the static gate elides them at compile time.
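
For readers unfamiliar with the swap-and-pop change, here is a minimal plain-Python sketch (not the kernel code). The function names `remove_shift` and `remove_swap_pop` and the flat-list buffer are hypothetical stand-ins; the sketch keeps the linear search in both variants, since the description above only states that the shift loop is removed.

```python
def remove_shift(active, n_active, i_g):
    """Order-preserving removal: find i_g, then shift the tail left by one."""
    for j in range(n_active):
        if active[j] == i_g:
            for k in range(j, n_active - 1):
                active[k] = active[k + 1]
            return n_active - 1
    return n_active  # i_g was not in the live prefix


def remove_swap_pop(active, n_active, i_g):
    """Set-preserving removal: overwrite the slot with the last live entry."""
    for j in range(n_active):
        if active[j] == i_g:
            active[j] = active[n_active - 1]  # single write, no shift loop
            return n_active - 1
    return n_active  # i_g was not in the live prefix


# Fixed-capacity buffers with 4 live entries (hypothetical geom indices).
buf_shift = [4, 7, 1, 9, 0, 0]
buf_swap = list(buf_shift)
n_shift = remove_shift(buf_shift, 4, 7)
n_swap = remove_swap_pop(buf_swap, 4, 7)

# The live prefixes differ in order but hold the same set of geoms.
same_set = set(buf_shift[:n_shift]) == set(buf_swap[:n_swap])
```

Because the inner pair check only iterates over the live prefix and does not depend on its order, the two variants would produce the same set of candidate pairs.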

Correctness

Differential test vs pristine amd-integration HEAD: position and velocity statistics match to within floating-point noise; contact counts identical.
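
A differential check of this kind could be sketched as follows. This is an illustration only, not the harness used for this PR: the dictionary layout, the `summarize` and `runs_match` helpers, and the tolerance values are all assumptions.

```python
import math
import statistics


def summarize(values):
    """Reduce a flat list of floats to a few summary statistics."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "max_abs": max(abs(v) for v in values),
    }


def runs_match(baseline, candidate, rel_tol=1e-5, abs_tol=1e-6):
    """Contact counts must be identical; pos/vel statistics may differ by noise."""
    if baseline["n_contacts_sum"] != candidate["n_contacts_sum"]:
        return False
    for key in ("pos", "vel"):
        ref, new = summarize(baseline[key]), summarize(candidate[key])
        for stat in ref:
            if not math.isclose(ref[stat], new[stat],
                                rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True


# Toy values for illustration only; these are not measurements from the PR.
baseline = {"pos": [0.0, 0.5, 1.0], "vel": [0.1, -0.2, 0.3], "n_contacts_sum": 12}
candidate = {"pos": [0.0, 0.5, 1.0 + 1e-8], "vel": [0.1, -0.2, 0.3], "n_contacts_sum": 12}
ok = runs_match(baseline, candidate)
```

Comparing contact counts exactly while allowing a tolerance on the floating-point statistics mirrors the two criteria stated above.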

Scope

Single file, +40 / -36 = +4 LoC net. No changes outside broadphase.py.

E2E at 8192 envs:

throughput: 709,923.7 env*steps/s
wall_time: 5.770 s
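
As a sanity check on these two figures, the per-environment step count can be backed out, assuming throughput is defined as n_envs * steps / wall_time (the PR does not state the step count or the exact definition):

```python
n_envs = 8192
throughput = 709_923.7   # env*steps/s, as reported
wall_time = 5.770        # seconds, as reported

# Assumed definition: throughput = n_envs * steps_per_env / wall_time
total_env_steps = throughput * wall_time
steps_per_env = total_env_steps / n_envs

print(round(steps_per_env, 2))  # 500.03
```

Under that assumption the numbers are consistent with a run of roughly 500 steps per environment.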

…air, swap-and-pop, gate dead writes

Four orthogonal cleanups in genesis/engine/solvers/rigid/collider/broadphase.py. All four preserve the SAP set-of-overlapping-pairs invariant.

(1) Hoist loop-invariant sort_buffer reads. The sweep loop's outer index `i` and batch index `i_b` are constant across the inner per-active loop, but `sort_buffer.is_max[i, i_b]` and `sort_buffer.i_g[i, i_b]` were re-loaded once per inner iteration (the DSL/compiler can't safely CSE across calls to func_check_collision_valid and func_is_geom_aabbs_overlap). Hoisted into locals `is_max_i` and `i_g_i` once per outer step. Same fix in both the no-hibernation and hibernation branches.

(2) Cache collision_pair_idx[i_ga, i_gb] once per pair. It was loaded once inside func_check_collision_valid and again, separately, on the no-overlap fall-through path that clears contact_cache.normal: two indirect loads of the same value with a function call in between. Now loaded once before the validity call and threaded through as the `i_pair` parameter; func_check_collision_valid's signature drops the now-unused `collider_info` argument. All three call sites updated.

(3) Swap-and-pop in the active-buffer removal paths. The original "linear search + linear shift to preserve insertion order" cost O(n_active) per removed geom; SAP only reads active_buffer as a set (no order dependency in the inner check), so swap-and-pop reduces each removal to O(1) and eliminates the inner `for k in range(j, ...)` shift. Applied to all three removal sites (active_buffer, active_buffer_hib, active_buffer_awake).

(4) Gate the min_buffer_idx / max_buffer_idx writes in the first-time initialization on `qd.static(use_hibernation)`. Those two arrays are only read inside hibernation-gated branches, so without hibernation the writes are dead stores; the static gate elides them at compile time.

Correctness verified via differential test against the pristine HEAD (256 envs x 50 steps, FP32): position/velocity statistics match to within floating-point noise floor; n_contacts_sum identical (2048 == 2048). Scope: single file, +40 / -36 = +4 LoC net.
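
The effect of item (1) in the commit message can be illustrated with a small plain-Python sketch that counts reads. `CountingBuffer` and both `sweep_step_*` functions are hypothetical stand-ins for the kernel's field accesses, not code from broadphase.py, and the batch index is omitted for brevity.

```python
class CountingBuffer:
    """List wrapper that counts element reads (stand-in for a device field)."""

    def __init__(self, data):
        self.data = list(data)
        self.loads = 0

    def __getitem__(self, idx):
        self.loads += 1
        return self.data[idx]


def sweep_step_reload(i_g_buf, i, active):
    """Re-reads i_g_buf[i] on every inner iteration."""
    pairs = []
    for j in range(len(active)):
        i_ga = i_g_buf[i]  # loop-invariant read repeated per iteration
        pairs.append((min(i_ga, active[j]), max(i_ga, active[j])))
    return pairs


def sweep_step_hoisted(i_g_buf, i, active):
    """Reads i_g_buf[i] once into a local before the inner loop."""
    i_g_i = i_g_buf[i]  # hoisted: one read per outer step
    return [(min(i_g_i, i_gb), max(i_g_i, i_gb)) for i_gb in active]


active = [0, 1, 2, 4]
buf_reload = CountingBuffer([3, 5, 8])
buf_hoisted = CountingBuffer([3, 5, 8])
pairs_reload = sweep_step_reload(buf_reload, 1, active)
pairs_hoisted = sweep_step_hoisted(buf_hoisted, 1, active)
```

Both variants produce the same pairs; only the number of buffer reads changes, from one per active entry to one per outer step.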

yaoliu13 · 2026-04-28T00:42:12Z

This PR is not based on the latest amd-integration: perf/jlohia/broadphase-cleanup...ROCm:Genesis:amd-integration

lohiaj · 2026-04-28T05:12:48Z

Closing this in favor of the monolith-focused PR #34. The broadphase cleanup is small and no longer clears the current release-baseline bar for E2E impact.

lohiaj closed this Apr 28, 2026

lohiaj wants to merge 1 commit into amd-integration from perf/jlohia/broadphase-cleanup
