[PERF IMPROVEMENT] func_broad_phase: hoist loop-invariants, cache i_pair, swap-and-pop, gate dead writes#32
Closed
lohiaj wants to merge 1 commit intoamd-integrationfrom
Closed
Conversation
…air, swap-and-pop, gate dead writes
Four orthogonal cleanups in genesis/engine/solvers/rigid/collider/broadphase.py.
All four preserve the SAP set-of-overlapping-pairs invariant.
(1) Hoist loop-invariant sort_buffer reads. The sweep loop's outer index `i`
and batch index `i_b` are constant across the inner per-active loop,
but `sort_buffer.is_max[i, i_b]` and `sort_buffer.i_g[i, i_b]` were
re-loaded once per inner iteration (the DSL/compiler can't safely CSE
across calls to func_check_collision_valid and func_is_geom_aabbs_overlap).
Hoisted into locals `is_max_i` and `i_g_i` once per outer step. Same fix
in both the no-hibernation and hibernation branches.
(2) Cache collision_pair_idx[i_ga, i_gb] once per pair. It was loaded once
inside func_check_collision_valid and again, separately, on the
no-overlap fall-through path that clears contact_cache.normal — two
indirect loads of the same value with a function call in between.
Now loaded once before the validity call and threaded through as the
`i_pair` parameter; func_check_collision_valid's signature drops the
now-unused `collider_info` argument. All three call sites updated.
(3) Swap-and-pop in the active-buffer removal paths. The original
"linear search + linear shift to preserve insertion order" cost
O(n_active) per removed geom; SAP only reads active_buffer as a set
(no order dependency in the inner check), so swap-and-pop reduces
each removal to O(1) and eliminates the inner `for k in range(j, ...)`
shift. Applied to all three removal sites
(active_buffer, active_buffer_hib, active_buffer_awake).
(4) Gate the min_buffer_idx / max_buffer_idx writes in the first-time
initialization on `qd.static(use_hibernation)`. Those two arrays are
only read inside hibernation-gated branches, so without hibernation
the writes are dead stores; the static gate elides them at compile time.
Correctness verified via differential test against the pristine HEAD
(256 envs x 50 steps, FP32): position/velocity statistics match to within
floating-point noise floor; n_contacts_sum identical (2048 == 2048).
Scope: single file, +40 / -36 = +4 LoC net.
Collaborator
|
This PR is not based on the latest amd-integration: perf/jlohia/broadphase-cleanup...ROCm:Genesis:amd-integration |
Author
|
Closing this in favor of the monolith-focused PR #34. The broadphase cleanup is small and no longer clears the current release-baseline bar for E2E impact. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Four orthogonal cleanups in
genesis/engine/solvers/rigid/collider/broadphase.py, all in the hotfunc_broad_phase_*_kernel_2_range_for. They preserve the SAP set-of-overlapping-pairs invariant.Hoist loop-invariant
sort_bufferreads out of the inner per-active sweep loop. The outeriand batchi_bare constant across the innerfor j in range(n_active), but the DSL/compiler can't safely CSE through the calls tofunc_check_collision_validandfunc_is_geom_aabbs_overlap, sosort_buffer.is_max[i, i_b]andsort_buffer.i_g[i, i_b]were re-loaded once per inner iteration. Now hoisted intois_max_iandi_g_ionce per outer step. Same fix in both the no-hibernation and hibernation branches.Cache
collision_pair_idx[i_ga, i_gb]once per pair. It was loaded once insidefunc_check_collision_valid(validity check) and again, separately, on the no-overlap fall-through that clearscontact_cache.normal— two indirect loads of the same index with a function call in between. Now loaded once before the validity call and threaded through asi_pair;func_check_collision_valid's signature gains thei_pairparameter and drops the now-unusedcollider_info. All three call sites updated.Swap-and-pop active-buffer removal. The original linear search + linear shift to preserve insertion order cost O(n_active) per removed geom. SAP only reads
active_bufferas a set in the inner pair check, so swap-and-pop preserves correctness and reduces each removal to O(1). The innerfor k in range(j, n_active - 1)shift is gone. Applied to all three removal sites (active_buffer,active_buffer_hib,active_buffer_awake).Gate dead
min_buffer_idx/max_buffer_idxwrites in the first-time init path onqd.static(use_hibernation). Those two arrays are only read inside hibernation-gated branches, so without hibernation the writes are dead stores; the static gate elides them at compile time.Correctness
Differential test vs pristine
amd-integrationHEAD: position and velocity statistics match to within floating-point noise; contact counts identical.Scope
Single file, +40 / -36 = +4 LoC net. No changes outside
broadphase.py.E2E at 8192 envs:
throughput: 709,923.7 env*steps/s
wall_time: 5.770 s