Skip to content

[PERF IMPROVEMENT] func_broad_phase: hoist loop-invariants, cache i_pair, swap-and-pop, gate dead writes#32

Closed
lohiaj wants to merge 1 commit intoamd-integrationfrom
perf/jlohia/broadphase-cleanup
Closed

[PERF IMPROVEMENT] func_broad_phase: hoist loop-invariants, cache i_pair, swap-and-pop, gate dead writes#32
lohiaj wants to merge 1 commit intoamd-integrationfrom
perf/jlohia/broadphase-cleanup

Conversation

@lohiaj
Copy link
Copy Markdown

@lohiaj lohiaj commented Apr 25, 2026

Summary

Four orthogonal cleanups in genesis/engine/solvers/rigid/collider/broadphase.py, all in the hot func_broad_phase_*_kernel_2_range_for. They preserve the SAP set-of-overlapping-pairs invariant.

  1. Hoist loop-invariant sort_buffer reads out of the inner per-active sweep loop. The outer i and batch i_b are constant across the inner for j in range(n_active), but the DSL/compiler can't safely CSE through the calls to func_check_collision_valid and func_is_geom_aabbs_overlap, so sort_buffer.is_max[i, i_b] and sort_buffer.i_g[i, i_b] were re-loaded once per inner iteration. Now hoisted into is_max_i and i_g_i once per outer step. Same fix in both the no-hibernation and hibernation branches.

  2. Cache collision_pair_idx[i_ga, i_gb] once per pair. It was loaded once inside func_check_collision_valid (validity check) and again, separately, on the no-overlap fall-through that clears contact_cache.normal — two indirect loads of the same index with a function call in between. Now loaded once before the validity call and threaded through as i_pair; func_check_collision_valid's signature gains the i_pair parameter and drops the now-unused collider_info. All three call sites updated.

  3. Swap-and-pop active-buffer removal. The original linear search + linear shift to preserve insertion order cost O(n_active) per removed geom. SAP only reads active_buffer as a set in the inner pair check, so swap-and-pop preserves correctness and reduces each removal to O(1). The inner for k in range(j, n_active - 1) shift is gone. Applied to all three removal sites (active_buffer, active_buffer_hib, active_buffer_awake).

  4. Gate dead min_buffer_idx / max_buffer_idx writes in the first-time init path on qd.static(use_hibernation). Those two arrays are only read inside hibernation-gated branches, so without hibernation the writes are dead stores; the static gate elides them at compile time.

Correctness

Differential test vs pristine amd-integration HEAD: position and velocity statistics match to within floating-point noise; contact counts identical.

Scope

Single file, +40 / -36 = +4 LoC net. No changes outside broadphase.py.

E2E at 8192 envs:

throughput: 709,923.7 env*steps/s
wall_time: 5.770 s

…air, swap-and-pop, gate dead writes

Four orthogonal cleanups in genesis/engine/solvers/rigid/collider/broadphase.py.
All four preserve the SAP set-of-overlapping-pairs invariant.

(1) Hoist loop-invariant sort_buffer reads. The sweep loop's outer index `i`
    and batch index `i_b` are constant across the inner per-active loop,
    but `sort_buffer.is_max[i, i_b]` and `sort_buffer.i_g[i, i_b]` were
    re-loaded once per inner iteration (the DSL/compiler can't safely CSE
    across calls to func_check_collision_valid and func_is_geom_aabbs_overlap).
    Hoisted into locals `is_max_i` and `i_g_i` once per outer step. Same fix
    in both the no-hibernation and hibernation branches.

(2) Cache collision_pair_idx[i_ga, i_gb] once per pair. It was loaded once
    inside func_check_collision_valid and again, separately, on the
    no-overlap fall-through path that clears contact_cache.normal — two
    indirect loads of the same value with a function call in between.
    Now loaded once before the validity call and threaded through as the
    `i_pair` parameter; func_check_collision_valid's signature drops the
    now-unused `collider_info` argument. All three call sites updated.

(3) Swap-and-pop in the active-buffer removal paths. The original
    "linear search + linear shift to preserve insertion order" cost
    O(n_active) per removed geom; SAP only reads active_buffer as a set
    (no order dependency in the inner check), so swap-and-pop reduces
    each removal to O(1) and eliminates the inner `for k in range(j, ...)`
    shift. Applied to all three removal sites
    (active_buffer, active_buffer_hib, active_buffer_awake).

(4) Gate the min_buffer_idx / max_buffer_idx writes in the first-time
    initialization on `qd.static(use_hibernation)`. Those two arrays are
    only read inside hibernation-gated branches, so without hibernation
    the writes are dead stores; the static gate elides them at compile time.

Correctness verified via differential test against the pristine HEAD
(256 envs x 50 steps, FP32): position/velocity statistics match to within
floating-point noise floor; n_contacts_sum identical (2048 == 2048).

Scope: single file, +40 / -36 = +4 LoC net.
@yaoliu13
Copy link
Copy Markdown
Collaborator

This PR is not based on the latest amd-integration: perf/jlohia/broadphase-cleanup...ROCm:Genesis:amd-integration

@lohiaj
Copy link
Copy Markdown
Author

lohiaj commented Apr 28, 2026

Closing this in favor of the monolith-focused PR #34. The broadphase cleanup is small and no longer clears the current release-baseline bar for E2E impact.

@lohiaj lohiaj closed this Apr 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants