Use LDS in func_broad_phase kernel. #50

Open
chien-an-chen wants to merge 5 commits into amd-integration from
perf/chienach/lds_in_func_broad_phase

Conversation

@chien-an-chen chien-an-chen commented Apr 29, 2026

Summary

Switch back to using LDS in the func_broad_phase kernel (cherry-picked from 77c421b).

Restore the collision-normal-cache clearing logic.

Further optimizations:

  1. Use lds_sort_packed to store i_g and is_max in a single buffer.
  2. Remove the duplicate max_a_axis < min_b0 check.

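The (i_g, is_max) packing described above can be sketched as follows. This is an illustrative bit layout (is_max in the low bit), an assumption rather than the kernel's actual encoding:

```python
# Hypothetical sketch of packing a geom index and a min/max endpoint flag
# into one int, as lds_sort_packed does conceptually. The real kernel works
# on a shared (LDS) buffer; this shows only the assumed bit layout.

def pack(i_g: int, is_max: bool) -> int:
    # Geom index in the high bits, endpoint flag in bit 0.
    return (i_g << 1) | int(is_max)

def unpack(packed: int) -> tuple[int, bool]:
    # Invert the packing: shift off the flag bit, mask it back out.
    return packed >> 1, bool(packed & 1)
```

Packing both fields into one element halves the number of shared-memory slots the sort has to move per endpoint.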
@chien-an-chen chien-an-chen marked this pull request as ready for review April 29, 2026 08:17
Copilot AI review requested due to automatic review settings April 29, 2026 08:17

Copilot AI left a comment


Pull request overview

Switches the rigid-body broadphase SAP implementation back to an LDS/shared-memory fast path for small scenes, with additional packing/overlap-check optimizations.

Changes:

  • Introduces func_broad_phase_lds using shared arrays (LDS) for sorting and active list management when n_geoms <= MAX_GEOMS_IN_LDS.
  • Packs (i_g, is_max) into a single shared int buffer to reduce LDS traffic.
  • Adjusts AABB overlap checks and removes a redundant axis-specific early-out.
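For context, a sweep-and-prune (SAP) pass of the kind summarized above can be sketched in plain Python. The names and the per-axis overlap test are illustrative assumptions, not the actual LDS implementation in func_broad_phase_lds:

```python
# Minimal sweep-and-prune sketch: walk sorted endpoint events on one axis,
# maintain an active list of open intervals, and emit candidate pairs after
# checking overlap on the remaining two axes.

def sweep(sorted_endpoints, aabb_min, aabb_max):
    """sorted_endpoints: list of (value, i_g, is_max) sorted by value."""
    active, pairs = [], []
    for _value, i_g, is_max in sorted_endpoints:
        if is_max:
            active.remove(i_g)          # interval for i_g closes on the sweep axis
        else:
            for j_g in active:
                # Overlap on the sweep axis is implied by the sweep order;
                # only axes 1 and 2 still need an explicit AABB test.
                if all(aabb_min[i_g][k] <= aabb_max[j_g][k]
                       and aabb_min[j_g][k] <= aabb_max[i_g][k]
                       for k in (1, 2)):
                    pairs.append((min(i_g, j_g), max(i_g, j_g)))
            active.append(i_g)
    return pairs
```

The redundant early-out the PR removes is of the same flavor as the "implied by the sweep order" comment: once the sweep guarantees overlap on the sorted axis, re-checking it per pair is wasted work.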


@deepsek
Collaborator

deepsek commented Apr 29, 2026

/run-ci

…init.

only copy lds_sort_value to collider_state.sort_buffer.value when
use_hibernation=True.
@chien-an-chen
Author

/run-ci

1 similar comment
@chien-an-chen
Author

/run-ci

gpinkert added a commit that referenced this pull request Apr 30, 2026
…bgroup

Activate THREADS_PER_ENV=4 (BLOCK_DIM=64, ENVS_PER_BLOCK=16) and use the
4 lanes per env to cooperatively execute the warm-start AABB re-fill
phase, partitioning the 2*n_geoms events 4 ways. Sort and sweep stay
single-threaded on lane 0 due to sequential dependencies through
n_active and the in-place sort buffer.
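The 4-way partition of the 2*n_geoms warm-start events described above can be sketched as follows. THREADS_PER_ENV and the event count come from the commit message; the strided split is an assumed (but typical) way to divide the work:

```python
# Sketch of dividing 2*n_geoms endpoint events across the 4 lanes of an
# environment. Each lane i_t takes a strided slice, so all events are
# covered exactly once with no inter-lane coordination needed.

THREADS_PER_ENV = 4

def events_for_lane(i_t: int, n_geoms: int) -> list[int]:
    # Lane i_t handles events i_t, i_t + 4, i_t + 8, ... up to 2*n_geoms.
    return list(range(i_t, 2 * n_geoms, THREADS_PER_ENV))
```

A strided split keeps neighboring lanes on neighboring events, which tends to keep the per-lane global-memory accesses coalesced.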

Communication is via global memory + a single qd.simt.block.sync()
barrier; no LDS is allocated. This intentionally keeps the design clear
of the §6 LDS-occupancy regression that PR #50 hit at 14-18 KiB/WG.

Why:
* The prior analysis (Pattern P1) showed the JIT was launching 152
  lanes per env but only 1 was doing useful work (99% lane gating).
  T2.1 puts the wasted lanes to work on the warm-start re-fill, the
  only embarrassingly-parallel sub-phase in the SAP loop.
* Available subgroup primitives in quadrants/lang/simt/subgroup.py
  do not include shuffle_xor or any_true, which are needed for a full
  cooperative bitonic sort. Within those constraints, parallelizing
  the warm-start re-fill is the largest no-LDS subgroup-cooperative
  win available.

Hibernation path is intentionally not parallelized in T2.1.

Risk: medium. Restructures the per-env loop body to use threaded
indexing (i_thread, i_b, i_t). Adds an explicit qd.simt.block.sync()
barrier between phases. Pytest gate validation required.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
gpinkert added a commit that referenced this pull request Apr 30, 2026
…bgroup

Activate THREADS_PER_ENV=4 (BLOCK_DIM=64, ENVS_PER_BLOCK=16) and use the
4 lanes per env to cooperatively execute the warm-start AABB re-fill
phase, partitioning the 2*n_geoms events 4 ways. Sort and sweep stay
single-threaded on lane 0 due to sequential dependencies through
n_active and the in-place sort buffer.

Communication is via global memory + a single qd.simt.block.sync()
barrier; no LDS is allocated. This intentionally keeps the design clear
of the LDS-occupancy regression that PR #50 hit at 14-18 KiB/WG.

Why:
* The prior analysis (Pattern P1) showed the JIT was launching 152
  lanes per env but only 1 was doing useful work (99% lane gating).
  T2.1 puts the wasted lanes to work on the warm-start re-fill, the
  only embarrassingly-parallel sub-phase in the SAP loop.
* Available subgroup primitives in quadrants/lang/simt/subgroup.py
  do not include shuffle_xor or any_true, which are needed for a full
  cooperative bitonic sort. Within those constraints, parallelizing
  the warm-start re-fill is the largest no-LDS subgroup-cooperative
  win available.

Hibernation path is intentionally not parallelized.

This commit also folds in the vec3 AABB-load pattern from a previous
attempted commit (was: T1.4 vectorize AABB component loads). The vec3
reads were a stand-alone no-op at the bench level (JIT was already
coalescing the 6 scalar reads), but they make the source cleaner and
came along for free with the T2.1 restructure.

Measured (cx63, 3-run mean FPS @ 8192 envs, vs amd-integration baseline):
  baseline:           488.3 us k_main, 262.5 ms k_total, 138.2 FPS
  this commit (full): 413.8 us k_main, 213.7 ms k_total, 140.0 FPS
  delta:              -15.3% k_main, -18.6% k_total, +1.3% FPS

Risk: medium. Restructures the per-env loop body to use threaded
indexing (i_thread, i_b, i_t). Adds an explicit qd.simt.block.sync()
barrier between phases. Pytest gate on collision tests passed clean
on the prior Tier 1 stack; T2.1 itself awaiting full pytest run.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
gpinkert added a commit that referenced this pull request May 1, 2026
…bgroup
gpinkert added a commit that referenced this pull request May 1, 2026
…bgroup

3 participants