Use LDS in func_broad_phase kernel. #50

Open
chien-an-chen wants to merge 5 commits into amd-integration from
perf/chienach/lds_in_func_broad_phase

Conversation

@chien-an-chen chien-an-chen commented Apr 29, 2026

Summary

Switch back to using LDS in the func_broad_phase kernel (cherry-picked from 77c421b).

Restore the collision-normal-cache clearing logic.

Further optimizations:

  1. Use lds_sort_packed to store i_g and is_max in a single buffer.
  2. Remove the duplicate max_a_axis < min_b0 check.

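The (i_g, is_max) packing described above can be sketched as follows. This is an illustrative bit layout (is_max in the low bit), an assumption rather than the kernel's actual encoding:

```python
# Hypothetical sketch of packing a geom index and a min/max endpoint flag
# into one int, as lds_sort_packed does conceptually. The real kernel works
# on a shared (LDS) buffer; this shows only the assumed bit layout.

def pack(i_g: int, is_max: bool) -> int:
    # Geom index in the high bits, endpoint flag in bit 0.
    return (i_g << 1) | int(is_max)

def unpack(packed: int) -> tuple[int, bool]:
    # Invert the packing: shift off the flag bit, mask it back out.
    return packed >> 1, bool(packed & 1)
```

Packing both fields into one element halves the number of shared-memory slots the sort has to move per endpoint.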
@chien-an-chen chien-an-chen marked this pull request as ready for review April 29, 2026 08:17
Copilot AI review requested due to automatic review settings April 29, 2026 08:17

Copilot AI left a comment


Pull request overview

Switches the rigid-body broadphase SAP implementation back to an LDS/shared-memory fast path for small scenes, with additional packing/overlap-check optimizations.

Changes:

  • Introduces func_broad_phase_lds using shared arrays (LDS) for sorting and active list management when n_geoms <= MAX_GEOMS_IN_LDS.
  • Packs (i_g, is_max) into a single shared int buffer to reduce LDS traffic.
  • Adjusts AABB overlap checks and removes a redundant axis-specific early-out.
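For context, a sweep-and-prune (SAP) pass of the kind summarized above can be sketched in plain Python. The names and the per-axis overlap test are illustrative assumptions, not the actual LDS implementation in func_broad_phase_lds:

```python
# Minimal sweep-and-prune sketch: walk sorted endpoint events on one axis,
# maintain an active list of open intervals, and emit candidate pairs after
# checking overlap on the remaining two axes.

def sweep(sorted_endpoints, aabb_min, aabb_max):
    """sorted_endpoints: list of (value, i_g, is_max) sorted by value."""
    active, pairs = [], []
    for _value, i_g, is_max in sorted_endpoints:
        if is_max:
            active.remove(i_g)          # interval for i_g closes on the sweep axis
        else:
            for j_g in active:
                # Overlap on the sweep axis is implied by the sweep order;
                # only axes 1 and 2 still need an explicit AABB test.
                if all(aabb_min[i_g][k] <= aabb_max[j_g][k]
                       and aabb_min[j_g][k] <= aabb_max[i_g][k]
                       for k in (1, 2)):
                    pairs.append((min(i_g, j_g), max(i_g, j_g)))
            active.append(i_g)
    return pairs
```

The redundant early-out the PR removes is of the same flavor as the "implied by the sweep order" comment: once the sweep guarantees overlap on the sorted axis, re-checking it per pair is wasted work.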


@deepsek
Collaborator

deepsek commented Apr 29, 2026

/run-ci

…init.

only copy lds_sort_value to collider_state.sort_buffer.value when
use_hibernation=True.
@chien-an-chen
Author

/run-ci

1 similar comment
@chien-an-chen
Author

/run-ci

gpinkert added a commit that referenced this pull request Apr 30, 2026
…bgroup

Activate THREADS_PER_ENV=4 (BLOCK_DIM=64, ENVS_PER_BLOCK=16) and use the
4 lanes per env to cooperatively execute the warm-start AABB re-fill
phase, partitioning the 2*n_geoms events 4 ways. Sort and sweep stay
single-threaded on lane 0 due to sequential dependencies through
n_active and the in-place sort buffer.
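The 4-way partition of the 2*n_geoms warm-start events described above can be sketched as follows. THREADS_PER_ENV and the event count come from the commit message; the strided split is an assumed (but typical) way to divide the work:

```python
# Sketch of dividing 2*n_geoms endpoint events across the 4 lanes of an
# environment. Each lane i_t takes a strided slice, so all events are
# covered exactly once with no inter-lane coordination needed.

THREADS_PER_ENV = 4

def events_for_lane(i_t: int, n_geoms: int) -> list[int]:
    # Lane i_t handles events i_t, i_t + 4, i_t + 8, ... up to 2*n_geoms.
    return list(range(i_t, 2 * n_geoms, THREADS_PER_ENV))
```

A strided split keeps neighboring lanes on neighboring events, which tends to keep the per-lane global-memory accesses coalesced.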

Communication is via global memory + a single qd.simt.block.sync()
barrier; no LDS is allocated. This intentionally keeps the design clear
of the §6 LDS-occupancy regression that PR #50 hit at 14-18 KiB/WG.

Why:
* The prior analysis (Pattern P1) showed the JIT was launching 152
  lanes per env but only 1 was doing useful work (99% lane gating).
  T2.1 puts the wasted lanes to work on the warm-start re-fill, the
  only embarrassingly-parallel sub-phase in the SAP loop.
* Available subgroup primitives in quadrants/lang/simt/subgroup.py
  do not include shuffle_xor or any_true, which are needed for a full
  cooperative bitonic sort. Within those constraints, parallelizing
  the warm-start re-fill is the largest no-LDS subgroup-cooperative
  win available.

Hibernation path is intentionally not parallelized in T2.1.

Risk: medium. Restructures the per-env loop body to use threaded
indexing (i_thread, i_b, i_t). Adds an explicit qd.simt.block.sync()
barrier between phases. Pytest gate validation required.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
gpinkert added a commit that referenced this pull request Apr 30, 2026
…bgroup

Activate THREADS_PER_ENV=4 (BLOCK_DIM=64, ENVS_PER_BLOCK=16) and use the
4 lanes per env to cooperatively execute the warm-start AABB re-fill
phase, partitioning the 2*n_geoms events 4 ways. Sort and sweep stay
single-threaded on lane 0 due to sequential dependencies through
n_active and the in-place sort buffer.

Communication is via global memory + a single qd.simt.block.sync()
barrier; no LDS is allocated. This intentionally keeps the design clear
of the LDS-occupancy regression that PR #50 hit at 14-18 KiB/WG.

Why:
* The prior analysis (Pattern P1) showed the JIT was launching 152
  lanes per env but only 1 was doing useful work (99% lane gating).
  T2.1 puts the wasted lanes to work on the warm-start re-fill, the
  only embarrassingly-parallel sub-phase in the SAP loop.
* Available subgroup primitives in quadrants/lang/simt/subgroup.py
  do not include shuffle_xor or any_true, which are needed for a full
  cooperative bitonic sort. Within those constraints, parallelizing
  the warm-start re-fill is the largest no-LDS subgroup-cooperative
  win available.

Hibernation path is intentionally not parallelized.

This commit also folds in the vec3 AABB-load pattern from a previous
attempted commit (was: T1.4 vectorize AABB component loads). The vec3
reads were a stand-alone no-op at the bench level (JIT was already
coalescing the 6 scalar reads), but they make the source cleaner and
came along for free with the T2.1 restructure.

Measured (cx63, 3-run mean FPS @ 8192 envs, vs amd-integration baseline):
  baseline:           488.3 us k_main, 262.5 ms k_total, 138.2 FPS
  this commit (full): 413.8 us k_main, 213.7 ms k_total, 140.0 FPS
  delta:              -15.3% k_main, -18.6% k_total, +1.3% FPS

Risk: medium. Restructures the per-env loop body to use threaded
indexing (i_thread, i_b, i_t). Adds an explicit qd.simt.block.sync()
barrier between phases. Pytest gate on collision tests passed clean
on the prior Tier 1 stack; T2.1 itself awaiting full pytest run.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
gpinkert added a commit that referenced this pull request May 1, 2026
…bgroup
gpinkert added a commit that referenced this pull request May 1, 2026
…bgroup

3 participants