Restore clearing collision normal cache in broad_phase #50

Open
chien-an-chen wants to merge 2 commits into amd-integration from perf/chienach/lds_in_func_broad_phase

Conversation


@chien-an-chen chien-an-chen commented Apr 29, 2026

Summary

Restore the clearing of the collision normal cache to fix a potential accuracy issue.

@chien-an-chen chien-an-chen marked this pull request as ready for review April 29, 2026 08:17
Copilot AI review requested due to automatic review settings April 29, 2026 08:17

Copilot AI left a comment


Pull request overview

Switches the rigid-body broadphase SAP implementation back to an LDS/shared-memory fast path for small scenes, with additional packing/overlap-check optimizations.

Changes:

  • Introduces func_broad_phase_lds using shared arrays (LDS) for sorting and active list management when n_geoms <= MAX_GEOMS_IN_LDS.
  • Packs (i_g, is_max) into a single shared int buffer to reduce LDS traffic.
  • Adjusts AABB overlap checks and removes a redundant axis-specific early-out.
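The packing and overlap-check ideas in the bullets above can be sketched in plain Python. The actual bit layout and function names in the PR are not shown in this page, so `pack_event`, `unpack_event`, `aabb_overlap`, and the choice of the low bit for `is_max` are illustrative assumptions:

```python
def pack_event(i_g: int, is_max: bool) -> int:
    # Pack a geometry index and its min/max endpoint flag into one int,
    # so a single shared (LDS) buffer can hold both and halve LDS traffic.
    # Assumed layout: low bit = is_max flag, remaining bits = i_g.
    return (i_g << 1) | int(is_max)

def unpack_event(packed: int) -> tuple[int, bool]:
    # Recover (i_g, is_max) from a packed sort event.
    return packed >> 1, bool(packed & 1)

def aabb_overlap(min_a, max_a, min_b, max_b) -> bool:
    # Axis-aligned bounding-box test on 3-component tuples:
    # two boxes overlap iff their intervals overlap on every axis.
    return all(min_a[k] <= max_b[k] and min_b[k] <= max_a[k] for k in range(3))
```

With this layout, sorting the packed ints by endpoint coordinate keeps each event's flag attached to its geometry index for free during the sweep.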


Comment thread genesis/engine/solvers/rigid/collider/broadphase.py
Comment thread genesis/engine/solvers/rigid/collider/broadphase.py Outdated
@deepsek
Collaborator

deepsek commented Apr 29, 2026

/run-ci

2 similar comments
@chien-an-chen
Author

/run-ci

@chien-an-chen
Author

/run-ci

gpinkert added a commit that referenced this pull request Apr 30, 2026
…bgroup

Activate THREADS_PER_ENV=4 (BLOCK_DIM=64, ENVS_PER_BLOCK=16) and use the
4 lanes per env to cooperatively execute the warm-start AABB re-fill
phase, partitioning the 2*n_geoms events 4 ways. Sort and sweep stay
single-threaded on lane 0 due to sequential dependencies through
n_active and the in-place sort buffer.

Communication is via global memory + a single qd.simt.block.sync()
barrier; no LDS is allocated. This intentionally keeps the design clear
of the §6 LDS-occupancy regression that PR #50 hit at 14-18 KiB/WG.

Why:
* The prior analysis (Pattern P1) showed the JIT was launching 152
  lanes per env but only 1 was doing useful work (99% lane gating).
  T2.1 puts the wasted lanes to work on the warm-start re-fill, the
  only embarrassingly-parallel sub-phase in the SAP loop.
* Available subgroup primitives in quadrants/lang/simt/subgroup.py
  do not include shuffle_xor or any_true, which are needed for a full
  cooperative bitonic sort. Within those constraints, parallelizing
  the warm-start re-fill is the largest no-LDS subgroup-cooperative
  win available.

Hibernation path is intentionally not parallelized in T2.1.

Risk: medium. Restructures the per-env loop body to use threaded
indexing (i_thread, i_b, i_t). Adds an explicit qd.simt.block.sync()
barrier between phases. Pytest gate validation required.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
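The "partitioning the 2*n_geoms events 4 ways" idea from the commit message can be sketched in plain Python. Whether the real kernel uses a strided or chunked split is not stated, so the strided partition below is an assumption for illustration:

```python
THREADS_PER_ENV = 4  # lanes cooperating on one environment, per the commit

def refill_partition(n_geoms: int, i_thread: int) -> list[int]:
    # Warm-start AABB re-fill: split the 2*n_geoms sort events across the
    # lanes so each lane writes a disjoint slice of the event buffer.
    # A strided split keeps neighboring lanes on neighboring events.
    return list(range(i_thread, 2 * n_geoms, THREADS_PER_ENV))
```

Each event index lands in exactly one lane's slice, so no atomics are needed for the re-fill; after the single barrier, lane 0 alone runs the sequential sort and sweep, matching the lane-gating described above.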
gpinkert added a commit that referenced this pull request Apr 30, 2026
…bgroup

Activate THREADS_PER_ENV=4 (BLOCK_DIM=64, ENVS_PER_BLOCK=16) and use the
4 lanes per env to cooperatively execute the warm-start AABB re-fill
phase, partitioning the 2*n_geoms events 4 ways. Sort and sweep stay
single-threaded on lane 0 due to sequential dependencies through
n_active and the in-place sort buffer.

Communication is via global memory + a single qd.simt.block.sync()
barrier; no LDS is allocated. This intentionally keeps the design clear
of the LDS-occupancy regression that PR #50 hit at 14-18 KiB/WG.

Why:
* The prior analysis (Pattern P1) showed the JIT was launching 152
  lanes per env but only 1 was doing useful work (99% lane gating).
  T2.1 puts the wasted lanes to work on the warm-start re-fill, the
  only embarrassingly-parallel sub-phase in the SAP loop.
* Available subgroup primitives in quadrants/lang/simt/subgroup.py
  do not include shuffle_xor or any_true, which are needed for a full
  cooperative bitonic sort. Within those constraints, parallelizing
  the warm-start re-fill is the largest no-LDS subgroup-cooperative
  win available.

Hibernation path is intentionally not parallelized.

This commit also folds in the vec3 AABB-load pattern from a previous
attempted commit (was: T1.4 vectorize AABB component loads). The vec3
reads were a stand-alone no-op at the bench level (JIT was already
coalescing the 6 scalar reads), but they make the source cleaner and
came along for free with the T2.1 restructure.

Measured (cx63, 3-run mean FPS @ 8192 envs, vs amd-integration baseline):
  baseline:           488.3 us k_main, 262.5 ms k_total, 138.2 FPS
  this commit (full): 413.8 us k_main, 213.7 ms k_total, 140.0 FPS
  delta:              -15.3% k_main, -18.6% k_total, +1.3% FPS

Risk: medium. Restructures the per-env loop body to use threaded
indexing (i_thread, i_b, i_t). Adds an explicit qd.simt.block.sync()
barrier between phases. Pytest gate on collision tests passed clean
on the prior Tier 1 stack; T2.1 itself awaiting full pytest run.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
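The deltas in the measured block above are plain percentage changes against the baseline; a quick arithmetic check using the numbers quoted in the commit message:

```python
def pct_delta(baseline: float, value: float) -> float:
    # Percentage change of `value` relative to `baseline`, rounded to 0.1.
    return round((value - baseline) / baseline * 100, 1)

# k_main (us), k_total (ms), FPS pairs from the commit message above
deltas = [
    pct_delta(488.3, 413.8),  # k_main
    pct_delta(262.5, 213.7),  # k_total
    pct_delta(138.2, 140.0),  # FPS
]
# reproduces the quoted -15.3% k_main, -18.6% k_total, +1.3% FPS
```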
gpinkert added a commit that referenced this pull request May 1, 2026
gpinkert added a commit that referenced this pull request May 1, 2026
@yaoliu13 yaoliu13 force-pushed the perf/chienach/lds_in_func_broad_phase branch from cb3421d to 310c882 on May 3, 2026 06:47
@yaoliu13
Collaborator

yaoliu13 commented May 3, 2026

/run-ci

Collaborator

@yaoliu13 yaoliu13 left a comment


This PR has conflicts if we want to merge it to amd-integration. Please read Confluence page "Genesis PR Review Process".

@chien-an-chen chien-an-chen force-pushed the perf/chienach/lds_in_func_broad_phase branch from 310c882 to 53d345f on May 6, 2026 01:06
@chien-an-chen chien-an-chen changed the title from "Use LDS in func_broad_phase kernel." to "Restore clearing collision normal cache in broad_phase" on May 6, 2026
@chien-an-chen
Author

/run-ci

@yaoliu13 yaoliu13 force-pushed the perf/chienach/lds_in_func_broad_phase branch from 53d345f to fd421e6 on May 6, 2026 06:35
@yaoliu13
Collaborator

yaoliu13 commented May 6, 2026

/run-ci

1 similar comment
@yaoliu13
Collaborator

yaoliu13 commented May 6, 2026

/run-ci

@yaoliu13
Collaborator

yaoliu13 commented May 7, 2026

pre-submit is not good.
