Restore clearing collision normal cache in broad_phase #50

Open
chien-an-chen wants to merge 2 commits into amd-integration from perf/chienach/lds_in_func_broad_phase

Conversation


@chien-an-chen chien-an-chen commented Apr 29, 2026

Summary

Restore the clearing of the collision normal cache to fix a potential accuracy issue.

@chien-an-chen chien-an-chen marked this pull request as ready for review April 29, 2026 08:17
Copilot AI review requested due to automatic review settings April 29, 2026 08:17

Copilot AI left a comment


Pull request overview

Switches the rigid-body broadphase SAP implementation back to an LDS/shared-memory fast path for small scenes, with additional packing/overlap-check optimizations.

Changes:

  • Introduces func_broad_phase_lds using shared arrays (LDS) for sorting and active list management when n_geoms <= MAX_GEOMS_IN_LDS.
  • Packs (i_g, is_max) into a single shared int buffer to reduce LDS traffic.
  • Adjusts AABB overlap checks and removes a redundant axis-specific early-out.
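The packing and overlap-check ideas in the bullets above can be sketched in plain Python. The actual bit layout and function names in the PR are not shown in this page, so `pack_event`, `unpack_event`, `aabb_overlap`, and the choice of the low bit for `is_max` are illustrative assumptions:

```python
def pack_event(i_g: int, is_max: bool) -> int:
    # Pack a geometry index and its min/max endpoint flag into one int,
    # so a single shared (LDS) buffer can hold both and halve LDS traffic.
    # Assumed layout: low bit = is_max flag, remaining bits = i_g.
    return (i_g << 1) | int(is_max)

def unpack_event(packed: int) -> tuple[int, bool]:
    # Recover (i_g, is_max) from a packed sort event.
    return packed >> 1, bool(packed & 1)

def aabb_overlap(min_a, max_a, min_b, max_b) -> bool:
    # Axis-aligned bounding-box test on 3-component tuples:
    # two boxes overlap iff their intervals overlap on every axis.
    return all(min_a[k] <= max_b[k] and min_b[k] <= max_a[k] for k in range(3))
```

With this layout, sorting the packed ints by endpoint coordinate keeps each event's flag attached to its geometry index for free during the sweep.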


Comment thread genesis/engine/solvers/rigid/collider/broadphase.py
Comment thread genesis/engine/solvers/rigid/collider/broadphase.py Outdated
@deepsek
Collaborator

deepsek commented Apr 29, 2026

/run-ci

2 similar comments
@chien-an-chen
Author

/run-ci

@chien-an-chen
Author

/run-ci

gpinkert added a commit that referenced this pull request Apr 30, 2026
…bgroup

Activate THREADS_PER_ENV=4 (BLOCK_DIM=64, ENVS_PER_BLOCK=16) and use the
4 lanes per env to cooperatively execute the warm-start AABB re-fill
phase, partitioning the 2*n_geoms events 4 ways. Sort and sweep stay
single-threaded on lane 0 due to sequential dependencies through
n_active and the in-place sort buffer.

Communication is via global memory + a single qd.simt.block.sync()
barrier; no LDS is allocated. This intentionally keeps the design clear
of the §6 LDS-occupancy regression that PR #50 hit at 14-18 KiB/WG.

Why:
* The prior analysis (Pattern P1) showed the JIT was launching 152
  lanes per env but only 1 was doing useful work (99% lane gating).
  T2.1 puts the wasted lanes to work on the warm-start re-fill, the
  only embarrassingly-parallel sub-phase in the SAP loop.
* Available subgroup primitives in quadrants/lang/simt/subgroup.py
  do not include shuffle_xor or any_true, which are needed for a full
  cooperative bitonic sort. Within those constraints, parallelizing
  the warm-start re-fill is the largest no-LDS subgroup-cooperative
  win available.

Hibernation path is intentionally not parallelized in T2.1.

Risk: medium. Restructures the per-env loop body to use threaded
indexing (i_thread, i_b, i_t). Adds an explicit qd.simt.block.sync()
barrier between phases. Pytest gate validation required.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
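The "partitioning the 2*n_geoms events 4 ways" idea from the commit message can be sketched in plain Python. Whether the real kernel uses a strided or chunked split is not stated, so the strided partition below is an assumption for illustration:

```python
THREADS_PER_ENV = 4  # lanes cooperating on one environment, per the commit

def refill_partition(n_geoms: int, i_thread: int) -> list[int]:
    # Warm-start AABB re-fill: split the 2*n_geoms sort events across the
    # lanes so each lane writes a disjoint slice of the event buffer.
    # A strided split keeps neighboring lanes on neighboring events.
    return list(range(i_thread, 2 * n_geoms, THREADS_PER_ENV))
```

Each event index lands in exactly one lane's slice, so no atomics are needed for the re-fill; after the single barrier, lane 0 alone runs the sequential sort and sweep, matching the lane-gating described above.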
gpinkert added a commit that referenced this pull request Apr 30, 2026
…bgroup

Activate THREADS_PER_ENV=4 (BLOCK_DIM=64, ENVS_PER_BLOCK=16) and use the
4 lanes per env to cooperatively execute the warm-start AABB re-fill
phase, partitioning the 2*n_geoms events 4 ways. Sort and sweep stay
single-threaded on lane 0 due to sequential dependencies through
n_active and the in-place sort buffer.

Communication is via global memory + a single qd.simt.block.sync()
barrier; no LDS is allocated. This intentionally keeps the design clear
of the LDS-occupancy regression that PR #50 hit at 14-18 KiB/WG.

Why:
* The prior analysis (Pattern P1) showed the JIT was launching 152
  lanes per env but only 1 was doing useful work (99% lane gating).
  T2.1 puts the wasted lanes to work on the warm-start re-fill, the
  only embarrassingly-parallel sub-phase in the SAP loop.
* Available subgroup primitives in quadrants/lang/simt/subgroup.py
  do not include shuffle_xor or any_true, which are needed for a full
  cooperative bitonic sort. Within those constraints, parallelizing
  the warm-start re-fill is the largest no-LDS subgroup-cooperative
  win available.

Hibernation path is intentionally not parallelized.

This commit also folds in the vec3 AABB-load pattern from a previous
attempted commit (was: T1.4 vectorize AABB component loads). The vec3
reads were a stand-alone no-op at the bench level (JIT was already
coalescing the 6 scalar reads), but they make the source cleaner and
came along for free with the T2.1 restructure.

Measured (cx63, 3-run mean FPS @ 8192 envs, vs amd-integration baseline):
  baseline:           488.3 us k_main, 262.5 ms k_total, 138.2 FPS
  this commit (full): 413.8 us k_main, 213.7 ms k_total, 140.0 FPS
  delta:              -15.3% k_main, -18.6% k_total, +1.3% FPS

Risk: medium. Restructures the per-env loop body to use threaded
indexing (i_thread, i_b, i_t). Adds an explicit qd.simt.block.sync()
barrier between phases. Pytest gate on collision tests passed clean
on the prior Tier 1 stack; T2.1 itself awaiting full pytest run.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
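The deltas in the measured block above are plain percentage changes against the baseline; a quick arithmetic check using the numbers quoted in the commit message:

```python
def pct_delta(baseline: float, value: float) -> float:
    # Percentage change of `value` relative to `baseline`, rounded to 0.1.
    return round((value - baseline) / baseline * 100, 1)

# k_main (us), k_total (ms), FPS pairs from the commit message above
deltas = [
    pct_delta(488.3, 413.8),  # k_main
    pct_delta(262.5, 213.7),  # k_total
    pct_delta(138.2, 140.0),  # FPS
]
# reproduces the quoted -15.3% k_main, -18.6% k_total, +1.3% FPS
```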
gpinkert added a commit that referenced this pull request May 1, 2026
gpinkert added a commit that referenced this pull request May 1, 2026
@yaoliu13 yaoliu13 force-pushed the perf/chienach/lds_in_func_broad_phase branch from cb3421d to 310c882 on May 3, 2026 06:47
@yaoliu13
Collaborator

yaoliu13 commented May 3, 2026

/run-ci

Collaborator

@yaoliu13 yaoliu13 left a comment


This PR has conflicts if we want to merge it to amd-integration. Please read Confluence page "Genesis PR Review Process".

@chien-an-chen chien-an-chen force-pushed the perf/chienach/lds_in_func_broad_phase branch from 310c882 to 53d345f on May 6, 2026 01:06
@chien-an-chen chien-an-chen changed the title from "Use LDS in func_broad_phase kernel." to "Restore clearing collision normal cache in broad_phase" on May 6, 2026
@chien-an-chen
Author

/run-ci

@yaoliu13 yaoliu13 force-pushed the perf/chienach/lds_in_func_broad_phase branch from 53d345f to fd421e6 on May 6, 2026 06:35
@yaoliu13
Collaborator

yaoliu13 commented May 6, 2026

/run-ci

1 similar comment
@yaoliu13
Collaborator

yaoliu13 commented May 6, 2026

/run-ci

@yaoliu13
Collaborator

yaoliu13 commented May 7, 2026

pre-submit is not good.
