broadphase improvements #60
Conversation
Force-pushed 0f5e425 to 13dda1f
/run-ci
Force-pushed 1f086d9 to 371c8f2
/run-ci
    # Filter out collision pairs that are involved in dynamically registered weld equality constraints.
    # Use hoisted n_eq_static / n_eq_dyn so that envs with no dynamic equalities
    # skip the per-pair scan entirely (no scattered qd_n_equalities load per pair).
    if n_eq_dyn > n_eq_static:
Can you explain why this check is required? It seems redundant to me, but I might be missing something. The way I see it, you've hoisted the load, and both n_eq_dyn and n_eq_static should already be in registers.
To clarify, you mean why not just check for this condition before executing the kernel?
No, as in: why have the additional if n_eq_dyn > n_eq_static check? Wouldn't the range bounds already prevent the loop body from executing when n_eq_dyn <= n_eq_static?
Why do:

    if n_eq_dyn > n_eq_static:
        for i_eq in range(n_eq_static, n_eq_dyn):
            # Loop body

instead of:

    for i_eq in range(n_eq_static, n_eq_dyn):
        # Loop body
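The reviewer's point can be checked directly: Python's range(start, stop) yields zero iterations whenever stop <= start, so the guard cannot change behavior. A minimal sketch (the function bodies are stand-ins, not the actual kernel code):

```python
# range(start, stop) is empty when stop <= start, so the explicit guard
# is not needed for correctness; both variants behave identically.

def filter_pairs_guarded(n_eq_static, n_eq_dyn):
    checked = []
    if n_eq_dyn > n_eq_static:                     # explicit guard
        for i_eq in range(n_eq_static, n_eq_dyn):
            checked.append(i_eq)
    return checked

def filter_pairs_unguarded(n_eq_static, n_eq_dyn):
    checked = []
    for i_eq in range(n_eq_static, n_eq_dyn):      # empty range -> zero iterations
        checked.append(i_eq)
    return checked

assert filter_pairs_guarded(3, 3) == filter_pairs_unguarded(3, 3) == []
assert filter_pairs_guarded(3, 7) == filter_pairs_unguarded(3, 7) == [3, 4, 5, 6]
```

At most the guard skips the (cheap) loop setup; whether that matters on the GPU backend is a separate question from correctness.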
Force-pushed 9a8a646 to 5a95ebc
kevinjosephamd
left a comment
Reviewed the first 3 commits. Took a quick glance at commit 4; it looks like a typical thread-coarsening optimization. As for the AABB vectorization, I'll defer to someone else with more expertise to review.
Force-pushed 5a95ebc to 68dd67c
/run-ci
yaoliu13
left a comment
This PR has conflicts if we want to merge it to amd-integration. Please read the Confluence page "Genesis PR Review Process".
Force-pushed 68dd67c to f56136e
Force-pushed f56136e to b767637
/run-ci
Force-pushed b767637 to 8ebf5c6
/run-ci
Force-pushed 8ebf5c6 to 21cfb1e
/run-ci
/run-ci
This PR reduces kernel launches, scattered loads, and SoA streams in broadphase.
Problem: func_collision_clear was a self-contained kernel with its own qd.loop_config / per-env outer loop. Because it ran as a separate kernel, each broadphase step paid one extra HIP dispatch and a contact_data RMW round-trip through global memory between the clear kernel and the main sweep kernel.

Fix: collapse the clear into the same kernel as the SAP sweep so the contact_data writes land in cache and are immediately consumed by the sweep, with no inter-kernel synchronization point.

Implementation: refactor func_collision_clear into a per-env helper func_collision_clear_per_env(i_b, ...) that drops the outer loop and qd.loop_config. Call it from the top of func_broad_phase's existing per-env loop body. Drop the now-unused import in collider.py.

Eliminates one HIP dispatch per simulation step and keeps the cleared contact state in cache for the immediately-following SAP sweep.
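The shape of the fusion can be sketched in plain Python. This is a hedged illustration, not the kernel code: the data structures (n_contacts, events, pairs_out) and the sweep body are invented stand-ins; only the two function names come from the commit.

```python
# Sketch: the standalone clear kernel becomes a per-env helper called at
# the top of the sweep kernel's existing per-env loop, so both passes
# share one launch and no inter-kernel sync point sits between them.

def func_collision_clear_per_env(i_b, n_contacts):
    # Per-env helper: no outer env loop, no loop_config of its own.
    n_contacts[i_b] = 0

def func_broad_phase(n_envs, n_contacts, events, pairs_out):
    for i_b in range(n_envs):                         # one kernel, one per-env loop
        func_collision_clear_per_env(i_b, n_contacts)  # fused clear
        # The sweep consumes the cleared state immediately.
        for a, b in events[i_b]:
            pairs_out[i_b].append((a, b))
            n_contacts[i_b] += 1

n_contacts = [99, 99]                                 # stale counts from last step
pairs_out = [[], []]
events = [[(0, 1)], [(2, 3), (4, 5)]]
func_broad_phase(2, n_contacts, events, pairs_out)
assert n_contacts == [1, 2]
```

In the real kernel the payoff is the removed HIP dispatch and the contact_data round-trip through global memory; the Python version only shows the control-flow restructuring.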
Problem: func_check_collision_valid is invoked once per candidate broad pair, so it runs in the inner loop of the SAP sweep. Inside, it loops for i_eq in range(n_equalities[None], qd_n_equalities[i_b]) to filter pairs against dynamically-registered weld equalities. The qd_n_equalities[i_b] read is a scattered per-env i32 load with ~50-cycle latency, paid on every call. When no dynamic equalities are registered (the common case for typical scenes), the loop has zero iterations but the load still happens.

Fix: hoist both equality bounds to once-per-env so the per-pair check sees them as register-resident scalars. The per-pair loop then becomes for i_eq in range(n_eq_static, n_eq_dyn), which is naturally a no-op when there are no dynamic equalities (n_eq_dyn == n_eq_static).

Implementation: add n_eq_static and n_eq_dyn parameters to func_check_collision_valid. Compute them once at the top of func_broad_phase's per-env loop. Update all three call sites.
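The load-count difference can be made concrete with a small plain-Python model. CountingArray and the sweep bodies are invented for illustration; it tallies element reads as a stand-in for the scattered per-env global load described above.

```python
# Model: hoisting a per-env bound out of the per-pair inner loop turns
# one scattered load per candidate pair into one load per env.

class CountingArray:
    def __init__(self, data):
        self.data = list(data)
        self.loads = 0
    def __getitem__(self, i):
        self.loads += 1            # count every element read
        return self.data[i]

def sweep_unhoisted(i_b, pairs, n_eq_static, qd_n_equalities):
    for _pair in pairs:
        # One scattered load per candidate pair, even when the range is empty.
        for i_eq in range(n_eq_static, qd_n_equalities[i_b]):
            pass                   # equality filtering would happen here

def sweep_hoisted(i_b, pairs, n_eq_static, qd_n_equalities):
    n_eq_dyn = qd_n_equalities[i_b]   # single load, then register-resident
    for _pair in pairs:
        for i_eq in range(n_eq_static, n_eq_dyn):
            pass

arr = CountingArray([2])           # env 0: no dynamic equalities beyond static
sweep_unhoisted(0, range(1000), 2, arr)
unhoisted_loads = arr.loads        # one load per candidate pair
arr.loads = 0
sweep_hoisted(0, range(1000), 2, arr)
hoisted_loads = arr.loads          # one load per env
assert (unhoisted_loads, hoisted_loads) == (1000, 1)
```

The ~50-cycle latency figure from the commit message is per scattered load, so on the common zero-dynamic-equality path the hoist removes essentially all of that cost from the inner loop.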
Problem: sort_buffer was three Structure-of-Arrays fields -- value
(fp32), i_g (i32), and is_max (u1/bool). Every iteration of the
insertion sort and SAP sweep touches all three fields per element.
With three independent global arrays this means three distinct
cache-line streams and three separate load instructions per element
in the hot inner loops.
Fix: pack i_g and is_max into a single field where bit 0 = is_max and
bits 1..31 = i_g. The fp32 value field is left untouched.
The packed field is typed u32 (not the project-default i32) for two
reasons:
* Range parity: with u32 we get bits 1..31 = 31 bits of i_g, matching
the original signed-i32 non-negative range (~2.1 B values). With
signed i32 the encoding would have stolen the sign bit too, halving
the range to ~1.07 B. Either ceiling is far above any realistic
geom count, but recovering full parity is one dtype choice.
* Shift semantics: u32 right-shift is unambiguously logical
(zero-fill), so the unpack helper does not depend on the high bit
being clear. This makes the encoding robust if anyone ever lifts
the i_g range to use bits up through 31.
Defensive layer to make misuse loud rather than silent:
* Three @qd.func helpers in broadphase.py are the single source of
truth for the encoding:
func_pack_event(i_g, is_max) -> u32
func_unpack_i_g(packed) -> i32
func_unpack_is_max(packed) -> bool
All readers/writers of sort_buffer.i_g_packed go through them. Any
future bit-layout change is one edit, not thirty. The module-level
comment block above the helpers documents the layout invariant.
* StructSortBuffer's docstring states the bit layout, dtype rationale,
and the rule that callers must use the helpers (not raw bit ops).
* get_sort_buffer asserts solver.n_geoms_ <= SORT_BUFFER_MAX_GEOMS
(= 2**31 - 1) at scene-build time. Overflow would silently corrupt
the is_max bit during packing; the assert turns that into a loud
failure with an actionable message ("promote to qd.u64").
Implementation: replace StructSortBuffer.{i_g, is_max} with
i_g_packed: u32 in array_class.py. Add the three pack/unpack helpers
plus a module-level layout comment in broadphase.py. Update every
sort_buffer.{i_g, is_max} access site to call the helpers. In the
hibernation branch, hoist the previously-redundant inline reloads of
sort_buffer.i_g_packed[i, i_b] into a single 'packed' local at the
top of the loop body so each iteration loads the field exactly once.
Leave value as a separate fp32 field so the hibernation path in
forward_kinematics.py that writes sort_buffer.value continues to
work without changes.
Reduces the sort_buffer SoA stream count from 3 to 2 in the inner
loops.
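The documented encoding contract can be mirrored in plain Python. The helper names and the bit layout come from the commit message; the bodies here are illustrative stand-ins for the real @qd.func device functions (Python ints are unbounded, so a u32 mask makes the 32-bit semantics explicit).

```python
# Layout per the contract: bit 0 = is_max flag, bits 1..31 = i_g.

SORT_BUFFER_MAX_GEOMS = 2**31 - 1  # largest i_g representable in bits 1..31
U32_MASK = 0xFFFFFFFF

def func_pack_event(i_g, is_max):
    # Mirrors the build-time assert: overflow would corrupt the is_max bit.
    assert 0 <= i_g <= SORT_BUFFER_MAX_GEOMS, "promote to u64"
    return ((i_g << 1) | (1 if is_max else 0)) & U32_MASK

def func_unpack_i_g(packed):
    # Logical (zero-fill) shift, as u32 semantics guarantee: the result
    # never sign-extends even when bit 31 of i_g is in use.
    return packed >> 1

def func_unpack_is_max(packed):
    return bool(packed & 1)

# Round-trip every combination, including the largest representable index.
for i_g in (0, 1, 12345, SORT_BUFFER_MAX_GEOMS):
    for is_max in (False, True):
        p = func_pack_event(i_g, is_max)
        assert func_unpack_i_g(p) == i_g
        assert func_unpack_is_max(p) == is_max
```

The round-trip over SORT_BUFFER_MAX_GEOMS also demonstrates the range-parity argument: 31 payload bits cover the full non-negative signed-i32 range, which a signed-i32 packed field would have halved.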
…ve phase) Adapts perf/broadphase-stack commits 1+2+3 to live on top of PR #56's LDS infrastructure. Cooperative parallel re-fill (the original commit 4) is intentionally excluded for this experiment because adding qd.simt.block.sync() to a loop that doesn't fill block_dim=64 (e.g. n_envs=1 -> 4 active lanes) triggers a workgroup-barrier UB and a GPU memory-access fault.

Changes inside func_broad_phase_lds:
* Replace the now-removed func_collision_clear() kernel call with the per-env helper func_collision_clear_per_env(i_b, ...).
* Hoist n_eq_static / n_eq_dyn at the top of the per-env block and pass them to all 3 func_check_collision_valid call sites in this path.
* Switch lds_sort_packed dtype from gs.qd_int to qd.u32 to match the StructSortBuffer.i_g_packed encoding contract.
* Replace open-coded (i_g << 1) / >> 1 / & 1 patterns with func_pack_event / func_unpack_i_g / func_unpack_is_max throughout.
* Fix the post-cherry-pick broken sort_buffer.{is_max, i_g} refs (8 sites) to use the renamed sort_buffer.i_g_packed field via helpers.
* Collapse the write-back loop from 2 stores per event (separate i_g + is_max) to 1 store (i_g_packed).
Force-pushed 21cfb1e to 926924a
/run-ci
/run-ci
Misc perf updates