broadphase improvements#60

Open
gpinkert wants to merge 4 commits into amd-integration from perf/broadphase-stack
Conversation

@gpinkert

@gpinkert gpinkert commented May 1, 2026

Misc perf updates

@gpinkert gpinkert force-pushed the perf/broadphase-stack branch 3 times, most recently from 0f5e425 to 13dda1f on May 1, 2026 18:17
@gpinkert
Author

gpinkert commented May 1, 2026

/run-ci

@gpinkert gpinkert force-pushed the perf/broadphase-stack branch 2 times, most recently from 1f086d9 to 371c8f2 on May 1, 2026 20:32
@gpinkert
Author

gpinkert commented May 1, 2026

/run-ci

# Filter out collision pairs that are involved in dynamically registered weld equality constraints.
# Use hoisted n_eq_static / n_eq_dyn so that envs with no dynamic equalities
# skip the per-pair scan entirely (no scattered qd_n_equalities load per pair).
if n_eq_dyn > n_eq_static:
Can you explain why this check is required? It seems redundant to me, but I might be missing something. The way I see it, you've hoisted the load, and both n_eq_dyn and n_eq_static should already be in registers.

Author


To clarify, you mean why not just check for this condition before executing the kernel?


No, as in why have the additional if n_eq_dyn > n_eq_static check? Wouldn't the range bounds already prevent the loop body from executing when n_eq_dyn <= n_eq_static?

Why do:

if n_eq_dyn > n_eq_static:
  for i_eq in range(n_eq_static, n_eq_dyn):
    # Loop body

instead of:

for i_eq in range(n_eq_static, n_eq_dyn):
   # Loop body
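
For what it's worth, plain Python agrees with this reading: range(start, stop) yields nothing when stop <= start, so the guard is not needed for correctness. A tiny sketch (scan_dynamic_welds is a hypothetical stand-in for the loop body, not the real function):

```python
# range(start, stop) is empty when stop <= start, so the loop body
# simply never runs when there are no dynamic equalities.
def scan_dynamic_welds(n_eq_static, n_eq_dyn):
    visited = []
    for i_eq in range(n_eq_static, n_eq_dyn):
        visited.append(i_eq)
    return visited

print(scan_dynamic_welds(4, 4))  # no dynamic equalities -> []
print(scan_dynamic_welds(4, 7))  # three dynamic equalities -> [4, 5, 6]
```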

Comment thread genesis/engine/solvers/rigid/collider/broadphase.py Outdated
@gpinkert gpinkert force-pushed the perf/broadphase-stack branch 3 times, most recently from 9a8a646 to 5a95ebc on May 1, 2026 23:01

@kevinjosephamd kevinjosephamd left a comment


Reviewed the first 3 commits. I took a quick glance at commit 4; it looks like a typical thread-coarsening optimization. As for the AABB vectorization, I'll defer to someone with more expertise to review it.

@yaoliu13
Collaborator

yaoliu13 commented May 2, 2026

1362802 and 5078

@yaoliu13 yaoliu13 force-pushed the perf/broadphase-stack branch from 5a95ebc to 68dd67c on May 3, 2026 06:44
@yaoliu13
Collaborator

yaoliu13 commented May 3, 2026

/run-ci

Collaborator

@yaoliu13 yaoliu13 left a comment


This PR has merge conflicts with amd-integration. Please read the Confluence page "Genesis PR Review Process".

@gpinkert gpinkert force-pushed the perf/broadphase-stack branch from 68dd67c to f56136e on May 5, 2026 04:27
@yaoliu13 yaoliu13 force-pushed the perf/broadphase-stack branch from f56136e to b767637 on May 5, 2026 10:05
@yaoliu13
Collaborator

yaoliu13 commented May 5, 2026

/run-ci

@yaoliu13
Collaborator

yaoliu13 commented May 6, 2026

1424101.9 and 4,732.00

@yaoliu13 yaoliu13 force-pushed the perf/broadphase-stack branch from b767637 to 8ebf5c6 on May 6, 2026 00:38
@yaoliu13
Collaborator

yaoliu13 commented May 6, 2026

/run-ci

@yaoliu13 yaoliu13 force-pushed the perf/broadphase-stack branch from 8ebf5c6 to 21cfb1e on May 6, 2026 06:35
@yaoliu13
Collaborator

yaoliu13 commented May 6, 2026

/run-ci

1 similar comment
@yaoliu13
Collaborator

yaoliu13 commented May 6, 2026

/run-ci

@yaoliu13
Collaborator

yaoliu13 commented May 7, 2026

1766253.8 and 3,932.00

@yaoliu13
Collaborator

yaoliu13 commented May 7, 2026

This PR reduces kernel launches, scattered loads, and SoA streams in broadphase.

Claude (perf agent) added 2 commits May 6, 2026 22:42
Problem: func_collision_clear was a self-contained kernel with its own
qd.loop_config / per-env outer loop. Because it ran as a separate
kernel, each broadphase step paid one extra HIP dispatch and a
contact_data RMW round-trip through global memory between the clear
kernel and the main sweep kernel.

Fix: collapse the clear into the same kernel as the SAP sweep so the
contact_data writes land in cache and are immediately consumed by the
sweep, with no inter-kernel synchronization point.

Implementation: refactor func_collision_clear into a per-env helper
func_collision_clear_per_env(i_b, ...) that drops the outer loop and
qd.loop_config. Call it from the top of func_broad_phase's existing
per-env loop body. Drop the now-unused import in collider.py.

Eliminates one HIP dispatch per simulation step and keeps the cleared
contact state in cache for the immediately-following SAP sweep.
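
As a rough illustration in plain Python (not the real kernel DSL; n_contacts is a hypothetical stand-in for the contact_data fields being cleared), the fusion amounts to:

```python
# Sketch of the kernel fusion under a simplified model.
def func_collision_clear_per_env(i_b, n_contacts):
    # Formerly a self-contained kernel with its own per-env outer loop;
    # now a helper that clears a single env's contact state.
    n_contacts[i_b] = 0

def func_broad_phase(n_envs, n_contacts):
    for i_b in range(n_envs):                          # existing per-env loop
        func_collision_clear_per_env(i_b, n_contacts)  # fused clear
        # ... SAP sweep for env i_b consumes the just-cleared state ...

counts = [7, 3]
func_broad_phase(2, counts)
print(counts)  # -> [0, 0]
```

On the GPU the win is that the clear and the sweep now share one dispatch, so there is no inter-kernel synchronization or global-memory round-trip between them.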
Problem: func_check_collision_valid is invoked once per candidate
broad pair, so it runs in the inner loop of the SAP sweep. Inside, it
loops `for i_eq in range(n_equalities[None], qd_n_equalities[i_b])`
to filter pairs against dynamically-registered weld equalities. The
qd_n_equalities[i_b] read is a scattered per-env i32 load with
~50-cycle latency, paid on every call. When no dynamic equalities are
registered (the common case for typical scenes), the loop has zero
iterations but the load still happens.

Fix: hoist both equality bounds to once-per-env so the per-pair
check sees them as register-resident scalars. The per-pair loop
then becomes `for i_eq in range(n_eq_static, n_eq_dyn)`, which is
naturally a no-op when there are no dynamic equalities (n_eq_dyn ==
n_eq_static).

Implementation: add n_eq_static and n_eq_dyn parameters to
func_check_collision_valid. Compute them once at the top of
func_broad_phase's per-env loop. Update all three call sites.
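
A minimal sketch of the hoisting, in plain Python rather than the qd kernel DSL; check_pair_valid, broad_phase_env, and eq_obj_pairs are hypothetical stand-ins for the real structures:

```python
# The per-pair check receives the bounds as plain scalars; the range is
# naturally empty when n_eq_dyn == n_eq_static (no dynamic equalities).
def check_pair_valid(pair, n_eq_static, n_eq_dyn, eq_obj_pairs):
    for i_eq in range(n_eq_static, n_eq_dyn):
        if eq_obj_pairs[i_eq] == pair:
            return False  # pair is joined by a dynamic weld; filter it out
    return True

def broad_phase_env(i_b, pairs, n_equalities, qd_n_equalities, eq_obj_pairs):
    # Hoisted once per env instead of loaded once per candidate pair.
    n_eq_static = n_equalities
    n_eq_dyn = qd_n_equalities[i_b]
    return [p for p in pairs
            if check_pair_valid(p, n_eq_static, n_eq_dyn, eq_obj_pairs)]

# One dynamic weld on pair (0, 1): that pair is filtered, (4, 5) survives.
kept = broad_phase_env(0, [(0, 1), (4, 5)], 1, [2], [(2, 3), (0, 1)])
print(kept)  # -> [(4, 5)]
```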
Claude (perf agent) and others added 2 commits May 6, 2026 22:42
Problem: sort_buffer was three Structure-of-Arrays fields -- value
(fp32), i_g (i32), and is_max (u1/bool). Every iteration of the
insertion sort and SAP sweep touches all three fields per element.
With three independent global arrays this means three distinct
cache-line streams and three separate load instructions per element
in the hot inner loops.

Fix: pack i_g and is_max into a single field where bit 0 = is_max and
bits 1..31 = i_g. The fp32 value field is left untouched.

The packed field is typed u32 (not the project-default i32) for two
reasons:
 * Range parity: with u32 we get bits 1..31 = 31 bits of i_g, matching
   the original signed-i32 non-negative range (~2.1 B values). With
   signed i32 the encoding would have stolen the sign bit too, halving
   the range to ~1.07 B. Either ceiling is far above any realistic
   geom count, but recovering full parity is one dtype choice.
 * Shift semantics: u32 right-shift is unambiguously logical
   (zero-fill), so the unpack helper does not depend on the high bit
   being clear. This makes the encoding robust if anyone ever lifts
   the i_g range to use bits up through 31.

Defensive layer to make misuse loud rather than silent:
 * Three @qd.func helpers in broadphase.py are the single source of
   truth for the encoding:
     func_pack_event(i_g, is_max) -> u32
     func_unpack_i_g(packed)      -> i32
     func_unpack_is_max(packed)   -> bool
   All readers/writers of sort_buffer.i_g_packed go through them. Any
   future bit-layout change is one edit, not thirty. The module-level
   comment block above the helpers documents the layout invariant.
 * StructSortBuffer's docstring states the bit layout, dtype rationale,
   and the rule that callers must use the helpers (not raw bit ops).
 * get_sort_buffer asserts solver.n_geoms_ <= SORT_BUFFER_MAX_GEOMS
   (= 2**31 - 1) at scene-build time. Overflow would silently corrupt
   the is_max bit during packing; the assert turns that into a loud
   failure with an actionable message ("promote to qd.u64").

Implementation: replace StructSortBuffer.{i_g, is_max} with
i_g_packed: u32 in array_class.py. Add the three pack/unpack helpers
plus a module-level layout comment in broadphase.py. Update every
sort_buffer.{i_g, is_max} access site to call the helpers. In the
hibernation branch, hoist the previously-redundant inline reloads of
sort_buffer.i_g_packed[i, i_b] into a single 'packed' local at the
top of the loop body so each iteration loads the field exactly once.
Leave value as a separate fp32 field so the hibernation path in
forward_kinematics.py that writes sort_buffer.value continues to
work without changes.

Reduces the sort_buffer SoA stream count from 3 to 2 in the inner
loops.
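
The encoding described above can be modeled in plain Python as follows; the real helpers are qd device functions, and only the bit math is mirrored here:

```python
# Bit layout per the commit message: bit 0 = is_max, bits 1..31 = i_g,
# stored in an unsigned 32-bit word.
U32_MASK = 0xFFFFFFFF
SORT_BUFFER_MAX_GEOMS = 2**31 - 1

def func_pack_event(i_g, is_max):
    # Overflow past bits 1..31 would silently corrupt the is_max bit,
    # hence the build-time assert described above.
    assert 0 <= i_g <= SORT_BUFFER_MAX_GEOMS
    return ((i_g << 1) | int(is_max)) & U32_MASK

def func_unpack_i_g(packed):
    return packed >> 1  # logical (zero-fill) shift on an unsigned word

def func_unpack_is_max(packed):
    return bool(packed & 1)

p = func_pack_event(12345, True)
print(func_unpack_i_g(p), func_unpack_is_max(p))  # -> 12345 True
```

The round-trip holds across the full range, including i_g = 2**31 - 1 with the high bit of the packed word set, which is exactly the case the u32 logical-shift rationale covers.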
…ve phase)

Adapts perf/broadphase-stack commits 1+2+3 to live on top of PR #56's LDS
infrastructure. Cooperative parallel re-fill (the original commit 4) is
intentionally excluded for this experiment because adding qd.simt.block.sync()
to a loop that doesn't fill block_dim=64 (e.g. n_envs=1 -> 4 active lanes)
triggers a workgroup-barrier UB and a GPU memory-access fault.

Changes inside func_broad_phase_lds:
* Replace the now-removed func_collision_clear() kernel call with the per-env
  helper func_collision_clear_per_env(i_b, ...).
* Hoist n_eq_static / n_eq_dyn at the top of the per-env block and pass them
  to all 3 func_check_collision_valid call sites in this path.
* Switch lds_sort_packed dtype from gs.qd_int to qd.u32 to match the
  StructSortBuffer.i_g_packed encoding contract.
* Replace open-coded (i_g << 1) / >> 1 / & 1 patterns with func_pack_event /
  func_unpack_i_g / func_unpack_is_max throughout.
* Fix the post-cherry-pick broken sort_buffer.{is_max, i_g} refs (8 sites)
  to use the renamed sort_buffer.i_g_packed field via helpers.
* Collapse the write-back loop from 2 stores per event (separate i_g + is_max)
  to 1 store (i_g_packed).
@yaoliu13 yaoliu13 force-pushed the perf/broadphase-stack branch from 21cfb1e to 926924a on May 7, 2026 05:42
@yaoliu13
Collaborator

yaoliu13 commented May 7, 2026

/run-ci

1 similar comment
@yaoliu13
Collaborator

yaoliu13 commented May 7, 2026

/run-ci

3 participants