broadphase improvements #60
Conversation
Force-pushed 0f5e425 to 13dda1f
/run-ci
Force-pushed 1f086d9 to 371c8f2
/run-ci
    # Filter out collision pairs that are involved in dynamically registered weld equality constraints.
    # Use hoisted n_eq_static / n_eq_dyn so that envs with no dynamic equalities
    # skip the per-pair scan entirely (no scattered qd_n_equalities load per pair).
    if n_eq_dyn > n_eq_static:
Can you explain why this check is required? It seems redundant to me, but I might be missing something. The way I see it, you've hoisted the load, and both n_eq_dyn and n_eq_static should already be in registers.
To clarify, you mean why not just check for this condition before executing the kernel?
No, as in: why have the additional if n_eq_dyn > n_eq_static check? Wouldn't the range bounds already prevent the loop body from executing when n_eq_dyn <= n_eq_static?
Why do:

    if n_eq_dyn > n_eq_static:
        for i_eq in range(n_eq_static, n_eq_dyn):
            # Loop body

instead of:

    for i_eq in range(n_eq_static, n_eq_dyn):
        # Loop body
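The reviewer's point can be checked directly: Python's range(start, stop) yields zero iterations whenever stop <= start, so the guard cannot change behavior. A minimal sketch (the function bodies are stand-ins, not the actual kernel code):

```python
# range(start, stop) is empty when stop <= start, so the explicit guard
# is not needed for correctness; both variants behave identically.

def filter_pairs_guarded(n_eq_static, n_eq_dyn):
    checked = []
    if n_eq_dyn > n_eq_static:                     # explicit guard
        for i_eq in range(n_eq_static, n_eq_dyn):
            checked.append(i_eq)
    return checked

def filter_pairs_unguarded(n_eq_static, n_eq_dyn):
    checked = []
    for i_eq in range(n_eq_static, n_eq_dyn):      # empty range -> zero iterations
        checked.append(i_eq)
    return checked

assert filter_pairs_guarded(3, 3) == filter_pairs_unguarded(3, 3) == []
assert filter_pairs_guarded(3, 7) == filter_pairs_unguarded(3, 7) == [3, 4, 5, 6]
```

At most the guard skips the (cheap) loop setup; whether that matters on the GPU backend is a separate question from correctness.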
Force-pushed 9a8a646 to 5a95ebc
kevinjosephamd
left a comment
Reviewed the first 3 commits. Took a quick glance at commit 4; it looks like a typical thread-coarsening optimization. As for the AABB vectorization, I'll defer to someone else with more expertise to review.
Force-pushed 5a95ebc to 68dd67c
/run-ci
yaoliu13
left a comment
This PR has conflicts if we want to merge it to amd-integration. Please read the Confluence page "Genesis PR Review Process".
Force-pushed 68dd67c to f56136e
Force-pushed f56136e to b767637
/run-ci
Force-pushed b767637 to 8ebf5c6
/run-ci
Force-pushed 8ebf5c6 to 21cfb1e
/run-ci
/run-ci
This PR reduces kernel launches, scattered loads, and SoA streams in broadphase.
Problem: func_collision_clear was a self-contained kernel with its own qd.loop_config / per-env outer loop. Because it ran as a separate kernel, each broadphase step paid one extra HIP dispatch and a contact_data RMW round-trip through global memory between the clear kernel and the main sweep kernel.

Fix: collapse the clear into the same kernel as the SAP sweep so the contact_data writes land in cache and are immediately consumed by the sweep, with no inter-kernel synchronization point.

Implementation: refactor func_collision_clear into a per-env helper func_collision_clear_per_env(i_b, ...) that drops the outer loop and qd.loop_config. Call it from the top of func_broad_phase's existing per-env loop body. Drop the now-unused import in collider.py.

Eliminates one HIP dispatch per simulation step and keeps the cleared contact state in cache for the immediately-following SAP sweep.
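The shape of the fusion can be sketched in plain Python. This is a hedged illustration, not the kernel code: the data structures (n_contacts, events, pairs_out) and the sweep body are invented stand-ins; only the two function names come from the commit.

```python
# Sketch: the standalone clear kernel becomes a per-env helper called at
# the top of the sweep kernel's existing per-env loop, so both passes
# share one launch and no inter-kernel sync point sits between them.

def func_collision_clear_per_env(i_b, n_contacts):
    # Per-env helper: no outer env loop, no loop_config of its own.
    n_contacts[i_b] = 0

def func_broad_phase(n_envs, n_contacts, events, pairs_out):
    for i_b in range(n_envs):                         # one kernel, one per-env loop
        func_collision_clear_per_env(i_b, n_contacts)  # fused clear
        # The sweep consumes the cleared state immediately.
        for a, b in events[i_b]:
            pairs_out[i_b].append((a, b))
            n_contacts[i_b] += 1

n_contacts = [99, 99]                                 # stale counts from last step
pairs_out = [[], []]
events = [[(0, 1)], [(2, 3), (4, 5)]]
func_broad_phase(2, n_contacts, events, pairs_out)
assert n_contacts == [1, 2]
```

In the real kernel the payoff is the removed HIP dispatch and the contact_data round-trip through global memory; the Python version only shows the control-flow restructuring.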
Problem: func_check_collision_valid is invoked once per candidate broad pair, so it runs in the inner loop of the SAP sweep. Inside, it loops for i_eq in range(n_equalities[None], qd_n_equalities[i_b]) to filter pairs against dynamically-registered weld equalities. The qd_n_equalities[i_b] read is a scattered per-env i32 load with ~50-cycle latency, paid on every call. When no dynamic equalities are registered (the common case for typical scenes), the loop has zero iterations but the load still happens.

Fix: hoist both equality bounds to once-per-env so the per-pair check sees them as register-resident scalars. The per-pair loop then becomes for i_eq in range(n_eq_static, n_eq_dyn), which is naturally a no-op when there are no dynamic equalities (n_eq_dyn == n_eq_static).

Implementation: add n_eq_static and n_eq_dyn parameters to func_check_collision_valid. Compute them once at the top of func_broad_phase's per-env loop. Update all three call sites.
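The load-count difference can be made concrete with a small plain-Python model. CountingArray and the sweep bodies are invented for illustration; it tallies element reads as a stand-in for the scattered per-env global load described above.

```python
# Model: hoisting a per-env bound out of the per-pair inner loop turns
# one scattered load per candidate pair into one load per env.

class CountingArray:
    def __init__(self, data):
        self.data = list(data)
        self.loads = 0
    def __getitem__(self, i):
        self.loads += 1            # count every element read
        return self.data[i]

def sweep_unhoisted(i_b, pairs, n_eq_static, qd_n_equalities):
    for _pair in pairs:
        # One scattered load per candidate pair, even when the range is empty.
        for i_eq in range(n_eq_static, qd_n_equalities[i_b]):
            pass                   # equality filtering would happen here

def sweep_hoisted(i_b, pairs, n_eq_static, qd_n_equalities):
    n_eq_dyn = qd_n_equalities[i_b]   # single load, then register-resident
    for _pair in pairs:
        for i_eq in range(n_eq_static, n_eq_dyn):
            pass

arr = CountingArray([2])           # env 0: no dynamic equalities beyond static
sweep_unhoisted(0, range(1000), 2, arr)
unhoisted_loads = arr.loads        # one load per candidate pair
arr.loads = 0
sweep_hoisted(0, range(1000), 2, arr)
hoisted_loads = arr.loads          # one load per env
assert (unhoisted_loads, hoisted_loads) == (1000, 1)
```

The ~50-cycle latency figure from the commit message is per scattered load, so on the common zero-dynamic-equality path the hoist removes essentially all of that cost from the inner loop.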
Problem: sort_buffer was three Structure-of-Arrays fields -- value
(fp32), i_g (i32), and is_max (u1/bool). Every iteration of the
insertion sort and SAP sweep touches all three fields per element.
With three independent global arrays this means three distinct
cache-line streams and three separate load instructions per element
in the hot inner loops.
Fix: pack i_g and is_max into a single field where bit 0 = is_max and
bits 1..31 = i_g. The fp32 value field is left untouched.
The packed field is typed u32 (not the project-default i32) for two
reasons:
* Range parity: with u32 we get bits 1..31 = 31 bits of i_g, matching
the original signed-i32 non-negative range (~2.1 B values). With
signed i32 the encoding would have stolen the sign bit too, halving
the range to ~1.07 B. Either ceiling is far above any realistic
geom count, but recovering full parity is one dtype choice.
* Shift semantics: u32 right-shift is unambiguously logical
(zero-fill), so the unpack helper does not depend on the high bit
being clear. This makes the encoding robust if anyone ever lifts
the i_g range to use bits up through 31.
Defensive layer to make misuse loud rather than silent:
* Three @qd.func helpers in broadphase.py are the single source of
truth for the encoding:
func_pack_event(i_g, is_max) -> u32
func_unpack_i_g(packed) -> i32
func_unpack_is_max(packed) -> bool
All readers/writers of sort_buffer.i_g_packed go through them. Any
future bit-layout change is one edit, not thirty. The module-level
comment block above the helpers documents the layout invariant.
* StructSortBuffer's docstring states the bit layout, dtype rationale,
and the rule that callers must use the helpers (not raw bit ops).
* get_sort_buffer asserts solver.n_geoms_ <= SORT_BUFFER_MAX_GEOMS
(= 2**31 - 1) at scene-build time. Overflow would silently corrupt
the is_max bit during packing; the assert turns that into a loud
failure with an actionable message ("promote to qd.u64").
Implementation: replace StructSortBuffer.{i_g, is_max} with
i_g_packed: u32 in array_class.py. Add the three pack/unpack helpers
plus a module-level layout comment in broadphase.py. Update every
sort_buffer.{i_g, is_max} access site to call the helpers. In the
hibernation branch, hoist the previously-redundant inline reloads of
sort_buffer.i_g_packed[i, i_b] into a single 'packed' local at the
top of the loop body so each iteration loads the field exactly once.
Leave value as a separate fp32 field so the hibernation path in
forward_kinematics.py that writes sort_buffer.value continues to
work without changes.
Reduces the sort_buffer SoA stream count from 3 to 2 in the inner
loops.
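The documented encoding contract can be mirrored in plain Python. The helper names and the bit layout come from the commit message; the bodies here are illustrative stand-ins for the real @qd.func device functions (Python ints are unbounded, so a u32 mask makes the 32-bit semantics explicit).

```python
# Layout per the contract: bit 0 = is_max flag, bits 1..31 = i_g.

SORT_BUFFER_MAX_GEOMS = 2**31 - 1  # largest i_g representable in bits 1..31
U32_MASK = 0xFFFFFFFF

def func_pack_event(i_g, is_max):
    # Mirrors the build-time assert: overflow would corrupt the is_max bit.
    assert 0 <= i_g <= SORT_BUFFER_MAX_GEOMS, "promote to u64"
    return ((i_g << 1) | (1 if is_max else 0)) & U32_MASK

def func_unpack_i_g(packed):
    # Logical (zero-fill) shift, as u32 semantics guarantee: the result
    # never sign-extends even when bit 31 of i_g is in use.
    return packed >> 1

def func_unpack_is_max(packed):
    return bool(packed & 1)

# Round-trip every combination, including the largest representable index.
for i_g in (0, 1, 12345, SORT_BUFFER_MAX_GEOMS):
    for is_max in (False, True):
        p = func_pack_event(i_g, is_max)
        assert func_unpack_i_g(p) == i_g
        assert func_unpack_is_max(p) == is_max
```

The round-trip over SORT_BUFFER_MAX_GEOMS also demonstrates the range-parity argument: 31 payload bits cover the full non-negative signed-i32 range, which a signed-i32 packed field would have halved.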
…ve phase) Adapts perf/broadphase-stack commits 1+2+3 to live on top of PR #56's LDS infrastructure. Cooperative parallel re-fill (the original commit 4) is intentionally excluded for this experiment because adding qd.simt.block.sync() to a loop that doesn't fill block_dim=64 (e.g. n_envs=1 -> 4 active lanes) triggers a workgroup-barrier UB and a GPU memory-access fault.

Changes inside func_broad_phase_lds:
* Replace the now-removed func_collision_clear() kernel call with the per-env helper func_collision_clear_per_env(i_b, ...).
* Hoist n_eq_static / n_eq_dyn at the top of the per-env block and pass them to all 3 func_check_collision_valid call sites in this path.
* Switch lds_sort_packed dtype from gs.qd_int to qd.u32 to match the StructSortBuffer.i_g_packed encoding contract.
* Replace open-coded (i_g << 1) / >> 1 / & 1 patterns with func_pack_event / func_unpack_i_g / func_unpack_is_max throughout.
* Fix the post-cherry-pick broken sort_buffer.{is_max, i_g} refs (8 sites) to use the renamed sort_buffer.i_g_packed field via helpers.
* Collapse the write-back loop from 2 stores per event (separate i_g + is_max) to 1 store (i_g_packed).
Force-pushed 21cfb1e to 926924a
/run-ci
/run-ci
Misc perf updates