perf(amdgpu): skip arg-buffer H2D when host bytes match prior launch#28
perf(amdgpu): skip arg-buffer H2D when host bytes match prior launch#28lohiaj wants to merge 1 commit intoamd-integrationfrom
Conversation
|
/run-ci |
3 similar comments
|
/run-ci |
|
/run-ci |
|
/run-ci |
|
@lohiaj please fix the linter issues and remove superfluous formatting changes. |
The kernel launcher previously memcpy_h2d'd the host arg buffer to a shared thread-local device buffer on every launch. For long-running simulations where the same kernel is repeatedly called with the same ndarray pointers and scene-static scalars, almost every one of these copies is redundant. Replace the shared thread_local persistent buffer with a per-handle device buffer + cached host bytes. On each launch, memcmp the current host arg buffer against the cached bytes for THIS handle; skip the H2D when they match. Per-handle ownership ensures no other handle can overwrite the device bytes between calls of the same handle. Correctness: - When the cache misses, behavior is identical to before (do the H2D, update the cache). - When the cache hits, the device buffer still contains the cached bytes verbatim because it is owned by this handle alone, so the kernel sees the same arg buffer it would have seen with the H2D. - First launch of each handle: cache is empty, so the comparison fails and the H2D runs. - Resize past capacity clears the cache, forcing the next H2D. Memory cost: per-handle device buffer ~256 bytes typical, ~50-100 handles in a real workload, ~25 KB device + same on host. Negligible.
2cb7fcd to
54dda98
Compare
|
@diptorupd thanks, fixed in
PR diff is now a single commit, 33+/11- across two files, only substantive changes. Pre-existing clang-format complaints on the file (lines 100, 178, 216, 222, 232, 238) are left alone. they're not in this PR's diff. Ready for another look. |
|
Quick follow-up on the new Linters run: The This is the same pre-existing-Linters infra issue we hit on PR #15 (deepsek's launch-overhead) and PR #26 (subgroup primitives). |
Yes, I am fixing all these in #29. Let me get that merged and then you can rebase. |
|
@lohiaj I understand the intent and for the specific Genesis workload these changes make sense and are correct. I have some concerns about the change being incorrect for other scenarios and introducing brittleness.
|
Summary
The AMDGPU kernel launcher previously copied the host arg buffer to a
shared thread-local device buffer on every launch. For long-running
simulations where the same kernel is repeatedly called with byte-identical
arg buffers (the common per-step pattern), almost every one of these
copies is redundant.
Replace the shared thread-local buffer with a per-handle device
buffer + cached host bytes. On each launch,
memcmpthe current hostarg buffer against the cached bytes for THIS handle and skip the H2D
when they match. Per-handle ownership of the device buffer ensures no
other handle can overwrite its bytes between calls.
Correctness
because it's owned by this handle alone, so the kernel sees the same
arg buffer it would have seen with the H2D.
Verified bit-identical state evolution over a 2000-step rigid-body sim
on AMDGPU vs the unmodified launcher (qpos and qvel match to full FP32
precision).
Measurement
3-trial paired-interleaved bench, canonical benchmark script:
Throughput delta is much larger than the kernel-time delta because the
saved H2D enqueues weren't blocking the GPU's kernel execution path —
they were eating Python+driver wall time and serializing the launch
stream.
Memory cost
Per-handle device buffer: ~256 B typical × ~50-100 handles = ~25 KB
device memory total. Negligible.
Test plan
-0.7% total kernel time, both passing the perf gate
/run-ci)