
perf(amdgpu): skip arg-buffer H2D when host bytes match prior launch#28

Open
lohiaj wants to merge 1 commit into amd-integration from perf/jlohia/launcher-skip-redundant-h2d

Conversation

@lohiaj

@lohiaj lohiaj commented May 6, 2026

Summary

The AMDGPU kernel launcher previously copied the host arg buffer to a
shared thread-local device buffer on every launch. For long-running
simulations where the same kernel is repeatedly called with byte-identical
arg buffers (the common per-step pattern), almost every one of these
copies is redundant.

Replace the shared thread-local buffer with a per-handle device
buffer + cached host bytes. On each launch, memcmp the current host
arg buffer against the cached bytes for THIS handle and skip the H2D
when they match. Per-handle ownership of the device buffer ensures no
other handle can overwrite its bytes between calls.
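The skip logic can be sketched as follows. This is a minimal, self-contained model of the technique, not the PR's actual code: the names (`LauncherCtx`, `dev_arg_buf`, `last_host_arg_buf`) are illustrative, and the device copy is simulated with `memcpy` where the real launcher would issue an H2D transfer.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical per-handle launcher context (names are illustrative).
struct LauncherCtx {
  std::vector<uint8_t> dev_arg_buf;       // stands in for the device allocation
  std::vector<uint8_t> last_host_arg_buf; // cached host bytes from the prior launch
  size_t h2d_copies = 0;                  // instrumentation for this sketch only
};

// Returns true if the H2D copy was actually performed.
bool launch(LauncherCtx& ctx, const uint8_t* host_args, size_t n) {
  // Cache hit: same size and byte-identical contents -> skip the H2D.
  // dev_arg_buf still holds exactly these bytes because this handle owns it.
  if (ctx.last_host_arg_buf.size() == n &&
      std::memcmp(ctx.last_host_arg_buf.data(), host_args, n) == 0) {
    return false;
  }
  // Cache miss: grow the device buffer if needed; growing invalidates the cache.
  if (ctx.dev_arg_buf.size() < n) {
    ctx.dev_arg_buf.resize(n);
    ctx.last_host_arg_buf.clear();
  }
  std::memcpy(ctx.dev_arg_buf.data(), host_args, n);  // real code: H2D enqueue
  ctx.last_host_arg_buf.assign(host_args, host_args + n);
  ++ctx.h2d_copies;
  return true;
}
```

On the first launch the cache is empty, so the comparison fails and the copy runs; repeated launches with byte-identical args return without touching the device.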

Correctness

  • Cache miss: behavior is identical to before (H2D, then update cache).
  • Cache hit: the device buffer still contains the exact cached bytes
    because it's owned by this handle alone, so the kernel sees the same
    arg buffer it would have seen with the H2D.
  • First launch of each handle: cache is empty, comparison fails, H2D runs.
  • Resize past capacity clears the cache, forcing the next H2D.

Verified bit-identical state evolution over a 2000-step rigid-body sim
on AMDGPU vs the unmodified launcher (qpos and qvel match to full FP32
precision).

Measurement

3-trial paired-interleaved bench, canonical benchmark script:

| | base | with this patch |
|---|---|---|
| median throughput | (baseline) | +14.6% |
| throughput stdev across trials | 0.6% | 0.3% |
| total kernel time (`rocprofv3 --kernel-trace --stats`) | (baseline) | -0.7% |

Throughput delta is much larger than the kernel-time delta because the
saved H2D enqueues weren't blocking the GPU's kernel execution path —
they were eating Python+driver wall time and serializing the launch
stream.

Memory cost

Per-handle device buffer: ~256 B typical × ~50-100 handles = ~25 KB
device memory total. Negligible.

Test plan

  • Existing per-launch behavior preserved on first launch + on resize
  • Bit-identical sim state vs unmodified launcher over 2000 steps
  • Local paired bench (3 trials, interleaved): +14.6% throughput,
    -0.7% total kernel time, both passing the perf gate
  • Pre-submit pipeline run (will trigger via PR open / /run-ci)

@yaoliu13
Collaborator

yaoliu13 commented May 6, 2026

/run-ci

(3 similar `/run-ci` comments from @yaoliu13 on May 7, 2026)

@diptorupd
Collaborator

@lohiaj please fix the linter issues and remove superfluous formatting changes.

The kernel launcher previously memcpy_h2d'd the host arg buffer to a
shared thread-local device buffer on every launch. For long-running
simulations where the same kernel is repeatedly called with the same
ndarray pointers and scene-static scalars, almost every one of these
copies is redundant.

Replace the shared thread_local persistent buffer with a per-handle
device buffer + cached host bytes. On each launch, memcmp the current
host arg buffer against the cached bytes for THIS handle; skip the H2D
when they match. Per-handle ownership ensures no other handle can
overwrite the device bytes between calls of the same handle.

Correctness:
- When the cache misses, behavior is identical to before (do the H2D,
  update the cache).
- When the cache hits, the device buffer still contains the cached
  bytes verbatim because it is owned by this handle alone, so the
  kernel sees the same arg buffer it would have seen with the H2D.
- First launch of each handle: cache is empty, so the comparison fails
  and the H2D runs.
- Resize past capacity clears the cache, forcing the next H2D.

Memory cost: per-handle device buffer ~256 bytes typical, ~50-100
handles in a real workload, ~25 KB device + same on host. Negligible.
@lohiaj lohiaj force-pushed the perf/jlohia/launcher-skip-redundant-h2d branch from 2cb7fcd to 54dda98 Compare May 8, 2026 17:06
@lohiaj
Author

lohiaj commented May 8, 2026

@diptorupd thanks, fixed in 54dda9803. Two changes:

  1. Dropped the `chore: clang-format kernel_launcher.cpp` commit in its entirety; that was exactly the formatting noise you flagged. It was reformatting pre-existing lines that the perf change doesn't touch (the launch_offloaded_tasks_with_do_while signature, the !on_amdgpu_device block at line 178, the two branch_counts->kNone_*.fetch_add calls, the device_ptrs = std::make_unique<...> declaration, and the memcpy_host_to_device call). All reverted; the perf commit no longer touches any of those lines.

  2. Reformatted my own new skip_h2d block to match the project's .clang-format (Chromium-based, clang-format-19 per .pre-commit-config.yaml). Verified clean via clang-format-19 --style=file --dry-run --Werror --lines=... on the lines I added.

PR diff is now a single commit, 33+/11- across two files, only substantive changes. Pre-existing clang-format complaints on the file (lines 100, 178, 216, 222, 232, 238) are left alone; they're not in this PR's diff.

Ready for another look.

@lohiaj
Author

lohiaj commented May 8, 2026

Quick follow-up on the new Linters run:

The Linters check is still red, but the failures are on pre-existing lines I deliberately did NOT touch (lines 100, 178, 216, 222, 232, 238 of kernel_launcher.cpp, none of which my change modifies). Verified my diff is clean:

clang-format-19 --style=file --dry-run --Werror \
    --lines=8:8 --lines=281:286 --lines=288:290 \
    --lines=299:302 --lines=304:314 \
    quadrants/runtime/amdgpu/kernel_launcher.cpp
# exit=0

This is the same pre-existing-Linters infra issue we hit on PR #15 (deepsek's launch-overhead) and PR #26 (subgroup primitives). Linters fails on amd-integration-baseline regardless of the PR's diff because the same set of files (Dockerfile.rocm, codegen_amdgpu.cpp, compile_config.cpp, runtime.cpp, test_fn_attrs.py, etc.) have format issues that pre-date this PR. PR #15 merged through it.

@diptorupd
Collaborator

> This is the same pre-existing-Linters infra issue we hit on PR #15 (deepsek's launch-overhead) and PR #26 (subgroup primitives). Linters fails on amd-integration-baseline regardless of the PR's diff because the same set of files (Dockerfile.rocm, codegen_amdgpu.cpp, compile_config.cpp, runtime.cpp, test_fn_attrs.py, etc.) have format issues that pre-date this PR. PR #15 merged through it.

Yes, I am fixing all these in #29. Let me get that merged and then you can rebase.

@diptorupd
Collaborator

@lohiaj Please rebase on top of amd-integration. I just merged #29 and cleaned up all the linter issues

@diptorupd
Collaborator

@lohiaj I understand the intent and for the specific Genesis workload these changes make sense and are correct.

I have some concerns about the change being incorrect for other scenarios and introducing brittleness.

  1. The change is predicated on the assumption of a single default in-order stream. The skip is correct because device-resident state evolves under stream ordering on the default nullptr stream. The moment any kernel launch moves to a non-default stream — or runs concurrently with another stream that touches the same device arg buffer — there is a chance of subtle concurrency issues. The persistent dev_arg_buf is now shared across launches, so an in-flight kernel reading from it while the next launch's H2D writes the same buffer would need explicit synchronization. The old design didn't have this hazard.

  2. HIP graph capture. Current workload does not use HIP graphs, but some other use case might require turning on HIP Graphs and I was thinking of a potential issue that might occur there. During capture, the first launch in a sequence records both the H2D and the kernel as graph nodes. On the second launch within the same capture, the skip would fire and the H2D node would be missing from the captured graph. In current Genesis example, where every launch sees the same arg-buffer bytes anyway, this happens to be harmless. The bug surfaces in any captured sequence where the host wants different arg-buffer bytes for different sub-launches: only the first kernel sees the (re-)uploaded bytes, and the remaining kernels read whatever the first H2D left in dev_arg_buf. There's no error or assert — just numerically wrong results that are hard to localize, because the failure is structurally far away from this code.

  3. I would want to understand the motivation for moving the launch state out of thread_local. By removing thread_local for persistent_dev_arg_buf / persistent_dev_arg_buf_cap, the PR moves these (plus the new last_host_arg_buf) into launcher_ctx, which is shared per handle. If two host threads ever launch the same handle concurrently, they now race on dev_arg_buf, dev_arg_buf_cap, and last_host_arg_buf. I don't know whether the current call sites do that, but if the answer is "no, it's externally serialized," that contract should be stated explicitly — the old code didn't need it.
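The graph-capture concern in point 2 can be illustrated with a toy model. This is not the HIP API: "capture" here records closures instead of executing them, and `Graph`/`Handle` are invented names. The point is structural — the skip decision is made at capture time on host bytes, so a skipped H2D never becomes a graph node.

```cpp
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Toy stand-ins for a captured graph and a per-handle launcher context.
struct Graph { std::vector<std::function<void()>> nodes; };

struct Handle {
  std::vector<uint8_t> dev_arg_buf{0, 0};  // simulated device arg buffer
  std::vector<uint8_t> last_host;          // cached host bytes
};

// "Launch" under capture: each operation is recorded as a graph node.
void capture_launch(Handle& h, Graph& g, std::vector<uint8_t> host_args,
                    std::vector<uint8_t>& kernel_saw) {
  if (h.last_host != host_args) {  // cache miss: record the H2D as a node
    g.nodes.push_back([&h, host_args] {
      std::memcpy(h.dev_arg_buf.data(), host_args.data(), host_args.size());
    });
    h.last_host = host_args;
  }
  // Record the kernel node (here it just reads the device arg buffer).
  g.nodes.push_back([&h, &kernel_saw] { kernel_saw = h.dev_arg_buf; });
}
```

Capturing two launches with identical bytes yields three nodes (one H2D, two kernels) rather than four: the second H2D node simply never exists in the graph, so any later attempt to give the two sub-launches different arg bytes has nothing to update.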

