
perf(amdgpu): skip arg-buffer H2D when host bytes match prior launch#28

Open
lohiaj wants to merge 1 commit into amd-integration from perf/jlohia/launcher-skip-redundant-h2d

Conversation

@lohiaj

@lohiaj lohiaj commented May 6, 2026

Summary

The AMDGPU kernel launcher previously copied the host arg buffer to a
shared thread-local device buffer on every launch. For long-running
simulations where the same kernel is repeatedly called with byte-identical
arg buffers (the common per-step pattern), almost every one of these
copies is redundant.

Replace the shared thread-local buffer with a per-handle device
buffer + cached host bytes. On each launch, memcmp the current host
arg buffer against the cached bytes for THIS handle and skip the H2D
when they match. Per-handle ownership of the device buffer ensures no
other handle can overwrite its bytes between calls.
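The skip logic can be sketched as follows. This is a minimal, self-contained model of the technique, not the PR's actual code: the names (`LauncherCtx`, `dev_arg_buf`, `last_host_arg_buf`) are illustrative, and the device copy is simulated with `memcpy` where the real launcher would issue an H2D transfer.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical per-handle launcher context (names are illustrative).
struct LauncherCtx {
  std::vector<uint8_t> dev_arg_buf;       // stands in for the device allocation
  std::vector<uint8_t> last_host_arg_buf; // cached host bytes from the prior launch
  size_t h2d_copies = 0;                  // instrumentation for this sketch only
};

// Returns true if the H2D copy was actually performed.
bool launch(LauncherCtx& ctx, const uint8_t* host_args, size_t n) {
  // Cache hit: same size and byte-identical contents -> skip the H2D.
  // dev_arg_buf still holds exactly these bytes because this handle owns it.
  if (ctx.last_host_arg_buf.size() == n &&
      std::memcmp(ctx.last_host_arg_buf.data(), host_args, n) == 0) {
    return false;
  }
  // Cache miss: grow the device buffer if needed; growing invalidates the cache.
  if (ctx.dev_arg_buf.size() < n) {
    ctx.dev_arg_buf.resize(n);
    ctx.last_host_arg_buf.clear();
  }
  std::memcpy(ctx.dev_arg_buf.data(), host_args, n);  // real code: H2D enqueue
  ctx.last_host_arg_buf.assign(host_args, host_args + n);
  ++ctx.h2d_copies;
  return true;
}
```

On the first launch the cache is empty, so the comparison fails and the copy runs; repeated launches with byte-identical args return without touching the device.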

Correctness

  • Cache miss: behavior is identical to before (H2D, then update cache).
  • Cache hit: the device buffer still contains the exact cached bytes
    because it's owned by this handle alone, so the kernel sees the same
    arg buffer it would have seen with the H2D.
  • First launch of each handle: cache is empty, comparison fails, H2D runs.
  • Resize past capacity clears the cache, forcing the next H2D.

Verified bit-identical state evolution over a 2000-step rigid-body sim
on AMDGPU vs the unmodified launcher (qpos and qvel match to full FP32
precision).

Measurement

3-trial paired-interleaved bench, canonical benchmark script:

| | base | with this patch |
|---|---|---|
| median throughput | (baseline) | +14.6% |
| throughput stdev across trials | 0.6% | 0.3% |
| total kernel time (`rocprofv3 --kernel-trace --stats`) | (baseline) | -0.7% |

Throughput delta is much larger than the kernel-time delta because the
saved H2D enqueues weren't blocking the GPU's kernel execution path —
they were eating Python+driver wall time and serializing the launch
stream.

Memory cost

Per-handle device buffer: ~256 B typical × ~50-100 handles = ~25 KB
device memory total. Negligible.

Test plan

  • Existing per-launch behavior preserved on first launch + on resize
  • Bit-identical sim state vs unmodified launcher over 2000 steps
  • Local paired bench (3 trials, interleaved): +14.6% throughput,
    -0.7% total kernel time, both passing the perf gate
  • Pre-submit pipeline run (will trigger via PR open / /run-ci)

@yaoliu13
Collaborator

yaoliu13 commented May 6, 2026

/run-ci

(3 similar `/run-ci` comments from @yaoliu13 on May 7, 2026)

@diptorupd
Collaborator

@lohiaj please fix the linter issues and remove superfluous formatting changes.

The kernel launcher previously memcpy_h2d'd the host arg buffer to a
shared thread-local device buffer on every launch. For long-running
simulations where the same kernel is repeatedly called with the same
ndarray pointers and scene-static scalars, almost every one of these
copies is redundant.

Replace the shared thread_local persistent buffer with a per-handle
device buffer + cached host bytes. On each launch, memcmp the current
host arg buffer against the cached bytes for THIS handle; skip the H2D
when they match. Per-handle ownership ensures no other handle can
overwrite the device bytes between calls of the same handle.

Correctness:
- When the cache misses, behavior is identical to before (do the H2D,
  update the cache).
- When the cache hits, the device buffer still contains the cached
  bytes verbatim because it is owned by this handle alone, so the
  kernel sees the same arg buffer it would have seen with the H2D.
- First launch of each handle: cache is empty, so the comparison fails
  and the H2D runs.
- Resize past capacity clears the cache, forcing the next H2D.

Memory cost: per-handle device buffer ~256 bytes typical, ~50-100
handles in a real workload, ~25 KB device + same on host. Negligible.
@lohiaj lohiaj force-pushed the perf/jlohia/launcher-skip-redundant-h2d branch from 2cb7fcd to 54dda98 Compare May 8, 2026 17:06
@lohiaj
Author

lohiaj commented May 8, 2026

@diptorupd thanks, fixed in 54dda9803. Two changes:

  1. Dropped the `chore: clang-format kernel_launcher.cpp` commit in its entirety; that was exactly the formatting noise you flagged. It was reformatting pre-existing lines that the perf change doesn't touch (the launch_offloaded_tasks_with_do_while signature, the !on_amdgpu_device block at line 178, the two branch_counts->kNone_*.fetch_add calls, the device_ptrs = std::make_unique<...> declaration, and the memcpy_host_to_device call). All reverted; the perf commit no longer touches any of those lines.

  2. Reformatted my own new skip_h2d block to match the project's .clang-format (Chromium-based, clang-format-19 per .pre-commit-config.yaml). Verified clean via clang-format-19 --style=file --dry-run --Werror --lines=... on the lines I added.

PR diff is now a single commit, 33+/11- across two files, only substantive changes. Pre-existing clang-format complaints on the file (lines 100, 178, 216, 222, 232, 238) are left alone; they're not in this PR's diff.

Ready for another look.

@lohiaj
Author

lohiaj commented May 8, 2026

Quick follow-up on the new Linters run:

The Linters check is still red, but the failures are on pre-existing lines I deliberately did NOT touch (lines 100, 178, 216, 222, 232, 238 of kernel_launcher.cpp, none of which my change modifies). Verified my diff is clean:

clang-format-19 --style=file --dry-run --Werror \
    --lines=8:8 --lines=281:286 --lines=288:290 \
    --lines=299:302 --lines=304:314 \
    quadrants/runtime/amdgpu/kernel_launcher.cpp
# exit=0

This is the same pre-existing-Linters infra issue we hit on PR #15 (deepsek's launch-overhead) and PR #26 (subgroup primitives). Linters fails on amd-integration-baseline regardless of the PR's diff because the same set of files (Dockerfile.rocm, codegen_amdgpu.cpp, compile_config.cpp, runtime.cpp, test_fn_attrs.py, etc.) have format issues that pre-date this PR. PR #15 merged through it.

@diptorupd
Collaborator

> This is the same pre-existing-Linters infra issue we hit on PR #15 (deepsek's launch-overhead) and PR #26 (subgroup primitives). Linters fails on amd-integration-baseline regardless of the PR's diff because the same set of files (Dockerfile.rocm, codegen_amdgpu.cpp, compile_config.cpp, runtime.cpp, test_fn_attrs.py, etc.) have format issues that pre-date this PR. PR #15 merged through it.

Yes, I am fixing all these in #29. Let me get that merged and then you can rebase.

@diptorupd
Collaborator

@lohiaj Please rebase on top of amd-integration. I just merged #29 and cleaned up all the linter issues

@diptorupd
Collaborator

@lohiaj I understand the intent and for the specific Genesis workload these changes make sense and are correct.

I have some concerns about the change being incorrect for other scenarios and introducing brittleness.

  1. The change is predicated on the assumption of a single default in-order stream. The skip is correct because device-resident state evolves under stream ordering on the default nullptr stream. The moment any kernel launch moves to a non-default stream — or runs concurrently with another stream that touches the same device arg buffer — there is a chance of subtle concurrency issues. The persistent dev_arg_buf is now shared across launches, so an in-flight kernel reading from it while the next launch's H2D writes the same buffer would need explicit synchronization. The old design didn't have this hazard.

  2. HIP graph capture. Current workload does not use HIP graphs, but some other use case might require turning on HIP Graphs and I was thinking of a potential issue that might occur there. During capture, the first launch in a sequence records both the H2D and the kernel as graph nodes. On the second launch within the same capture, the skip would fire and the H2D node would be missing from the captured graph. In current Genesis example, where every launch sees the same arg-buffer bytes anyway, this happens to be harmless. The bug surfaces in any captured sequence where the host wants different arg-buffer bytes for different sub-launches: only the first kernel sees the (re-)uploaded bytes, and the remaining kernels read whatever the first H2D left in dev_arg_buf. There's no error or assert — just numerically wrong results that are hard to localize, because the failure is structurally far away from this code.

  3. I would want to understand the motivation for moving the launch state out of thread_local. By removing thread_local for persistent_dev_arg_buf / persistent_dev_arg_buf_cap, the PR moves these (plus the new last_host_arg_buf) into launcher_ctx, which is shared per handle. If two host threads ever launch the same handle concurrently, they now race on dev_arg_buf, dev_arg_buf_cap, and last_host_arg_buf. I don't know whether the current call sites do that, but if the answer is "no, it's externally serialized," that contract should be stated explicitly — the old code didn't need it.
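The graph-capture concern in point 2 can be illustrated with a toy model. This is not the HIP API: "capture" here records closures instead of executing them, and `Graph`/`Handle` are invented names. The point is structural — the skip decision is made at capture time on host bytes, so a skipped H2D never becomes a graph node.

```cpp
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Toy stand-ins for a captured graph and a per-handle launcher context.
struct Graph { std::vector<std::function<void()>> nodes; };

struct Handle {
  std::vector<uint8_t> dev_arg_buf{0, 0};  // simulated device arg buffer
  std::vector<uint8_t> last_host;          // cached host bytes
};

// "Launch" under capture: each operation is recorded as a graph node.
void capture_launch(Handle& h, Graph& g, std::vector<uint8_t> host_args,
                    std::vector<uint8_t>& kernel_saw) {
  if (h.last_host != host_args) {  // cache miss: record the H2D as a node
    g.nodes.push_back([&h, host_args] {
      std::memcpy(h.dev_arg_buf.data(), host_args.data(), host_args.size());
    });
    h.last_host = host_args;
  }
  // Record the kernel node (here it just reads the device arg buffer).
  g.nodes.push_back([&h, &kernel_saw] { kernel_saw = h.dev_arg_buf; });
}
```

Capturing two launches with identical bytes yields three nodes (one H2D, two kernels) rather than four: the second H2D node simply never exists in the graph, so any later attempt to give the two sub-launches different arg bytes has nothing to update.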

