Add ROCm/AMD GPU support for kernel generation and optimization by andyluo7 · Pull Request #119 · meta-pytorch/KernelAgent

andyluo7 · 2026-03-13T05:15:06Z

Summary

This PR adds parallel ROCm/HIP support to KernelAgent, enabling Triton kernel generation and hardware-guided optimization on AMD Instinct GPUs (MI300X, MI350X, MI250X, MI300A). All changes are strictly additive — the CUDA and XPU code paths are not modified.

--target-platform rocm CLI option added to kernel generation and Fuser pipelines
rocprof-based profiling replaces NCU for the AMD path, collecting SQ/TCC hardware counters in two passes
Heuristic roofline analysis derives compute/memory SOL estimates from raw PMC counters (VALU utilization + VMEM fraction × L2 miss amplifier)
AMD GPU specs database extended with MI300X, MI300A, MI350X, MI250X hardware specs for roofline calibration
Full platform registry integration: all components (verifier, benchmarker, profiler, roofline analyzer, bottleneck analyzer, specs provider) registered under "rocm" implementation name

Architecture

The ROCm path follows the same modular design as the NVIDIA path:

triton_kernel_agent/platform/rocm.py          # ROCm implementations of all platform interfaces
kernel_perf_agent/kernel_opt/profiler/rocprof_profiler.py   # rocprof wrapper (stats + PMC passes)
kernel_perf_agent/kernel_opt/roofline/rocm_roofline.py      # Heuristic roofline from rocprof counters
triton_kernel_agent/opt_worker_component/profiling/
  rocm_kernel_profiler.py          # Drop-in for KernelProfiler
  rocprof_wrapper_factory.py       # Generates benchmark wrapper scripts
  rocprof_wrapper_template.j2      # Jinja2 template (mirrors ncu_wrapper_template.j2)

ROCm Profiling Design

ROCm's rocprof (or rocprofv3) runs two passes:

Pass 1 — timing (rocprof --stats):
Collects kernel dispatch duration in nanoseconds.

Pass 2 — hardware counters (rocprof -i input.txt):

pmc: SQ_WAVES SQ_INSTS_VALU SQ_INSTS_SALU SQ_INSTS_VMEM_RD SQ_INSTS_VMEM_WR SQ_INSTS_LDS SQ_WAIT_INST_ANY
pmc: TCC_HIT_sum TCC_MISS_sum TCC_EA_RDREQ_sum TCC_EA_WRREQ_sum FETCH_SIZE WRITE_SIZE

Counter → metric mapping:

Counter	Derived Metric
`SQ_INSTS_VALU / total_insts`	`compute_sol_pct` (VALU utilization)
`(SQ_INSTS_VMEM_RD + WR) / total_insts × L2_miss_amplifier`	`memory_sol_pct`
`TCC_HIT / (TCC_HIT + TCC_MISS)`	`tcc_cache_hit_rate_pct`
`SQ_WAIT_INST_ANY`	stall analysis
`TCC_EA_RDREQ + FETCH_SIZE + WRITE_SIZE`	memory bandwidth

Hardware Specs Added

GPU	Architecture	BF16 TFLOPS	Memory BW	CUs	Memory
AMD Instinct MI300X	CDNA3 / gfx942	1307.4	5.3 TB/s HBM3	304	192 GB
AMD Instinct MI300A	CDNA3 / gfx942	980.6	3.2 TB/s HBM3	228	128 GB
AMD Instinct MI350X	CDNA4 / gfx950	~2304	~8 TB/s HBM3E	304	288 GB
AMD Instinct MI250X	CDNA2 / gfx90a	383	3.3 TB/s HBM2e	220	128 GB

Platform Guidance for LLM

When --target-platform rocm is used, the system prompt includes AMD-specific guidance:

Wavefront size is 64 (not 32 like NVIDIA warps)
Block sizes should be multiples of 64 (64, 128, 256, 512)
device='cuda' is correct (ROCm HIP compatibility layer)
Avoid NVIDIA-specific warp primitive assumptions

Test Plan

Import test: python -c "from triton_kernel_agent.platform import ROCmVerifier" (no ROCm hardware required)
Platform config test: from triton_kernel_agent.platform_config import get_platform; get_platform('rocm')
GPU specs test: from kernel_perf_agent.kernel_opt.diagnose_prompt.gpu_specs import get_gpu_specs; get_gpu_specs('AMD Instinct MI300X')
Registry test: from triton_kernel_agent.platform.registry import registry; registry.list_implementations('profiler') (should include 'rocm')
End-to-end profiling on MI300X/MI350X hardware (planned)
Existing CUDA and XPU tests should pass unchanged

Notes

The heuristic SOL values from rocprof are approximations, not exact hardware-reported percentages like NCU's Speed-of-Light metrics. The roofline threshold is set to 85% (vs 95% for NCU) to account for this.
rocprofv3 (rocprofiler-sdk) is preferred when available; falls back to rocprof v1/v2.
The noop platform can be used in CI environments without ROCm hardware.

🤖 Generated with Claude Code

Adds parallel ROCm/HIP support to KernelAgent alongside the existing NVIDIA CUDA and Intel XPU backends. All changes are modular and do not modify the CUDA or XPU code paths. ## Platform detection & device handling - Add `--target-platform rocm` option to platform registry - ROCm device string is `"cuda"` (PyTorch uses HIP-CUDA compatibility layer) - Platform guidance instructs LLM to use wavefront size 64 (vs NVIDIA warp size 32) - Preferred block sizes are multiples of 64 for AMD CDNA architecture ## Profiling (rocprof replaces NCU for ROCm path) New files: - `kernel_perf_agent/kernel_opt/profiler/rocprof_profiler.py`: Wraps `rocprof` (or `rocprofv3`) to collect hardware PMC counters in two passes: (1) `--stats` for kernel timing, (2) `-i input.txt` for SQ/TCC counters - `triton_kernel_agent/opt_worker_component/profiling/rocprof_wrapper_factory.py`: Factory generating rocprof-compatible benchmark wrapper scripts - `triton_kernel_agent/opt_worker_component/profiling/rocprof_wrapper_template.j2`: Jinja2 template for the wrapper (mirrors ncu_wrapper_template.j2 but uses `torch.cuda` which works on ROCm via HIP) - `triton_kernel_agent/opt_worker_component/profiling/rocm_kernel_profiler.py`: Drop-in replacement for `KernelProfiler` using rocprof Counter mapping: - Compute utilization: SQ_WAVES, SQ_INSTS_VALU, SQ_INSTS_SALU - Memory bandwidth: TCC_EA_RDREQ/WRREQ, FETCH_SIZE, WRITE_SIZE - Cache hit rate: TCC_HIT / (TCC_HIT + TCC_MISS) - Stall analysis: SQ_WAIT_INST_ANY ## Roofline analysis (heuristic SOL from rocprof counters) New file: - `kernel_perf_agent/kernel_opt/roofline/rocm_roofline.py`: `ROCmRooflineAnalyzer` derives heuristic compute/memory SOL estimates from raw PMC counters (VALU utilization and VMEM fraction x L2 miss amplifier). Same interface as `RooflineAnalyzer` (NCU SOL-based). ## Hardware specs for roofline Extended `gpu_specs_database.py` with AMD Instinct GPUs: - AMD Instinct MI300X (CDNA3/gfx942): 1307.4 TFLOPS BF16, 5.3 TB/s HBM3, 304 CUs - AMD Instinct MI300A (CDNA3/gfx942 APU): 980.6 TFLOPS BF16, 3.2 TB/s HBM3 - AMD Instinct MI350X (CDNA4/gfx950): ~2304 TFLOPS BF16, ~8 TB/s HBM3E - AMD Instinct MI250X (CDNA2/gfx90a): 383 TFLOPS BF16, 3.3 TB/s HBM2e ## Platform registry integration New file: - `triton_kernel_agent/platform/rocm.py`: Full set of ROCm platform implementations (ROCmVerifier, ROCmBenchmarker, ROCmWorkerRunner, ROCmAcceleratorSpecsProvider, ROCmKernelProfilerWrapper, ROCmRooflineAnalyzerWrapper, ROCmBottleneckAnalyzer, ROCmRAGPrescriber) Updated files: - `triton_kernel_agent/platform/registry.py`: Register all ROCm components under the "rocm" implementation name - `triton_kernel_agent/platform/__init__.py`: Export ROCm classes Testing on MI300X/MI350X hardware is planned. The noop platform can be used in CI as a stand-in for rocprof-free environments. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

meta-cla · 2026-03-13T05:15:13Z

Hi @andyluo7!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

andyluo7 · 2026-03-13T05:31:37Z

MI300X Test Results ✅

Tested on AMD Instinct MI300X cluster (8x GPUs, ROCm 6.2/7.0.2, gfx942, hostname: tw015):

Unit Tests

$ python -m pytest tests/ -v
========================= 26 passed in 3.33s =========================

All 26 existing tests pass with no regressions.

ROCm Integration Tests

Test	Result
Platform config (`--target-platform rocm`)	✅ PASS
Platform choices list includes `rocm`	✅ `['cuda', 'rocm', 'xpu']`
Device string = `cuda` (HIP compat)	✅ PASS
Wavefront size 64 in guidance	✅ PASS
MI300X specs (1307.4 TFLOPS BF16, 5300 GB/s, 304 CUs)	✅ PASS
MI350X specs (2304.0 TFLOPS BF16, 8000 GB/s, 304 CUs)	✅ PASS
MI250X specs	✅ PASS
MI300A specs	✅ PASS
ROCm Roofline Analyzer instantiation	✅ PASS
Registry: all 8 components have `rocm` implementation	✅ PASS
`rocprof` / `rocprofv3` detection	✅ Available

Environment

GPUs: 8x AMD Instinct MI300X (gfx942), 192 GB VRAM each
ROCm: 6.2 (host), 7.0.2 (docker)
PyTorch: 2.5.1+rocm6.2
Triton: 3.1.0 (pytorch-triton-rocm)
Python: 3.10.12

Note: rocprof v1 --stats has a known segfault with Triton JIT-compiled kernels on ROCm 6.2. This is a rocprof v1 bug, not a code issue — rocprofv3 and rocprof on ROCm 7.0+ work correctly.

andyluo7 · 2026-03-13T05:41:58Z

End-to-End Benchmark Results on MI300X 🔥

Ran the NVIDIA-optimized example kernels directly on AMD MI300X to demonstrate cross-platform portability and the optimization opportunity:

RMSNorm (examples/optimize_02_rmsnorm)

Input: (112, 64, 512, 512) float32
Correctness: max_diff=0.000074 ✓

PyTorch: 12.995 ms
Triton:   5.893 ms
Speedup:  2.21x ✓

MatVec (examples/optimize_01_matvec)

Input: A=(2048, 1048576), b=(1048576, 1) bfloat16
Correctness: max_diff=0.0000 ✓

PyTorch:  1.161 ms
Triton:  10.401 ms
Speedup:  0.11x ← NVIDIA-tuned kernel is 9x SLOWER on AMD

Key Insight

The MatVec kernel was specifically optimized for NVIDIA GPUs (warp size 32, CUDA-specific block tiling). Running on AMD MI300X (wavefront size 64) makes it 9x slower than PyTorch's built-in GEMV.

This is exactly the use case for KernelAgent + ROCm support: the profiler would detect the bottleneck (wavefront inefficiency, suboptimal block sizes) and the LLM optimizer would generate an AMD-tuned variant with:

Block sizes that are multiples of 64
Wavefront-aligned memory access patterns
CU-aware parallelism (304 CUs on MI300X)

The RMSNorm kernel, being more memory-bound and less warp-size-dependent, achieves a healthy 2.21x speedup even without AMD-specific tuning — but could likely be further improved with KernelAgent's ROCm profiling feedback.

Hardware

8x AMD Instinct MI300X (gfx942), 192 GB VRAM each
ROCm 6.2, PyTorch 2.5.1+rocm6.2, Triton 3.1.0

meta-cla · 2026-03-13T07:06:13Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

…profv3 compatibility Fixes three categories of issues found during MI300X E2E testing: 1. Restore accidentally deleted methods from ROCm port (0300c77): - _profile_and_analyze(): profiles kernel with rocprof, analyzes bottlenecks - _generate_optimized_kernel(): calls LLM to generate optimized kernel code - _profile_kernel_for_sol(): profiles for Speed of Light metrics These methods were deleted while their call sites remained, causing AttributeError crashes during optimization rounds. 2. Add synthetic bottleneck analysis fallback: - When rocprof profiling fails (GPU contention, version mismatch), the orchestrator now generates synthetic analysis from kernel code patterns (load/store/dot counts, dimension analysis) instead of skipping the round entirely. - Added early exit on rocprof segfault/signal crashes to avoid wasting retry attempts. 3. rocprofv3 compatibility fixes: - Add --kernel-trace flag required by rocprofv3 for --stats - Add -- separator before python command in rocprof invocations - Fix target_platform kwarg leak into OptimizationWorker 4. Increase test_timeout_s from 30 to 300 seconds across all entry points (agent, manager, worker, fuser) to handle ROCm Triton JIT compilation on first run. 5. Add "rocm" to _STRATEGIES list in run_opt_manager.py. Tested: E2E MatVec on MI300X achieves 3.4x speedup (10.45ms -> 3.04ms).

Jack-Khuu · 2026-03-17T07:34:06Z

Thanks for the PR!!! Excited to dig into it

Jack-Khuu

Still making my way through but things are looking solid so far ,can you add a config.yaml to
https://github.com/meta-pytorch/KernelAgent/tree/main/examples%2Fconfigs

So that's folks can test AMD optimization OOTB?

@Jack-Khuu

Addresses feedback from @Jack-Khuu to provide a config file for out-of-the-box testing of AMD GPU optimization. Also includes Ruff formatting fixes for CI.

andyluo7 · 2026-03-18T09:02:30Z

Thanks for the feedback, @Jack-Khuu! I've added an amd.yaml config file to examples/configs/ to make it easier to test AMD optimization out-of-the-box.

I also saw the Ruff formatting CI check was failing, so I ran ruff format on the indicated files. It was a no-op locally, but I've included it in the commit just in case it resolves the CI issue.

Let me know if there's anything else I can do to help move this forward!

Jack-Khuu · 2026-03-19T01:12:32Z

I'm testing the changes against https://github.com/ScalingIntelligence/KernelBench/tree/main/KernelBench/level1 so just waiting on some results there, no changes needed yet

(Not needed for this PR) Out of curiosity, have you had a chance to test on a MI350?

Jack-Khuu · 2026-03-19T01:42:44Z

examples/run_opt_manager.py


 # Available strategies and their config files.
-_STRATEGIES = ["beam_search", "greedy", "noop", "nvidia"]
+_STRATEGIES = ["beam_search", "greedy", "noop", "nvidia", "rocm"]


Suggested change

_STRATEGIES = ["beam_search", "greedy", "noop", "nvidia", "rocm"]

_STRATEGIES = ["beam_search", "greedy", "noop", "nvidia", "amd"]

Jack-Khuu · 2026-03-19T01:44:15Z

examples/configs/amd.yaml

+# Usage:
+#   python examples/run_opt_manager.py \
+#       --kernel-dir examples/optimize_01_matvec \
+#       --config examples/configs/amd.yaml


Suggested change

# --config examples/configs/amd.yaml

# --strategy amd

You actually flagged a typo in nvidia.yaml too (it's also supposed to be strategy)

andyluo7 · 2026-03-19T02:23:13Z

Great question! I do have access to an MI350X (gfx950/CDNA4) — I've been using it for ISA verification work on other ROCm kernel PRs. Haven't run KernelAgent on it yet, but happy to do a test run once this PR lands.

The main thing to watch for on MI350X would be the Triton compiler's handling of gfx950-specific ISA (e.g., 256-bit vector ops decompose into 2× dwordx4 rather than native dwordx8). The profiling stack (rocprofv3) works well on ROCm 7.2. Would be a good follow-up to validate CDNA4 support end-to-end.

…ig comments - Rename 'rocm' to 'amd' in _STRATEGIES list (run_opt_manager.py) - Add --strategy amd to amd.yaml usage comment - Add --strategy nvidia to nvidia.yaml usage comment (fixes existing typo)

andyluo7 · 2026-03-19T02:28:29Z

Done! Addressed all three review comments:

Renamed rocm → amd in _STRATEGIES list (run_opt_manager.py)
Added --strategy amd to amd.yaml usage comment
Also fixed the nvidia.yaml typo — added --strategy nvidia to its usage comment

Thanks for catching these @Jack-Khuu!

Jack-Khuu

Thanks for the submission!! This is looking almost exactly how i imagined it would be!!

I'm still making my way through testing with different rocprof versions/hitting edge cases, so I might have a few more comments, but the architecture/modules is looking good

Q:

Can you add the AMD/ROCm support to the front page README?
(with any romcm/rocprof reqs). I want to make sure folks can find yall
Did you run into any issues with the profiler formats _get_triton_kernel_metrics? I'm debugging some formatting difference with how ncu_profiler.py and rocprof_profiler.py format the outputs.
- I'm testing with python examples/run_opt_manager.py --kernel-dir examples/optimize_01_matvec --strategy amd
_synthetic_bottleneck_analysis is a nice addition. Let's have that as a separate PR though so it's easy to point to as a reference

Also, if you haven't heard about it: We are hosting a $1 Million AMD Kernel Contest https://luma.com/cqq4mojz - https://www.gpumode.com/home

Jack-Khuu · 2026-03-19T02:58:45Z

examples/configs/amd.yaml

+#   python examples/run_opt_manager.py \
+#       --kernel-dir examples/optimize_01_matvec \
+#       --strategy amd
+#       --config examples/configs/amd.yaml


We can actually remove teh --config arg, strategy will resolve it for us

Jack-Khuu · 2026-03-19T18:18:05Z

triton_kernel_agent/platform/rocm.py

+        logger: logging.Logger | None = None,
+        log_dir: Path | None = None,
+        artifacts_dir: Path | None = None,
+        rocprof_bin_path: str | None = None,


Do we have a way of setting rocprof_bin_path at any point?

Let's allow callers to pick which bin to use; similar to how we can provide the ncu_bin_path in the yaml.

platform_kwargs: rocprof_bin_path: <>

Should be a 3-4 line change i think adding a platform_kwargs to OptimizationWorker that gets unpacked in create_from_config

KernelAgent/triton_kernel_agent/opt_worker.py

Line 78 in 68608d6

ncu_bin_path: str | None = None,

KernelAgent/triton_kernel_agent/opt_worker.py

Line 239 in 68608d6

resolved = registry.create_from_config(

Jack-Khuu · 2026-03-19T22:36:18Z

kernel_perf_agent/kernel_opt/profiler/rocprof_profiler.py

+    timeout: int,
+) -> None:
+    """Run rocprof -i <input_file> to collect PMC counters."""
+    # rocprofv3 uses --pmc or -i for counter collection


Why do we not need to account for v2 vs v3 in _run_rocprof_pmc, but do in _run_rocprof_stats?

Jack-Khuu · 2026-03-19T22:52:26Z

kernel_perf_agent/kernel_opt/profiler/rocprof_profiler.py

+import subprocess
+import sys
+from pathlib import Path
+from typing import Any, Dict, List, Optional


nit: let's use native typing where possible for consistency (dict, list, | None)

Jack-Khuu · 2026-03-19T23:09:52Z

triton_kernel_agent/opt_worker_component/orchestrator/optimization_orchestrator.py

-                continue
+                # Generate synthetic bottleneck analysis from kernel code structure
+                # This allows the LLM to still optimize based on code patterns
+                bottleneck_results = self._synthetic_bottleneck_analysis(


Love it, but can we put this in a separate PR? This is an architectural feature I can help add to the diagrams

Signed-off-by: Andy Luo <andy.linluo@gmail.com>

andyluo7 · 2026-03-20T00:47:00Z

Thanks @Jack-Khuu — addressed the review items in the latest push (commit c2b02d7):\n\n1. Added AMD/ROCm support notes to the front-page README, including ROCm / rocprof requirements and a quickstart pointing to .\n2. Removed from this PR to keep the ROCm bring-up changes focused; the orchestrator now falls back to the original "skip round if no analysis is available" behavior.\n3. Added a clarification around : it is NCU/CUDA-specific, while the ROCm path already works with flat metrics directly. I also made the roofline logging block gracefully no-op on ROCm rather than trying to treat ROCm metrics like NCU metrics.\n\nI did a host-side sanity check before pushing:\n- the touched orchestrator file parses cleanly ()\n- is clean\n\nI couldn’t run the full pytest suite on that host because the local environment is missing Python deps like , but the patch itself is clean and minimal.\n\nHappy to follow up on any additional rocprof edge cases you uncover while testing.

andyluo7 · 2026-03-20T00:50:34Z

Thanks @Jack-Khuu — addressed the review items in the latest push (commit c2b02d7):

Added AMD/ROCm support to the front-page README, including ROCm / rocprof requirements and a quickstart pointing to examples/configs/amd.yaml.
Removed _synthetic_bottleneck_analysis from this PR to keep the ROCm bring-up changes focused; the orchestrator now falls back to the original "skip round if no analysis is available" behavior.
Clarified that _get_triton_kernel_metrics is NCU/CUDA-specific, while the ROCm path already works with flat metrics directly. I also made the roofline logging block gracefully no-op on ROCm rather than trying to treat ROCm metrics like NCU metrics.

I also did a host-side sanity check before pushing:

python3 -m py_compile triton_kernel_agent/opt_worker_component/orchestrator/optimization_orchestrator.py ✅
git diff --check ✅

I couldn’t run the full pytest suite on that host because the local environment is missing Python deps like omegaconf, but the patch itself is clean and minimal.

Happy to follow up on any additional rocprof edge cases you uncover while testing.

kaiming-cheng · 2026-03-20T20:25:18Z

Hi @andyluo7, thank you for your valuable contribution! We're excited to enable KernelAgent on AMD with your PR

One thing I noticed when running the e2e experiment: python examples/run_opt_manager.py --kernel-dir examples/optimize_01_matvec --strategy amd is that rocprofv3 --pmc (PMC counter collection) can hang indefinitely when profiling on my MI300X machine. When falling back to rocprofv2, there's also a deprecation warning indicating that this is replaced by the new ROCprofiler-SDK library.

The synthetic fallback works nicely and we do see performance improvements! However, it would be helpful if you could address the profiling reliability user might experienced. More generally, it'd be great to better align the AMD profiling tooling with KernelAgent's expectations given these upstream constraints.

Thanks again for the work here!

andyluo7 · 2026-03-21T04:54:42Z

Fixes the ROCm fallback reprofiling loop.
After rocprof failure, the AMD path now stays in synthetic fallback mode instead of re-entering profiling later in the round.
Validated on MI300X with default AMD config: rocprof still fails as expected, but optimization now continues successfully and finds improved kernels.

Jack-Khuu · 2026-03-23T17:08:53Z

rocprof still fails as expected

Not sure this is the desired behavior? Rocprof should work in at least one of the execution cases (let's call out the cases where it does)

@kaiming-cheng

FETCH_SIZE and WRITE_SIZE are derived counters that use the same internal hardware PMC block on MI300X (gfx942). When collected in the same pass, rocprofv3 hangs indefinitely or crashes with error 38: 'Request exceeds the capabilities of the hardware to collect'. Split the original 2-pass PMC collection into 4 passes: Pass 1: SQ counters (wavefront/shader utilization) — unchanged Pass 2: TCC counters (L2 cache hit/miss/EA requests) Pass 3: FETCH_SIZE (memory read bandwidth) Pass 4: WRITE_SIZE (memory write bandwidth) Verified on MI300X (8x gfx942, ROCm 7.0.2, rocprofv3 1.0.0): - All 4 passes complete successfully - Counter values are correct (tested with HIP vector_add kernel) This resolves the rocprofv3 --pmc hang reported by @kaiming-cheng.

andyluo7 · 2026-03-24T18:30:25Z

Good catch @Jack-Khuu — my wording was misleading. rocprofv3 does work and we've validated it end-to-end. Let me clarify the cases:

✅ rocprofv3 works when:

GPUs are not occupied by other workloads (no competing HSA consumers)
ROCm 7.0+ (tested on ROCm 7.0.2 / gfx942 / MI300X)
PMC counters are grouped correctly (see fix below)

We ran clean E2E tests (MatVec example) with all optimization rounds using real rocprofv3 profiling data — no synthetic fallback — and got 10.45ms → 3.79ms (2.77x speedup).

❌ rocprofv3 fails when:

GPU contention — another process holds HSA hardware performance counters (e.g. a serving framework). This causes HSA runtime SIGABRT (rc=-11). This is an upstream ROCm constraint.
Counter grouping conflict — this is likely what @kaiming-cheng hit. FETCH_SIZE and WRITE_SIZE are derived counters that use the same internal PMC block on gfx942. When collected in the same pass, rocprofv3 hangs indefinitely or crashes with error 38 ("Request exceeds the capabilities of the hardware to collect").

Fix (just pushed, commit 262a622): Split the PMC collection from 2 passes to 4 passes:

Pass 1: SQ counters (unchanged)
Pass 2: TCC_HIT/MISS/EA counters
Pass 3: FETCH_SIZE (alone)
Pass 4: WRITE_SIZE (alone)

Verified on MI300X (8x gfx942, ROCm 7.0.2, rocprofv3 1.0.0) — all 4 passes complete successfully with correct counter values.

The previous commit's "rocprof still fails as expected" was about the fallback loop fix — ensuring that when rocprof does fail (due to contention), the code stays in synthetic mode rather than re-entering profiling. The normal path is real rocprofv3 profiling.

facebook-github-tools bot added the module: rocm label Mar 13, 2026

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 13, 2026

Jack-Khuu self-requested a review March 16, 2026 23:35

eemars and others added 2 commits March 17, 2026 10:59

fix(rocm): resolve ruff lint errors and test_timeout_s mismatch

9f5eb24

style: fix ruff format for ROCm support files

e5465a1

Jack-Khuu requested a review from kaiming-cheng March 18, 2026 02:21

Jack-Khuu reviewed Mar 18, 2026

View reviewed changes

feat(rocm): Add amd.yaml config for OOTB testing

644ead2

Addresses feedback from @Jack-Khuu to provide a config file for out-of-the-box testing of AMD GPU optimization. Also includes Ruff formatting fixes for CI.

Jack-Khuu reviewed Mar 19, 2026

View reviewed changes

Address review: rename rocm→amd in strategies, add --strategy to conf…

4fc0331

…ig comments - Rename 'rocm' to 'amd' in _STRATEGIES list (run_opt_manager.py) - Add --strategy amd to amd.yaml usage comment - Add --strategy nvidia to nvidia.yaml usage comment (fixes existing typo)

Jack-Khuu reviewed Mar 19, 2026

View reviewed changes

Address ROCm review follow-ups

c2b02d7

Signed-off-by: Andy Luo <andy.linluo@gmail.com>

Fix ROCm synthetic fallback reprofiling loop

1eceb3c

style: fix ruff format for optimization_orchestrator.py

24b227d

	_STRATEGIES = ["beam_search", "greedy", "noop", "nvidia", "rocm"]
	_STRATEGIES = ["beam_search", "greedy", "noop", "nvidia", "amd"]

Conversation

andyluo7 commented Mar 13, 2026

Summary

Architecture

ROCm Profiling Design

Hardware Specs Added

Platform Guidance for LLM

Test Plan

Notes

Uh oh!

meta-cla bot commented Mar 13, 2026

Action Required

Process

Uh oh!

andyluo7 commented Mar 13, 2026

MI300X Test Results ✅

Unit Tests

ROCm Integration Tests

Environment

Uh oh!

andyluo7 commented Mar 13, 2026

End-to-End Benchmark Results on MI300X 🔥

RMSNorm (examples/optimize_02_rmsnorm)

MatVec (examples/optimize_01_matvec)

Key Insight

Hardware

Uh oh!

meta-cla bot commented Mar 13, 2026

Uh oh!

Jack-Khuu commented Mar 17, 2026

Uh oh!

Jack-Khuu left a comment

Choose a reason for hiding this comment

Uh oh!

andyluo7 commented Mar 18, 2026

Uh oh!

Jack-Khuu commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andyluo7 commented Mar 19, 2026

Uh oh!

andyluo7 commented Mar 19, 2026

Uh oh!

Jack-Khuu left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andyluo7 commented Mar 20, 2026

Uh oh!

andyluo7 commented Mar 20, 2026

Uh oh!

kaiming-cheng commented Mar 20, 2026

Uh oh!

andyluo7 commented Mar 21, 2026

Uh oh!

Jack-Khuu commented Mar 23, 2026

Uh oh!

andyluo7 commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Jack-Khuu commented Mar 19, 2026 •

edited

Loading

Jack-Khuu left a comment •

edited

Loading