Notes and scripts for AMD profiling of dycore by iomaganaris · Pull Request #1047 · C2SM/icon4py

iomaganaris · 2026-02-05T16:19:24Z

This Pull Request includes scripts to benchmark and profile the dycore granule as well as one of its most time-consuming GT4Py Programs, the vertically_implicit_solver_at_predictor_step.

We'll keep this PR open for discussion and keep it up to date with improvements.

The PR includes the following important files:

AMD_INTRODUCTION.md: Includes (hopefully) all the information necessary to run the benchmark scripts for the dycore granule and the vertically_implicit_solver_at_predictor_step, as well as an introduction to icon4py, GT4Py and DaCe. There are also some suggestions on how to view and understand the generated code.
amd_scripts/install_icon4py_venv.sh: Script to install icon4py along with all the dependencies necessary to run the profilers
amd_scripts/benchmark_dycore.sh: Sbatch script for Beverin to run and time the GT4Py Programs of the dycore
amd_scripts/benchmark_solver.sh: Sbatch script for Beverin to benchmark and profile the vertically_implicit_solver_at_predictor_step. The profiles of the kernels generated by this GT4Py Program are the most interesting part, since improvements found there should carry over to most of the other dycore GT4Py Programs as well.

Currently based on #1018, which points to GT4Py/main (to be released as GT4Py v1.1.4 within the next week).

havogt · 2026-02-06T10:14:32Z

+fi
+
+# Install icon4py, gt4py, DaCe and other basic dependencies using uv
+uv sync --extra all --python $(which python3.12)


I would not install all the extras; instead, maybe we should properly add cupy-rocm7 as an extra to avoid line 29. I can work on that.

…into amd_profiling

…amd_profiling

…eedup
- Set gpu_block_size_2d=(256,1,1) for ROCm in model_options.py (verified: GT4Py Timer 0.763 → 0.604 ms median, 1000 runs)
- Add MI300A vs GH200 verified comparison (both at (256,1,1)): GH200 1.13x faster; saturates HBM at 89% peak vs MI300A 43%; MI300A's caches absorb 61% of demand bytes vs GH200's 43%
- Add block-size effect verified on MI300A: L2 hit rate jumps 15-20% → 32-50% on 2D heavy kernels with (256,1,1)
- Correct earlier "86 μs inter-kernel gap" claim: was a calculation artifact (rocprof kernel sum vs pytest-benchmark wall time); actual GPU inter-kernel gap is ~7-11 μs (1-2%)
- Re-label "Achieved HBM BW" → "Demand BW" (94% of peak is demand, not what HBM physically delivers)
- Add scripts: extract_pmc.py, GH200 benchmark/profile scripts, occupancy sweep, set_waves_per_eu utility
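
To put the timer figures in this commit message into perspective, the arithmetic can be sketched in a few lines of Python. The helper names below are hypothetical and not part of the PR. The gap calculation also assumes that the quoted "1-2%" is measured relative to the ~0.6 ms median runtime, which the commit message does not state explicitly.

```python
# Figures quoted in the commit message (GT4Py Timer medians over 1000 runs).
MEDIAN_BEFORE_MS = 0.763  # before setting gpu_block_size_2d=(256,1,1)
MEDIAN_AFTER_MS = 0.604   # after setting gpu_block_size_2d=(256,1,1)


def speedup(before: float, after: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before / after


def time_reduction(before: float, after: float) -> float:
    """Return the fraction of the original runtime that was saved."""
    return (before - after) / before


def gap_fraction(gap_us: float, runtime_ms: float) -> float:
    """Return an inter-kernel gap (in microseconds) as a fraction of a runtime (in ms).

    Assumption: the percentage in the commit message is relative to the
    median runtime of the whole program.
    """
    return gap_us / (runtime_ms * 1000.0)


if __name__ == "__main__":
    print(f"speedup:        {speedup(MEDIAN_BEFORE_MS, MEDIAN_AFTER_MS):.2f}x")
    print(f"time reduction: {time_reduction(MEDIAN_BEFORE_MS, MEDIAN_AFTER_MS):.1%}")
    for gap_us in (7.0, 11.0):
        share = gap_fraction(gap_us, MEDIAN_AFTER_MS)
        print(f"{gap_us:.0f} us gap:      {share:.1%} of {MEDIAN_AFTER_MS} ms")
```

Under these assumptions the block-size change corresponds to roughly a 1.26x speedup (about 21% less time), and a 7-11 μs gap is between 1% and 2% of the 0.604 ms median, consistent with the corrected claim.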

…s-platform data

…amd_profiling

Co-authored-by: Ioannis Magkanaris <ioannis.magkanaris@cscs.ch>

extract_pmc.py: rewrite with three cache-plane columns (L1<->L2, L2-Fabric, HBM) matching rocprof-compute analyze section 4.1.9 within ~1%, validated on 5 nodes across both clusters and rocprof-compute versions 3.4.0 and 4.x.
CLUSTER_NODE_VARIANCE.md: cross-cluster comparison (aac6 vs beverin), within-cluster silicon binning (13-26% spread), per-kernel HBM tables, firmware/driver comparison, reproducer commands, methodology gotchas.
setup_env.sh: uenv-agnostic (autodetect ROCM_VERSION + rocprofiler-dev path).
setup_env_rocm72.sh: beverin icon4py-rocm72 venv setup with gt4py PR 2578.
capture_env.sh, trace_power.sh: env capture and under-load power/clock trace.
Sbatch scripts (cluster-paired aac6/rocm72): gt4py_timer, rocprof_compute, trace_power. Plus job-capture-env.sh, job-rocprof-compute-improved.sh aac6 helpers.
benchmark_solver.sh: switch from .venv to .venv_rocm.
PROFILING_RESULTS.md, DEEP_ANALYSIS.md: updated extract_pmc references.

AMD audited the three aac6 nodes and found everything bit-identical except the GPU silicon IDs. The mechanism is APCC (the SMC's leakage-aware governor) backing off harder on the leakier chips. Cross-cluster gap is still open since aac6 and beverin run different SMC firmware vintages.

github-actions · 2026-05-12T18:22:51Z

Mandatory Tests

Please make sure you run these tests via comment before you merge!

cscs-ci run default
cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.

edopao and others added 11 commits January 29, 2026 10:52

update gt4py version

be4cee4

switch gt4py branch

6ecff32

update uv lock

1c9c744

edit import metrics

a1e753f

switch gt4py branch

b45b9b1

edit import metrics

517d122

edit import metrics

672b4f0

Merge branch 'main' into update_dace_version

9b2662d

Update DaCe version

532c125

Update the gt4py commit

991b6b8

Initial amd notes and scripts

f194d83

iomaganaris requested a review from havogt February 5, 2026 16:19

havogt and others added 5 commits February 5, 2026 17:20

Pre-compilation fix with_backend

1eb4708

Fixes to the notes

30fe86c

Additional comments in the scripts

4d13d82

Fix gtx_metrics

81e7a24

Clean up setup script

47e5e48