Notes and scripts for AMD profiling of dycore#1047

Draft
iomaganaris wants to merge 93 commits into main from amd_profiling

@iomaganaris iomaganaris commented Feb 5, 2026

This pull request adds scripts to benchmark and profile the dycore granule, as well as one of its most time-consuming GT4Py Programs, vertically_implicit_solver_at_predictor_step.

We'll keep this PR open for interaction and keep it up-to-date with improvements.

The PR includes the following important files:

  • AMD_INTRODUCTION.md: Includes (hopefully) all the information necessary to run the benchmark scripts for the dycore granule and the vertically_implicit_solver_at_predictor_step, as well as an introduction to icon4py, GT4Py and DaCe. It also gives some suggestions on how to view and understand the generated code
  • amd_scripts/install_icon4py_venv.sh: Script to install icon4py along with all the dependencies necessary to run the profilers
  • amd_scripts/benchmark_dycore.sh: Sbatch script for Beverin to run and time the GT4Py Programs of the dycore
  • amd_scripts/benchmark_solver.sh: Sbatch script for Beverin to benchmark and profile the vertically_implicit_solver_at_predictor_step. The profiles of the kernels generated by this GT4Py Program are the most interesting starting point, since optimizations found there should also improve performance across most of the other dycore GT4Py Programs

Currently based on #1018, which points to GT4Py/main (which will become GT4Py v1.1.4 within the next week).

@iomaganaris iomaganaris requested a review from havogt February 5, 2026 16:19
Comment thread install_icon4py_uenv.sh Outdated
fi

# Install icon4py, gt4py, DaCe and other basic dependencies using uv
uv sync --extra all --python $(which python3.12)

I would not install all the extras. Instead, we could properly add cupy-rocm7 as an extra to avoid line 29. I can work on that.

dganellari and others added 29 commits April 18, 2026 00:38
…eedup

- Set gpu_block_size_2d=(256,1,1) for ROCm in model_options.py
  (verified: GT4Py Timer 0.763 → 0.604 ms median, 1000 runs)
- Add MI300A vs GH200 verified comparison (both at (256,1,1)):
  GH200 1.13x faster — saturates HBM at 89% peak vs MI300A 43%;
  MI300A's caches absorb 61% of demand bytes vs GH200's 43%
- Add block-size effect verified on MI300A:
  L2 hit rate jumps 15-20% → 32-50% on 2D heavy kernels with (256,1,1)
- Correct earlier "86 μs inter-kernel gap" claim — was a calculation
  artifact (rocprof kernel sum vs pytest-benchmark wall time);
  actual GPU inter-kernel gap is ~7-11 μs (1-2%)
- Re-label "Achieved HBM BW" → "Demand BW" (94% of peak is demand,
  not what HBM physically delivers)
- Add scripts: extract_pmc.py, GH200 benchmark/profile scripts,
  occupancy sweep, set_waves_per_eu utility
Co-authored-by: Ioannis Magkanaris <ioannis.magkanaris@cscs.ch>
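
The correction of the "86 μs inter-kernel gap" claim in the commit message above can be illustrated with a minimal sketch. It assumes a kernel trace is available as (start, end) timestamp pairs in microseconds on a single stream; the function names and all numbers are hypothetical and are not taken from the scripts in this PR. Subtracting the summed kernel time from a host-side wall time folds launch and Python overhead into the "gap", whereas the idle time between consecutive kernels on the device is much smaller.

```python
def inter_kernel_gap_us(intervals):
    """Sum the idle time between consecutive kernels on the device.

    `intervals` is a list of (start_us, end_us) pairs from a kernel trace
    (assumed format, single stream). Overlapping kernels contribute no gap.
    """
    ordered = sorted(intervals)
    gap = 0.0
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap += max(0.0, next_start - prev_end)
    return gap


def wall_minus_kernel_sum_us(wall_time_us, intervals):
    """Host wall time minus summed kernel durations.

    This mixes two different clocks and includes host-side overhead,
    so it is not a measure of the GPU inter-kernel gap.
    """
    kernel_sum = sum(end - start for start, end in intervals)
    return wall_time_us - kernel_sum


if __name__ == "__main__":
    # Hypothetical trace of three kernels and a hypothetical host wall time.
    trace = [(0.0, 100.0), (103.0, 250.0), (254.0, 600.0)]
    host_wall_time_us = 680.0
    print("device inter-kernel gap [us]:", inter_kernel_gap_us(trace))
    print("wall time minus kernel sum [us]:",
          wall_minus_kernel_sum_us(host_wall_time_us, trace))
```

With these made-up numbers the two quantities differ by an order of magnitude, which is the kind of discrepancy the commit message describes as a calculation artifact.
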
extract_pmc.py: rewrite with three cache-plane columns (L1<->L2,
L2-Fabric, HBM) matching rocprof-compute analyze section 4.1.9 within
~1%, validated on 5 nodes across both clusters and rocprof-compute
versions 3.4.0 and 4.x.

CLUSTER_NODE_VARIANCE.md: cross-cluster comparison (aac6 vs beverin),
within-cluster silicon binning (13-26% spread), per-kernel HBM tables,
firmware/driver comparison, reproducer commands, methodology gotchas.

setup_env.sh: uenv-agnostic (autodetect ROCM_VERSION + rocprofiler-dev
path).
setup_env_rocm72.sh: beverin icon4py-rocm72 venv setup with gt4py PR
2578.
capture_env.sh, trace_power.sh: env capture and under-load power/clock
trace.

Sbatch scripts (cluster-paired aac6/rocm72): gt4py_timer,
rocprof_compute, trace_power. Plus job-capture-env.sh,
job-rocprof-compute-improved.sh aac6 helpers.

benchmark_solver.sh: switch from .venv to .venv_rocm.
PROFILING_RESULTS.md, DEEP_ANALYSIS.md: updated extract_pmc references.
AMD audited the three aac6 nodes and found everything bit-identical except
the GPU silicon IDs. The mechanism is APCC (the SMC's leakage-aware
governor) backing off harder on the leakier chips. Cross-cluster gap is
still open since aac6 and beverin run different SMC firmware vintages.
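
A within-cluster spread like the 13-26% quoted above could be computed along the following lines. This is only a sketch: it assumes the spread is defined as (slowest - fastest) / fastest over per-node median timings, which the commit messages do not state, and the node names and timings are hypothetical.

```python
from statistics import median


def relative_spread(per_node_ms):
    """Return (spread, medians) for a mapping of node name to timing samples.

    The spread is (slowest median - fastest median) / fastest median.
    This definition is an assumption, not necessarily the one used in
    CLUSTER_NODE_VARIANCE.md.
    """
    medians = {node: median(samples) for node, samples in per_node_ms.items()}
    fastest = min(medians.values())
    slowest = max(medians.values())
    return (slowest - fastest) / fastest, medians


if __name__ == "__main__":
    # Hypothetical per-node timings in milliseconds.
    timings = {
        "node_a": [1.00, 1.02, 0.98],
        "node_b": [1.10, 1.12, 1.11],
        "node_c": [1.20, 1.19, 1.21],
    }
    spread, medians = relative_spread(timings)
    print(f"per-node medians: {medians}")
    print(f"relative spread: {spread:.1%}")
```

Using medians rather than means keeps a single slow run on one node from inflating the reported spread.
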
@github-actions

Mandatory Tests

Please make sure you run these tests via comment before you merge!

  • cscs-ci run default
  • cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

  • cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

  • cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

  • cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.


7 participants