Notes and scripts for AMD profiling of dycore #1047
Draft
iomaganaris wants to merge 93 commits into
havogt
reviewed
Feb 6, 2026
fi

# Install icon4py, gt4py, DaCe and other basic dependencies using uv
uv sync --extra all --python $(which python3.12)
Contributor
I would not install all the extras but maybe we properly add cupy-rocm7 as an extra to avoid line 29. I can work on that.
…into amd_profiling
…eedup
- Set gpu_block_size_2d=(256,1,1) for ROCm in model_options.py (verified: GT4Py Timer 0.763 → 0.604 ms median, 1000 runs)
- Add MI300A vs GH200 verified comparison (both at (256,1,1)): GH200 is 1.13x faster; it saturates HBM at 89% of peak vs 43% on MI300A, while MI300A's caches absorb 61% of demand bytes vs 43% on GH200
- Add block-size effect verified on MI300A: L2 hit rate jumps from 15-20% to 32-50% on 2D heavy kernels with (256,1,1)
- Correct the earlier "86 μs inter-kernel gap" claim: it was a calculation artifact (rocprof kernel sum vs pytest-benchmark wall time); the actual GPU inter-kernel gap is ~7-11 μs (1-2%)
- Re-label "Achieved HBM BW" → "Demand BW" (94% of peak is demand, not what HBM physically delivers)
- Add scripts: extract_pmc.py, GH200 benchmark/profile scripts, occupancy sweep, set_waves_per_eu utility
Co-authored-by: Ioannis Magkanaris <ioannis.magkanaris@cscs.ch>
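As a quick sanity check on the numbers in the commit message above, the following minimal sketch recomputes the speedup implied by the two GT4Py Timer medians. The variable names are illustrative only, and the last two lines assume that the quoted "1-2%" inter-kernel gap is measured relative to the 0.604 ms median, which the commit message does not state explicitly.

```python
# Medians quoted in the commit message (GT4Py Timer, 1000 runs), in milliseconds.
baseline_ms = 0.763  # before setting gpu_block_size_2d=(256,1,1)
tuned_ms = 0.604     # after the change

speedup = baseline_ms / tuned_ms
reduction = (baseline_ms - tuned_ms) / baseline_ms
print(f"speedup: {speedup:.2f}x, runtime reduction: {reduction:.1%}")

# ASSUMPTION: the "1-2%" gap figure is relative to the tuned median (604 us).
gap_fraction_low = 7.0 / (tuned_ms * 1000.0)
gap_fraction_high = 11.0 / (tuned_ms * 1000.0)
print(f"inter-kernel gap: {gap_fraction_low:.1%} to {gap_fraction_high:.1%}")
```

Under that assumption, 7-11 μs lands inside the quoted 1-2% range, which is consistent with the corrected claim.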
extract_pmc.py: rewrite with three cache-plane columns (L1<->L2, L2-Fabric, HBM) matching rocprof-compute analyze section 4.1.9 within ~1%, validated on 5 nodes across both clusters and rocprof-compute versions 3.4.0 and 4.x.
CLUSTER_NODE_VARIANCE.md: cross-cluster comparison (aac6 vs beverin), within-cluster silicon binning (13-26% spread), per-kernel HBM tables, firmware/driver comparison, reproducer commands, methodology gotchas.
setup_env.sh: uenv-agnostic (autodetect ROCM_VERSION + rocprofiler-dev path).
setup_env_rocm72.sh: beverin icon4py-rocm72 venv setup with gt4py PR 2578.
capture_env.sh, trace_power.sh: env capture and under-load power/clock trace.
Sbatch scripts (cluster-paired aac6/rocm72): gt4py_timer, rocprof_compute, trace_power. Plus job-capture-env.sh, job-rocprof-compute-improved.sh aac6 helpers.
benchmark_solver.sh: switch from .venv to .venv_rocm.
PROFILING_RESULTS.md, DEEP_ANALYSIS.md: updated extract_pmc references.
AMD audited the three aac6 nodes and found everything bit-identical except the GPU silicon IDs. The mechanism is APCC (the SMC's leakage-aware governor) backing off harder on the leakier chips. Cross-cluster gap is still open since aac6 and beverin run different SMC firmware vintages.
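One way a within-cluster spread like the 13-26% figure could be computed is sketched below. This is an assumption about the metric, not the definition used in CLUSTER_NODE_VARIANCE.md: it takes the median timing per node and reports (slowest - fastest) / fastest. The function name `relative_spread`, the node labels, and the timings are all hypothetical placeholders, not measurements from this PR.

```python
from statistics import median


def relative_spread(per_node_ms):
    """Return (slowest - fastest) / fastest over the per-node median timings."""
    medians = [median(samples) for samples in per_node_ms.values()]
    fastest = min(medians)
    slowest = max(medians)
    return (slowest - fastest) / fastest


# Placeholder timings in ms, NOT data from aac6 or beverin.
example = {
    "node-a": [1.00, 1.02, 0.98],
    "node-b": [1.10, 1.12, 1.08],
    "node-c": [1.20, 1.22, 1.18],
}
print(f"spread: {relative_spread(example):.0%}")
```

Using medians rather than means keeps a single slow run from inflating the per-node value, which matters when the effect being measured is a steady clock back-off rather than outliers.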
Mandatory Tests
Please make sure you run these tests via comment before you merge!
Optional Tests
To run benchmarks you can use:
To run tests and benchmarks with the DaCe backend you can use:
To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:
For more detailed information please look at CI in the EXCLAIM universe.
This Pull Request includes scripts to benchmark and profile the dycore granule as well as one of its most time-consuming GT4Py Programs, the vertically_implicit_solver_at_predictor_step.
We'll keep this PR open for interaction and keep it up to date with improvements.
The PR includes the following important files:
- AMD_INTRODUCTION.md: Includes (hopefully) all the information necessary to run the benchmark scripts for the dycore granule and the vertically_implicit_solver_at_predictor_step, as well as an introduction to icon4py, GT4Py and DaCe. There are also some suggestions on how to view and understand the generated code
- amd_scripts/install_icon4py_venv.sh: Script to install icon4py along with all the dependencies necessary to run the profilers
- amd_scripts/benchmark_dycore.sh: Sbatch script for Beverin to run and time the GT4Py Programs of the dycore
- amd_scripts/benchmark_solver.sh: Sbatch script for Beverin to benchmark and profile the vertically_implicit_solver_at_predictor_step. Profiling the kernels generated by this GT4Py Program is the most interesting topic, since improvements there should carry over to most of the other dycore GT4Py Programs as well
Currently based on #1018, which points to GT4Py/main (which will become GT4Py v1.1.4 in the next week).