
[CI] add mi300 pipeline #669

Open

chaoos wants to merge 7 commits into master from feature/cicd-mi300

Conversation

@chaoos (Contributor) commented Mar 9, 2026

This PR adds the tmLQCD-side requirements for CI/CD testing on the CSCS test system "beverin", which hosts AMD MI300A GPUs.

@chaoos (Contributor, Author) commented Mar 9, 2026

cscs-ci run beverin

@chaoos (Contributor, Author) commented Mar 9, 2026

This currently fails due to missing access to the beverin test system at CSCS.

@chaoos (Contributor, Author) commented Mar 10, 2026

A manual build using the QUDA branch feature/prefetch2 did work, see https://cicd-ext-mw.cscs.ch/ci/pipeline/results/3690753405420143/64239695/2375574724?iid=1711

I used this command on beverin:

uenv build .ci/uenv-recipes/tmlqcd/beverin-mi300 tmlqcd/quda-prefetch2@beverin%mi300

with this Spack spec for QUDA:

  specs:
  - "quda@git.feature/prefetch2 +qdp +multigrid +twisted_clover +twisted_mass"
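
For context, an environments.yaml fragment in a uenv recipe carrying this spec might look roughly as follows. Only the quda spec line is taken from this PR; the environment name, compiler, and MPI settings are assumptions for illustration, not the actual recipe in .ci/uenv-recipes/tmlqcd/beverin-mi300:

```yaml
# Hypothetical sketch of a uenv recipe environment; only the quda spec line
# comes from this PR, everything else is illustrative.
tmlqcd-env:
  compiler:
  - toolchain: gcc
    spec: gcc@13
  mpi:
    spec: cray-mpich
    gpu: rocm
  unify: true
  specs:
  - "quda@git.feature/prefetch2 +qdp +multigrid +twisted_clover +twisted_mass"
  views:
    default:
```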

The image is available in the service namespace of CSCS's uenv registry:

$ uenv image find service::
uenv                                       arch   system   id                size(MB)  date
tmlqcd/quda-prefetch2:2375574724           mi300  beverin  8f0acefe49988d34   3,857    2026-03-10

But the gcc compiler in the uenv is broken 😵:

$ uenv start tmlqcd --view=default
$ gcc --version
Illegal instruction (core dumped)

@mtaillefumier (Contributor)

I learned that the mi300 cluster uses a different authentication mechanism; that's why it is failing.

@mtaillefumier (Contributor)

cscs-ci run beverin

@mtaillefumier (Contributor)

$ uenv image find service::
uenv                                       arch   system   id                size(MB)  date
tmlqcd/quda-prefetch2:2375574724           mi300  beverin  8f0acefe49988d34   3,857    2026-03-10

But the gcc compiler in the uenv is broken 😵:

$ uenv start tmlqcd --view=default
$ gcc --version
Illegal instruction (core dumped)

It is a sign that the code was compiled on mi300 and executed on mi250 nodes. Doing the reverse would work.
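
A quick way to check this hypothesis (a sketch; assumes GCC is on the PATH) is to compare the micro-architecture GCC detects for the host CPU on the login node versus on an mi300 compute node:

```shell
# Print the -march value GCC picks for the local CPU; running this on the
# login node and on an mi300 compute node should give different answers
# if the binary/host mismatch hypothesis is right.
arch_line=$(gcc -march=native -Q --help=target 2>/dev/null | grep -m1 -- '-march=' || true)
result="${arch_line:-gcc not available}"
echo "$result"
```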


@chaoos (Contributor, Author) commented Mar 11, 2026

it is a sign that the code was compiled on mi300 and executed on mi250 nodes. Doing the reverse would work.

I see, that makes sense. Apparently I started the uenv on the login node, which has mi200s. I have now started it on a compute node with mi300s and the compiler seems to work. The next problem is that the configure script's C compiler check fails:

configure:2981: $? = 1
configure:3001: checking whether the C compiler works
configure:3023: /user-environment/env/default/bin/mpicc -O3 -fopenmp -mtune=neoverse-v2 -mcpu=neoverse-v2  -fopenmp conftest.c  >&5
gcc: warning: '-mcpu=' is deprecated; use '-mtune=' or '-march=' instead
cc1: error: bad value 'neoverse-v2' for '-mtune=' switch
cc1: note: valid arguments to '-mtune=' switch are: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids emeraldrapids alderlake raptorlake meteorlake graniterapids graniterapids-d arrowlake arrowlake-s lunarlake pantherlake bonnell atom silvermont slm goldmont goldmont-plus tremont gracemont sierraforest grandridge clearwaterforest knl knm intel x86-64 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 lujiazui yongfeng k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 znver4 znver5 btver1 btver2 generic native

It seems to me that through the Spack process and uenv packaging, the compiler uses the Neoverse flags for the GH200 node instead of -march=znver4 -mtune=znver4 for the MI300's CPUs. I have to see where this gets injected.
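
One way to hunt for where the flags come from (a sketch; assumes an MPICH-style mpicc wrapper, Open MPI would need `mpicc --showme` instead of `-show`) is to print the wrapper's underlying command line, then override the flags explicitly for the Zen 4 host cores before re-running configure:

```shell
# Show what the MPI wrapper actually invokes (MPICH-style; guarded in case
# the wrapper is not on the PATH in this shell).
command -v mpicc >/dev/null && mpicc -show
# Explicit override for the MI300A's Zen 4 host cores:
export CFLAGS="-O3 -fopenmp -march=znver4 -mtune=znver4"
export CXXFLAGS="$CFLAGS"
echo "$CFLAGS"
```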

@chaoos (Contributor, Author) commented Mar 11, 2026

cscs-ci run beverin

Add F7T_CLIENT_ID and F7T_CLIENT_SECRET variables for build stage.
@chaoos (Contributor, Author) commented Mar 11, 2026

cscs-ci run beverin

@chaoos (Contributor, Author) commented Mar 11, 2026

cscs-ci run default

@mtaillefumier (Contributor)

cscs-ci run beverin


@mtaillefumier (Contributor)

cscs-ci run beverin


@chaoos (Contributor, Author) commented Mar 11, 2026

Status update: compiling on the mi300 node works, but running the code fails.

Allocate an mi300 node on beverin:

salloc --nodes=1 --time=01:00:00 --partition=mi300 --gpus-per-node=4

Interactive shell on the compute node for compilation:

srun --uenv=tmlqcd --view=default --pty bash

Compile tmlqcd on the compute node against quda in the uenv:

export CFLAGS="-O3 -fopenmp -mtune=znver4 -mcpu=znver4"
export CXXFLAGS="-O3 -fopenmp -mtune=znver4 -mcpu=znver4"
export LDFLAGS="-fopenmp"
export CC="$(which mpicc)"
export CXX="$(which mpicxx)"
mkdir -p install_dir
autoconf
./configure \
  --enable-quda_experimental \
  --enable-mpi \
  --enable-omp \
  --with-mpidimension=4 \
  --enable-alignment=32 \
  --with-qudadir="/user-environment/env/default" \
  --with-limedir="/user-environment/env/default" \
  --with-lemondir="/user-environment/env/default" \
  --with-lapack="-lopenblas -L/user-environment/env/default/lib" \
  --with-hipdir="/user-environment/env/default/lib" \
  --prefix="$(pwd)/install_dir"
make
make install

Run (launched from the login node):

srun --uenv=tmlqcd --view=default -n 4 ./install_dir/bin/hmc_tm -f doc/sample-input/sample-hmc-quda-cscs-beverin.input

fails with:

# QUDA: ERROR: hipStreamCreateWithPriority(&streams[i], hipStreamDefault, greatestPriority) returned out of memory
 (/tmp/anfink/spack-stage/spack-stage-quda-git.feature_prefetch2_1.0.0-git.7857-e4a5b7x7tczfsshsilrkl2hrmqvgqkam/spack-src/lib/targets/hip/device.cpp:116 in create_context())
 (rank 3, host nid002920, quda_api.cpp:60 in void quda::target::hip::set_runtime_error(hipError_t, const char *, const char *, const char *, const char *, bool)())
# QUDA:        last kernel called was (name=,volume=,aux=)
# QUDA:        last tune param used was block=(64,1,1), grid=(1,1,1), shared_bytes=0, shared_carve_out=0, aux=(1,1,1,1)

The mapping of the 4 processes to the 4 GPUs is correct.
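
To rule out all ranks creating their contexts on the same device, a small wrapper of the following kind can enforce (and print) the per-rank binding. The script name and the choice of ROCR_VISIBLE_DEVICES are assumptions for illustration, not something taken from the failing run:

```shell
#!/bin/bash
# gpu_bind.sh (hypothetical): restrict each local MPI rank to a single
# MI300A device, print the binding, then exec the real program.
export ROCR_VISIBLE_DEVICES="${SLURM_LOCALID:-0}"
echo "rank ${SLURM_PROCID:-?} -> GPU ${ROCR_VISIBLE_DEVICES}"
exec "$@"
```

It would be invoked as `srun --uenv=tmlqcd --view=default -n 4 ./gpu_bind.sh ./install_dir/bin/hmc_tm -f ...`; if the out-of-memory error persists with an explicit binding, the rank-to-GPU mapping really is not the culprit.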

@mtaillefumier (Contributor)

cscs-ci run beverin

@chaoos (Contributor, Author) commented Mar 11, 2026

For future reference some information about the mi300:
rocminfo.txt
rocm-smi.txt
lscpu.txt
numactl.txt

@mtaillefumier (Contributor)

@chaoos: The CI/CD is properly set up now. Due to some restrictions on our side, only I can set it up correctly.

@chaoos (Contributor, Author) commented Mar 11, 2026

@chaoos: The CI/CD is properly set up now. Due to some restrictions on our side, only I can set it up correctly.

I see, shall we zoom briefly?

@kostrzewa (Member)

@mtaillefumier could you please let me know which exact QUDA commit was compiled here? I currently can't compile the develop head commit on Lumi-G (with rocm-6.3.4 or rocm-6.4.4) and have to resort to working with the feature/prefetch2 branch which, however, seems to introduce severe performance regressions.

@chaoos (Contributor, Author) commented Mar 11, 2026

@kostrzewa The CI now compiles against the develop branch, which will fail. The test I made above was against the feature/prefetch2 branch, which did compile (hip@6.3.3).

I cannot say anything about performance, since I cannot run the code.

@kostrzewa (Member)

Thanks, I should have seen that above! tmlqcd/quda-prefetch2:2375574724

@mtaillefumier (Contributor)

@kostrzewa: I used the commit used in the CI/CD. I am not sure if that helps you or not.

@mtaillefumier (Contributor)

I still need to test the build locally, which would simplify the work a lot.

@chaoos (Contributor, Author) commented Mar 11, 2026

@mtaillefumier Well, right now the build failed because of the 2-hour time limit. Nevertheless, QUDA will fail to build because of the issues mentioned by @kostrzewa.

@kostrzewa (Member)

See also lattice/quda#1604 (comment) where maybe we'll get a reply from the QUDA team or AMD.

Increases the time limit to 8 hours
@mtaillefumier (Contributor)

cscs-ci run beverin

@kostrzewa (Member)

The very latest QUDA develop commit should compile fine again.
