Conversation
|
cscs-ci run beverin |
|
This currently fails due to missing access to the beverin test system at CSCS. |
|
A manual build using quda branch
I used this command on beverin:
uenv build .ci/uenv-recipes/tmlqcd/beverin-mi300 tmlqcd/quda-prefetch2@beverin%mi300
with this spack spec for quda;
The image is available in the service namespace of CSCS's uenv registry:
$ uenv image find service::
uenv                              arch   system   id                size(MB)  date
tmlqcd/quda-prefetch2:2375574724  mi300  beverin  8f0acefe49988d34  3,857     2026-03-10
But the gcc compiler in the uenv is broken 😵: |
|
I learned that the mi300 cluster uses a different authentication mechanism, which is why it is failing. |
|
cscs-ci run beverin |
it is a sign that the code was compiled on mi300 and executed on mi250 nodes. Doing the reverse would work. |
I see, that makes sense. Apparently, I started the uenv on the login node, which has mi200s. I have now started it on a compute node with mi300s and the compiler seems to work. The next problem is that cmake's C compiler test fails: It seems to me that through the spack process and uenv packaging the compiler uses the neoverse flags for the GH200 node instead of |
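The mismatch discussed here (a binary built for one GPU generation and started on another) can be checked before launching anything. The following is only a minimal sketch: it assumes `rocminfo` is on the PATH inside the uenv, and the helper names `gpu_family_for_arch` and `detect_gpu_arch` are hypothetical. It relies on the gfx targets gfx90a (MI200 series) and gfx942 (MI300 series).

```shell
#!/bin/sh
# Hypothetical helper: map an AMD gfx target name to a GPU family.
gpu_family_for_arch() {
  case "$1" in
    gfx90a) echo "mi200" ;;   # MI210 / MI250 / MI250X
    gfx942) echo "mi300" ;;   # MI300A / MI300X
    *)      echo "unknown" ;;
  esac
}

# Hypothetical helper: first gfx target reported by rocminfo on this node.
detect_gpu_arch() {
  rocminfo 2>/dev/null | grep -oE -m1 'gfx[0-9a-f]+' || true
}

# Only query the hardware when rocminfo is actually available.
if command -v rocminfo >/dev/null 2>&1; then
  arch="$(detect_gpu_arch)"
  echo "node GPU arch: ${arch:-none} ($(gpu_family_for_arch "${arch}"))"
fi
```

Running this once on the login node and once inside an `srun` step on the compute node would show whether the two report different families.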
|
cscs-ci run beverin |
Add F7T_CLIENT_ID and F7T_CLIENT_SECRET variables for build stage.
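A build stage that depends on these two variables could fail early with a clear message when one of them is missing. This is a sketch only: the helper name `require_vars` is hypothetical and not part of the CI setup in this PR.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every named variable is set and non-empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

A build script could then call `require_vars F7T_CLIENT_ID F7T_CLIENT_SECRET || exit 1` before doing any work.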
|
cscs-ci run beverin |
|
cscs-ci run default |
|
cscs-ci run beverin |
fix typo
|
cscs-ci run beverin |
|
Status update: compiling on the mi300 node works, but running the code fails.
Allocate an mi300 node on beverin:
salloc --nodes=1 --time=01:00:00 --partition=mi300 --gpus-per-node=4
Interactive shell on the compute node for compilation:
srun --uenv=tmlqcd --view=default --pty bash
Compile tmlqcd on the compute node against quda in the uenv:
export CFLAGS="-O3 -fopenmp -mtune=znver4 -mcpu=znver4"
export CXXFLAGS="-O3 -fopenmp -mtune=znver4 -mcpu=znver4"
export LDFLAGS="-fopenmp"
export CC="$(which mpicc)"
export CXX="$(which mpicxx)"
mkdir -p install_dir
autoconf
./configure \
--enable-quda_experimental \
--enable-mpi \
--enable-omp \
--with-mpidimension=4 \
--enable-alignment=32 \
--with-qudadir="/user-environment/env/default" \
--with-limedir="/user-environment/env/default" \
--with-lemondir="/user-environment/env/default" \
--with-lapack="-lopenblas -L/user-environment/env/default/lib" \
--with-hipdir="/user-environment/env/default/lib" \
--prefix="$(pwd)/install_dir"
make
make install
Run on the login node:
srun --uenv=tmlqcd --view=default -n 4 ./install_dir/bin/hmc_tm -f doc/sample-input/sample-hmc-quda-cscs-beverin.input
fails with:
The mapping of the 4 processes to the 4 GPUs is correct. |
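The process-to-GPU mapping mentioned above can be made visible with a small wrapper run under `srun`. This is a sketch under assumptions, not how tmLQCD or QUDA select devices: the helper `gpu_for_local_rank` is hypothetical, it assumes a simple round-robin assignment, and it reads only the standard Slurm variables `SLURM_PROCID` and `SLURM_LOCALID`.

```shell
#!/bin/sh
# Hypothetical helper: GPU index for a node-local rank, assigned round-robin.
gpu_for_local_rank() {
  local_rank="$1"
  gpus_per_node="$2"
  echo $(( local_rank % gpus_per_node ))
}

# Inside a Slurm job step, report which GPU this rank would be paired with.
if [ -n "${SLURM_LOCALID:-}" ]; then
  gpu="$(gpu_for_local_rank "${SLURM_LOCALID}" 4)"
  echo "rank ${SLURM_PROCID:-?} on $(hostname): local rank ${SLURM_LOCALID} -> GPU ${gpu}"
fi
```

Saved as an executable script (the file name is up to you), it could be started with the same `srun --uenv=tmlqcd --view=default -n 4` prefix as the failing run to print one line per rank.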
|
cscs-ci run beverin |
|
For future reference, some information about the mi300: |
|
@chaoos: The CI/CD is now set up properly. Only I can set it up correctly, due to some restrictions on our side. |
I see, shall we zoom briefly? |
|
@mtaillefumier could you please let me know which exact QUDA commit was compiled here? I currently can't compile the develop head commit on Lumi-G (with rocm-6.3.4 or rocm-6.4.4) and have to resort to working with the feature/prefetch2 branch which, however, seems to introduce severe performance regressions. |
|
@kostrzewa The CI now compiles against the I cannot say anything about performance since I cannot run. |
|
Thanks, I should have seen that above! |
|
@kostrzewa: I use the commit used in the CI/CD. I am not sure whether it helps you or not. |
|
I still need to test the build locally, which would simplify the work a lot. |
|
@mtaillefumier Well, right now the build failed because of the time limit of 2h. Nevertheless, QUDA will fail to build because of the issues mentioned by @kostrzewa. |
|
See also lattice/quda#1604 (comment) where maybe we'll get a reply from the QUDA team or AMD. |
Increases the time limit to 8 hours
|
cscs-ci run beverin |
|
The very latest QUDA develop commit should compile fine again. |
This PR adds requirements on the side of tmLQCD for CI/CD testing on the CSCS test system "beverin". This system hosts AMD MI300A GPUs.