Conversation

@ekatralis ekatralis commented Dec 10, 2025

Description

This pull request adds ROCm support to xobjects when using ContextCupy(). It includes changes to the headers so that they are compatible with the ROCm definitions. This has been tested in the following configuration:

  • ROCm 6.2.2
  • Python 3.11
  • CuPy 13.6.0 compiled from source with the HIP backend

CuPy can be configured for the HIP backend and built from source as follows:

export ROCM_HOME=/opt/rocm
export HIPCC="$ROCM_HOME/bin/hipcc"
export CXX="$HIPCC"
export PATH="$ROCM_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$ROCM_HOME/lib:$ROCM_HOME/lib64:${LD_LIBRARY_PATH}"

export HCC_AMDGPU_TARGET=gfx906 # for gpu in pcbe15600
export CUPY_INSTALL_USE_HIP=1


pip install --no-cache-dir --force-reinstall "cupy==13.6.0"
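
Once the build completes, one can sanity-check that the installed CuPy actually targets the HIP backend. A minimal check, assuming this CuPy version exposes the cupy.cuda.runtime.is_hip flag:

# Confirm that the freshly built CuPy targets the HIP/ROCm backend.
import cupy

print(cupy.cuda.runtime.is_hip)            # expected: True for a HIP build
print(cupy.cuda.runtime.getDeviceCount())  # number of visible AMD GPUs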

xobjects tests are passing. xtrack tests are passing as well.
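
For reference, usage on a ROCm machine is unchanged with respect to CUDA; a minimal sketch (array contents and buffer size are illustrative):

import numpy as np
import xobjects as xo

ctx = xo.ContextCupy()               # now also works on a HIP-backed CuPy build
buf = ctx.new_buffer(capacity=1024)  # device buffer allocated through CuPy
a_dev = ctx.nparray_to_context_array(np.linspace(0, 1, 1000))  # host -> device
a_host = ctx.nparray_from_context_array(a_dev)                 # device -> host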

Checklist

Mandatory:

  • All the tests are passing, including my new ones
  • I described my changes in this PR description
  • Investigate VRAM not being freed

Optional:

  • The code I wrote follows good style practices (see PEP 8 and PEP 20).
  • I have updated the docs in relation to my changes, if applicable
  • I have also tested GPU contexts -> does not break CUDA compatibility

@ekatralis
Author

TODO: Test the same setup on different PCs with different ROCm versions, ideally with newer GPUs that support ROCm 7, which allows this pre-built wheel to be used, significantly reducing installation complexity:

https://rocm.blogs.amd.com/artificial-intelligence/cupy-v13/README.html

@ekatralis
Author

ekatralis commented Dec 10, 2025

DONE: Add documentation on the procedure to set up ROCm and build CuPy from source in the xsuite docs

@ekatralis
Author

ekatralis commented Dec 11, 2025

BUG: When running pytest, the memory is not being freed up in between tests. A patch might be required for this.

EDIT: This appears to happen on NVIDIA as well
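
One possible workaround, sketched here and not yet tested, would be an autouse pytest fixture that drains the CuPy memory pools after each test:

import cupy
import pytest

@pytest.fixture(autouse=True)
def release_cupy_memory():
    # Run the test, then return all cached device and pinned blocks to the driver.
    yield
    cupy.get_default_memory_pool().free_all_blocks()
    cupy.get_default_pinned_memory_pool().free_all_blocks()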

@ekatralis
Author

Related:
xsuite/xsuite#754

Contributor

@szymonlopaciuk szymonlopaciuk left a comment

This is very good; I don't see why we shouldn't merge it as-is, as it definitely won't disturb the current workflows.

@eltos
Member

eltos commented Jan 15, 2026

Successfully tested this on GSI HPC with AMD MI100 and ROCm version 6.8.5 using the container prepared by @ekatralis here:
https://github.com/ekatralis/xsuite-on-gsi/blob/039347560ef573493c2e98f1b65048f86d6a2bc2/xsuite_amdrocm.def

The lattice is a simple FODO cell (Drift, Multipole, Drift, Multipole), tracking 1e+06 particles over 1000 turns; the experiment was repeated N=5 times for the uncertainty.
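
The benchmark script itself is not included here; a rough sketch of such a FODO tracking benchmark with xtrack could look like the following (element strengths, beam energy and particle coordinates are illustrative):

import numpy as np
import xobjects as xo
import xtrack as xt

ctx = xo.ContextCupy()  # or xo.ContextPyopencl() / xo.ContextCpu()

# Simple FODO cell built from thin multipoles separated by drifts.
line = xt.Line(elements=[
    xt.Drift(length=1.0),
    xt.Multipole(knl=[0.0, 0.1]),
    xt.Drift(length=1.0),
    xt.Multipole(knl=[0.0, -0.1]),
])
line.build_tracker(_context=ctx)

particles = xt.Particles(p0c=1e9,
                         x=np.random.uniform(-1e-3, 1e-3, 1_000_000),
                         _context=ctx)
line.track(particles, num_turns=1000)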

Context   Tracking time
CuPy      (2.338 ± 0.001) s
OpenCL    (0.975 ± 0.002) s
CPU       93 s

Full output:
(xsuite-env) /lustre/hes/pniederm/example.py cupy
/xsuite-env/lib/python3.12/site-packages/cupyx/jit/_interface.py:173: FutureWarning: cupyx.jit.rawkernel is experimental. The interface can change in the future.
  cupy._util.experimental('cupyx.jit.rawkernel')
CuPy: available devices (8):
  0 AMD Instinct MI100
  1 AMD Instinct MI100
  2 AMD Instinct MI100
  3 AMD Instinct MI100
  4 AMD Instinct MI100
  5 AMD Instinct MI100
  6 AMD Instinct MI100
  7 AMD Instinct MI100
Using device: 0
Using context: ContextCupy:0
Setting up a simple FODO lattice...
Setup completed in: 19.60333507298492 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 2.338743943022564 s
Test passed
(xsuite-env) /lustre/hes/pniederm/example.py opencl
OpenCL: available platforms (2):
  0 AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.)
    OpenCL 2.1 AMD-APP (3635.0)
    0.0 GPU: gfx908:sramecc+:xnack-
    0.1 GPU: gfx908:sramecc+:xnack-
    0.2 GPU: gfx908:sramecc+:xnack-
    0.3 GPU: gfx908:sramecc+:xnack-
    0.4 GPU: gfx908:sramecc+:xnack-
    0.5 GPU: gfx908:sramecc+:xnack-
    0.6 GPU: gfx908:sramecc+:xnack-
    0.7 GPU: gfx908:sramecc+:xnack-
  1 Portable Computing Language (The pocl project)
    OpenCL 3.0 PoCL 5.0+debian  Linux, None+Asserts, RELOC, SPIR, LLVM 16.0.6, SLEEF, DISTRO, POCL_DEBUG
    1.0 CPU: cpu-haswell-AMD EPYC 7413 24-Core Processor
Using device: 0.0

/usr/lib/python3/dist-packages/pyopencl/cache.py:495: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more.
  _create_built_program_from_source_cached(

Using context: ContextPyopencl:0.0
Setting up a simple FODO lattice...
Setup completed in: 0.42373332707211375 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 0.9755474431440234 s
Test passed
(xsuite-env) /lustre/hes/pniederm/example.py cpu

Using context: ContextCpu
Setting up a simple FODO lattice...
Setup completed in: 0.37141815200448036 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 93.4150049739983 s

@rdemaria
Collaborator

Very interesting! The speed of CuPy is promising but still painful. Do you have some explanation? Can you do the same exercise, but with NVIDIA?

@ekatralis
Author

> Very interesting! The speed of CuPy is promising but still painful. Do you have some explanation? Can you do the same exercise, but with NVIDIA?

A plausible explanation for these results could be that we are using an older version of ROCm (6.x) and building CuPy from source. On ROCm 7.x, AMD has their own CuPy fork (which is supposed to be merged in v14), which should offer improved performance:

https://rocm.blogs.amd.com/artificial-intelligence/cupy-v13/README.html

I repeated the same test for NVIDIA on a Titan V (TR 2970WX for the CPU), using the same methodology (average over 5 runs):

Method    Time
CPU       100 s
OpenCL    0.524 ± 0.010 s
CuPy      0.495 ± 0.007 s

Full output:
(gpu_cf) ekatrali@pcbe-abp-gpu001:~/GPU_dev/xsuite-on-gsi$ python example.py cpu
/home/ekatrali/anaconda3/envs/gpu_cf/lib/python3.11/site-packages/cupyx/jit/_interface.py:173: FutureWarning: cupyx.jit.rawkernel is experimental. The interface can change in the future.
  cupy._util.experimental('cupyx.jit.rawkernel')

Using context: ContextCpu
Setup completed in: 0.58940225886181 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 100.62413182435557 s
Test passed

(gpu_cf) ekatrali@pcbe-abp-gpu001:~/GPU_dev/xsuite-on-gsi$ python example.py opencl
/home/ekatrali/anaconda3/envs/gpu_cf/lib/python3.11/site-packages/cupyx/jit/_interface.py:173: FutureWarning: cupyx.jit.rawkernel is experimental. The interface can change in the future.
  cupy._util.experimental('cupyx.jit.rawkernel')
OpenCL: available platforms (4):
  0 NVIDIA CUDA (NVIDIA Corporation)
    OpenCL 3.0 CUDA 11.4.557
    0.0 GPU: NVIDIA TITAN V
    0.1 GPU: NVIDIA TITAN V
    0.2 GPU: NVIDIA TITAN V
    0.3 GPU: NVIDIA TITAN V
  1 Intel(R) CPU Runtime for OpenCL(TM) Applications (Intel(R) Corporation)
    OpenCL 2.1 LINUX
    1.0 CPU: AMD Ryzen Threadripper 2970WX 24-Core Processor
  2 Portable Computing Language (The pocl project)
    OpenCL 1.2 pocl 1.4, None+Asserts, LLVM 9.0.1, RELOC, SLEEF, DISTRO, POCL_DEBUG
    2.0 CPU: pthread-AMD Ryzen Threadripper 2970WX 24-Core Processor
  3 Intel(R) CPU Runtime for OpenCL(TM) Applications (Intel(R) Corporation)
    OpenCL 2.1 LINUX
    3.0 CPU: AMD Ryzen Threadripper 2970WX 24-Core Processor
Using device: 0.0 


Using context: ContextPyopencl:0.0
Setup completed in: 0.5435653221793473 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 0.5145912640728056 s
Test passed

(gpu_cf) ekatrali@pcbe-abp-gpu001:~/GPU_dev/xsuite-on-gsi$ python example.py cupy
/home/ekatrali/anaconda3/envs/gpu_cf/lib/python3.11/site-packages/cupyx/jit/_interface.py:173: FutureWarning: cupyx.jit.rawkernel is experimental. The interface can change in the future.
  cupy._util.experimental('cupyx.jit.rawkernel')

Using context: ContextCupy:0
Setup completed in: 0.8388642007485032 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 0.4981880891136825 s
Test passed

For reference, here is the same test on a Radeon VII (TR 1950X CPU) as well:

Method    Time
CPU       144 s
OpenCL    1.795 ± 0.003 s
CuPy      4.631 ± 0.004 s

Full output:
(xsuite-env) python example.py cpu
/xsuite-env/lib/python3.12/site-packages/cupyx/jit/_interface.py:173: FutureWarning: cupyx.jit.rawkernel is experimental. The interface can change in the future.
  cupy._util.experimental('cupyx.jit.rawkernel')

Using context: ContextCpu
Setup completed in: 0.6376619641669095 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 144.56509774830192 s
Test passed

(xsuite-env) python example.py opencl
/xsuite-env/lib/python3.12/site-packages/cupyx/jit/_interface.py:173: FutureWarning: cupyx.jit.rawkernel is experimental. The interface can change in the future.
  cupy._util.experimental('cupyx.jit.rawkernel')
OpenCL: available platforms (2):
  0 AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.)
    OpenCL 2.1 AMD-APP (3649.0)
    0.0 GPU: gfx906:sramecc+:xnack-
  1 Portable Computing Language (The pocl project)
    OpenCL 3.0 PoCL 5.0+debian  Linux, None+Asserts, RELOC, SPIR, LLVM 16.0.6, SLEEF, DISTRO, POCL_DEBUG
    1.0 CPU: cpu-haswell-AMD Ryzen Threadripper 1950X 16-Core Processor
Using device: 0.0 

/usr/lib/python3/dist-packages/pyopencl/cache.py:495: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more.
  _create_built_program_from_source_cached(

Using context: ContextPyopencl:0.0
Setup completed in: 0.6063676928170025 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 1.7923616538755596 s
Test passed

(xsuite-env) python example.py cupy
/xsuite-env/lib/python3.12/site-packages/cupyx/jit/_interface.py:173: FutureWarning: cupyx.jit.rawkernel is experimental. The interface can change in the future.
  cupy._util.experimental('cupyx.jit.rawkernel')

Using context: ContextCupy:0
Setup completed in: 1.3635484268888831 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 4.629076654091477 s
Test passed
