Conversation

@ekatralis ekatralis commented Dec 10, 2025

Description

This pull request adds ROCm support to xobjects when using ContextCupy(). It includes changes to the headers so that they are compatible with the ROCm definitions. This has been tested in the following configuration:

  • ROCm 6.2.2
  • Python 3.11
  • CuPy 13.6.0 compiled from source with the HIP backend

CuPy can be configured for the HIP backend and built from source as follows:

export ROCM_HOME=/opt/rocm
export HIPCC="$ROCM_HOME/bin/hipcc"
export CXX="$HIPCC"
export PATH="$ROCM_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$ROCM_HOME/lib:$ROCM_HOME/lib64:${LD_LIBRARY_PATH}"

export HCC_AMDGPU_TARGET=gfx906 # for gpu in pcbe15600
export CUPY_INSTALL_USE_HIP=1


pip install --no-cache-dir --force-reinstall "cupy==13.6.0"
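
Once the build completes, one can sanity-check that the installed CuPy actually targets the HIP backend. A minimal check, assuming this CuPy version exposes the cupy.cuda.runtime.is_hip flag:

# Confirm that the freshly built CuPy targets the HIP/ROCm backend.
import cupy

print(cupy.cuda.runtime.is_hip)            # expected: True for a HIP build
print(cupy.cuda.runtime.getDeviceCount())  # number of visible AMD GPUs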

xobjects tests are passing. xtrack tests are passing as well.
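
For reference, usage on a ROCm machine is unchanged with respect to CUDA; a minimal sketch (array contents and buffer size are illustrative):

import numpy as np
import xobjects as xo

ctx = xo.ContextCupy()               # now also works on a HIP-backed CuPy build
buf = ctx.new_buffer(capacity=1024)  # device buffer allocated through CuPy
a_dev = ctx.nparray_to_context_array(np.linspace(0, 1, 1000))  # host -> device
a_host = ctx.nparray_from_context_array(a_dev)                 # device -> host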

Checklist

Mandatory:

  • All the tests are passing, including my new ones
  • I described my changes in this PR description
  • Investigate VRAM not being freed

Optional:

  • The code I wrote follows good style practices (see PEP 8 and PEP 20).
  • I have updated the docs in relation to my changes, if applicable
  • I have also tested GPU contexts -> does not break CUDA compatibility

@ekatralis
Author

TODO: Test the same setup on different PCs with different ROCm versions, ideally with newer GPUs that support ROCm 7, which allows this pre-built wheel to be used, significantly reducing installation complexity:

https://rocm.blogs.amd.com/artificial-intelligence/cupy-v13/README.html

@ekatralis
Author

ekatralis commented Dec 10, 2025

DONE: Add documentation on the procedure to set up ROCm and build CuPy from source in the xsuite docs

@ekatralis
Author

ekatralis commented Dec 11, 2025

BUG: When running pytest, the memory is not being freed up in between tests. A patch might be required for this.

EDIT: This appears to happen on NVIDIA as well
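
One possible workaround, sketched here and not yet tested, would be an autouse pytest fixture that drains the CuPy memory pools after each test:

import cupy
import pytest

@pytest.fixture(autouse=True)
def release_cupy_memory():
    # Run the test, then return all cached device and pinned blocks to the driver.
    yield
    cupy.get_default_memory_pool().free_all_blocks()
    cupy.get_default_pinned_memory_pool().free_all_blocks()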

@ekatralis
Author

Related:
xsuite/xsuite#754

Contributor

@szymonlopaciuk szymonlopaciuk left a comment

This is very good; I don't see why we shouldn't merge it as-is, as it definitely won't disturb the current workflows.

@eltos
Member

eltos commented Jan 15, 2026

Successfully tested this on GSI HPC with AMD MI100 and ROCm version 6.8.5 using the container prepared by @ekatralis here:
https://github.com/ekatralis/xsuite-on-gsi/blob/039347560ef573493c2e98f1b65048f86d6a2bc2/xsuite_amdrocm.def

The lattice is a simple FODO cell (Drift, Multipole, Drift, Multipole), tracking 1e+06 particles over 1000 turns; the experiment was repeated N=5 times for the uncertainty.
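
The benchmark script itself is not included here; a rough sketch of such a FODO tracking benchmark with xtrack could look like the following (element strengths, beam energy and particle coordinates are illustrative):

import numpy as np
import xobjects as xo
import xtrack as xt

ctx = xo.ContextCupy()  # or xo.ContextPyopencl() / xo.ContextCpu()

# Simple FODO cell built from thin multipoles separated by drifts.
line = xt.Line(elements=[
    xt.Drift(length=1.0),
    xt.Multipole(knl=[0.0, 0.1]),
    xt.Drift(length=1.0),
    xt.Multipole(knl=[0.0, -0.1]),
])
line.build_tracker(_context=ctx)

particles = xt.Particles(p0c=1e9,
                         x=np.random.uniform(-1e-3, 1e-3, 1_000_000),
                         _context=ctx)
line.track(particles, num_turns=1000)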

Context   Tracking time
CuPy      (2.338 ± 0.001) s
OpenCL    (0.975 ± 0.002) s
CPU       93 s

Full output:
(xsuite-env) /lustre/hes/pniederm/example.py cupy
/xsuite-env/lib/python3.12/site-packages/cupyx/jit/_interface.py:173: FutureWarning: cupyx.jit.rawkernel is experimental. The interface can change in the future.
  cupy._util.experimental('cupyx.jit.rawkernel')
CuPy: available devices (8):
  0 AMD Instinct MI100
  1 AMD Instinct MI100
  2 AMD Instinct MI100
  3 AMD Instinct MI100
  4 AMD Instinct MI100
  5 AMD Instinct MI100
  6 AMD Instinct MI100
  7 AMD Instinct MI100
Using device: 0
Using context: ContextCupy:0
Setting up a simple FODO lattice...
Setup completed in: 19.60333507298492 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 2.338743943022564 s
Test passed
(xsuite-env) /lustre/hes/pniederm/example.py opencl
OpenCL: available platforms (2):
  0 AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.)
    OpenCL 2.1 AMD-APP (3635.0)
    0.0 GPU: gfx908:sramecc+:xnack-
    0.1 GPU: gfx908:sramecc+:xnack-
    0.2 GPU: gfx908:sramecc+:xnack-
    0.3 GPU: gfx908:sramecc+:xnack-
    0.4 GPU: gfx908:sramecc+:xnack-
    0.5 GPU: gfx908:sramecc+:xnack-
    0.6 GPU: gfx908:sramecc+:xnack-
    0.7 GPU: gfx908:sramecc+:xnack-
  1 Portable Computing Language (The pocl project)
    OpenCL 3.0 PoCL 5.0+debian  Linux, None+Asserts, RELOC, SPIR, LLVM 16.0.6, SLEEF, DISTRO, POCL_DEBUG
    1.0 CPU: cpu-haswell-AMD EPYC 7413 24-Core Processor
Using device: 0.0

/usr/lib/python3/dist-packages/pyopencl/cache.py:495: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more.
  _create_built_program_from_source_cached(

Using context: ContextPyopencl:0.0
Setting up a simple FODO lattice...
Setup completed in: 0.42373332707211375 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 0.9755474431440234 s
Test passed
(xsuite-env) /lustre/hes/pniederm/example.py cpu

Using context: ContextCpu
Setting up a simple FODO lattice...
Setup completed in: 0.37141815200448036 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 93.4150049739983 s

@rdemaria
Collaborator

Very interesting! The speed of CuPy is promising but still painful. Do you have some explanation? Can you do the same exercise, but with NVIDIA?

@ekatralis
Author

> Very interesting! The speed of CuPy is promising but still painful. Do you have some explanation? Can you do the same exercise, but with NVIDIA?

A plausible explanation for these results could be that we are using an older version of ROCm (6.x) and building CuPy from source. On ROCm 7.x, AMD has their own CuPy fork (which is supposed to be merged in v14), which should offer improved performance:

https://rocm.blogs.amd.com/artificial-intelligence/cupy-v13/README.html

I repeated the same test for NVIDIA on a Titan V (TR 2970WX for the CPU), using the same methodology (average over 5 runs):

Method    Time
CPU       100 s
OpenCL    0.524 ± 0.010 s
CuPy      0.495 ± 0.007 s

Full output:
(gpu_cf) ekatrali@pcbe-abp-gpu001:~/GPU_dev/xsuite-on-gsi$ python example.py cpu
/home/ekatrali/anaconda3/envs/gpu_cf/lib/python3.11/site-packages/cupyx/jit/_interface.py:173: FutureWarning: cupyx.jit.rawkernel is experimental. The interface can change in the future.
  cupy._util.experimental('cupyx.jit.rawkernel')

Using context: ContextCpu
Setup completed in: 0.58940225886181 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 100.62413182435557 s
Test passed

(gpu_cf) ekatrali@pcbe-abp-gpu001:~/GPU_dev/xsuite-on-gsi$ python example.py opencl
/home/ekatrali/anaconda3/envs/gpu_cf/lib/python3.11/site-packages/cupyx/jit/_interface.py:173: FutureWarning: cupyx.jit.rawkernel is experimental. The interface can change in the future.
  cupy._util.experimental('cupyx.jit.rawkernel')
OpenCL: available platforms (4):
  0 NVIDIA CUDA (NVIDIA Corporation)
    OpenCL 3.0 CUDA 11.4.557
    0.0 GPU: NVIDIA TITAN V
    0.1 GPU: NVIDIA TITAN V
    0.2 GPU: NVIDIA TITAN V
    0.3 GPU: NVIDIA TITAN V
  1 Intel(R) CPU Runtime for OpenCL(TM) Applications (Intel(R) Corporation)
    OpenCL 2.1 LINUX
    1.0 CPU: AMD Ryzen Threadripper 2970WX 24-Core Processor
  2 Portable Computing Language (The pocl project)
    OpenCL 1.2 pocl 1.4, None+Asserts, LLVM 9.0.1, RELOC, SLEEF, DISTRO, POCL_DEBUG
    2.0 CPU: pthread-AMD Ryzen Threadripper 2970WX 24-Core Processor
  3 Intel(R) CPU Runtime for OpenCL(TM) Applications (Intel(R) Corporation)
    OpenCL 2.1 LINUX
    3.0 CPU: AMD Ryzen Threadripper 2970WX 24-Core Processor
Using device: 0.0 


Using context: ContextPyopencl:0.0
Setup completed in: 0.5435653221793473 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 0.5145912640728056 s
Test passed

(gpu_cf) ekatrali@pcbe-abp-gpu001:~/GPU_dev/xsuite-on-gsi$ python example.py cupy
/home/ekatrali/anaconda3/envs/gpu_cf/lib/python3.11/site-packages/cupyx/jit/_interface.py:173: FutureWarning: cupyx.jit.rawkernel is experimental. The interface can change in the future.
  cupy._util.experimental('cupyx.jit.rawkernel')

Using context: ContextCupy:0
Setup completed in: 0.8388642007485032 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 0.4981880891136825 s
Test passed

For reference, here is the same test on a Radeon VII (TR 1950X CPU) as well:

Method    Time
CPU       144 s
OpenCL    1.795 ± 0.003 s
CuPy      4.631 ± 0.004 s

Full output:
(xsuite-env) python example.py cpu
/xsuite-env/lib/python3.12/site-packages/cupyx/jit/_interface.py:173: FutureWarning: cupyx.jit.rawkernel is experimental. The interface can change in the future.
  cupy._util.experimental('cupyx.jit.rawkernel')

Using context: ContextCpu
Setup completed in: 0.6376619641669095 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 144.56509774830192 s
Test passed

(xsuite-env) python example.py opencl
/xsuite-env/lib/python3.12/site-packages/cupyx/jit/_interface.py:173: FutureWarning: cupyx.jit.rawkernel is experimental. The interface can change in the future.
  cupy._util.experimental('cupyx.jit.rawkernel')
OpenCL: available platforms (2):
  0 AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.)
    OpenCL 2.1 AMD-APP (3649.0)
    0.0 GPU: gfx906:sramecc+:xnack-
  1 Portable Computing Language (The pocl project)
    OpenCL 3.0 PoCL 5.0+debian  Linux, None+Asserts, RELOC, SPIR, LLVM 16.0.6, SLEEF, DISTRO, POCL_DEBUG
    1.0 CPU: cpu-haswell-AMD Ryzen Threadripper 1950X 16-Core Processor
Using device: 0.0 

/usr/lib/python3/dist-packages/pyopencl/cache.py:495: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more.
  _create_built_program_from_source_cached(

Using context: ContextPyopencl:0.0
Setup completed in: 0.6063676928170025 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 1.7923616538755596 s
Test passed

(xsuite-env) python example.py cupy
/xsuite-env/lib/python3.12/site-packages/cupyx/jit/_interface.py:173: FutureWarning: cupyx.jit.rawkernel is experimental. The interface can change in the future.
  cupy._util.experimental('cupyx.jit.rawkernel')

Using context: ContextCupy:0
Setup completed in: 1.3635484268888831 s
Tracking 1e+06 particles over 1000 turns...
Tracking completed in: 4.629076654091477 s
Test passed
