CUDA backend + batched rotations on gpu#107

Merged
sverhoeven merged 500 commits into master from cuda
May 12, 2026

Conversation

@sverhoeven
Collaborator

@sverhoeven sverhoeven commented Apr 16, 2026

This PR adds:

  • a CUDA backend
  • batched rotations in the OpenCL and CUDA backends
  • a progress bar that is disabled by default and can be re-enabled by passing --progressbar (on CPU only)
  • tests rewritten from unittest to pytest
  • a regression test that can be run on GPU

TODO

  • same result as cpu
  • Document installation
  • Document usage
  • Compare cuda vs opencl
  • review/declutter/cleanup llm generated code
  • Add hip, and compare with opencl - pyvkfft does not support hip, need to fall back to cupy.fft
  • Docker image with cuda support
  • check that the locally built wheel works before publishing to PyPI
  • check conda installation
  • port improvements from cuda back to opencl correlator/fft like planned fft, see inline comment
  • check if using cupy.fft is faster than pyvkfft.cuda
  • no regressions, same speed and output
    • single cpu
    • multi cpu
    • amd gpu
    • intel gpu
    • nvidia gpu on cuda
    • nvidia gpu on opencl

See https://github.com/haddocking/powerfit/blob/cuda/docs/performance.md for current speed

Performance on master branch -> cuda branch (search time, best batch size)

  • on AMD 7900 XTX was 13s -> 1.43s
  • on NVIDIA RTX 3050 was 37s -> 5.75s

sverhoeven and others added 30 commits October 7, 2025 11:43
* Add `powerfit_many` function

* removed the need to pass partial results through the filesystem in the multi-processing CPU code. This reduces I/O and makes the code simpler
Add support for gzipped template files
Allow for gzipped target files
The laplace and core-weighted options are now turned on by default.
A user must be explicit to turn them off.
Disabling one or both of them is obviously possible, so does not need extra help text.
Invert --laplace and --core-weighted cli arguments
…re fetches

Replace manual 8-point trilinear interpolation in all four CUDA rotation
kernels (rotate_image3d_linear, rotate_image3d_nearest and their batch
variants) with cudaTextureObject_t / tex3D<float> hardware texture fetches,
mirroring the existing OpenCL cl.Image3D path.

- kernels.cu: use cudaTextureObject_t tex argument; sample via tex3D with
  normalised coordinates (divided by shape dims) and wrap addressing
- cuda.py: add CUDATexture dataclass (holds TextureObject + CUDAarray for
  lifetime management); add make_cuda_texture_linear and
  make_cuda_texture_nearerst factory functions; correlators use them for
  template and mask respectively
- cudakernels.py: accept TextureLike protocol (ptr: int); pass np.uint64
  texture pointer to kernels; expose nearest flag on rotate_image3d
- shared.py: add NvidiaTexture protocol; constrain TypeVar I to
  (np.ndarray, Image, NvidiaTexture)
- tests: add parametrised tests for all four combos of linear/nearest ×
  single/batch asserting identity rotation reproduces the source voxel;
  relax regression row_count_tolerance from 2 to 3 for hardware rounding
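
For readers unfamiliar with the interpolation that the texture fetches replace, here is a minimal NumPy sketch of 8-point trilinear sampling with wrap addressing, followed by the identity-rotation check that the new tests describe. It is an illustration only, not the code from kernels.cu: the names `sample_trilinear_wrap` and `rotate_volume` are hypothetical, and rotating about the origin voxel is an assumption made for brevity.

```python
import numpy as np


def sample_trilinear_wrap(volume: np.ndarray, coords: np.ndarray) -> np.ndarray:
    """Sample `volume` at fractional (z, y, x) voxel coordinates.

    Uses 8-point trilinear interpolation with wrap (periodic) addressing.
    `coords` has shape (..., 3); the result has shape coords.shape[:-1].
    """
    shape = np.array(volume.shape)
    base = np.floor(coords).astype(np.int64)  # lower corner of the enclosing cell
    frac = coords - base                      # fractional offset in [0, 1)
    out = np.zeros(coords.shape[:-1], dtype=np.float64)
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                offset = np.array([dz, dy, dx])
                idx = (base + offset) % shape  # wrap addressing
                # Weight is frac along axes where we step up, (1 - frac) otherwise.
                weight = np.prod(np.where(offset == 1, frac, 1.0 - frac), axis=-1)
                out += weight * volume[idx[..., 0], idx[..., 1], idx[..., 2]]
    return out


def rotate_volume(volume: np.ndarray, rotmat: np.ndarray) -> np.ndarray:
    """Pull every output voxel from its rotated source position (origin-centred, an assumption)."""
    axes = [np.arange(n) for n in volume.shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    src = (grid @ rotmat.T).astype(np.float64)
    return sample_trilinear_wrap(volume, src)


vol = np.arange(60, dtype=np.float32).reshape(4, 3, 5)
identity = rotate_volume(vol, np.eye(3))
print(np.allclose(identity, vol))
```

With the identity matrix every fractional offset is zero, so only the lower-corner voxel gets a non-zero weight and the source volume is reproduced. Hardware texture filtering computes its interpolation weights at reduced precision, which would be consistent with the relaxed regression tolerance mentioned above.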
@sverhoeven
Collaborator Author

sverhoeven commented May 6, 2026

Tested new rotation using cuda textures.

Got

m1-notextcuda-bs100-r1,1,5,4.107,100
m1-notextcuda-bs100-r2,2,5,3.952,100
m1-notextcuda-bs100-r3,3,5,4.02,100
m1-notextcuda-bs100-r4,4,5,4.005,100
m1-notextcuda-bs100-r5,5,5,3.963,100
m1-textcuda-bs100-r1,1,5,3.998,100
m1-textcuda-bs100-r2,2,4,3.873,100
m1-textcuda-bs100-r3,3,5,3.885,100
m1-textcuda-bs100-r4,4,5,3.89,100
m1-textcuda-bs100-r5,5,5,3.892,100
m3-notextcuda-bs100-r1,1,2,0.912,100
m3-notextcuda-bs100-r2,2,2,0.632,100
m3-notextcuda-bs100-r3,3,2,0.619,100
m3-notextcuda-bs100-r4,4,2,0.621,100
m3-notextcuda-bs100-r5,5,2,0.62,100
m3-textcuda-bs100-r1,1,2,0.697,100
m3-textcuda-bs100-r2,2,2,0.619,100
m3-textcuda-bs100-r3,3,2,0.623,100
m3-textcuda-bs100-r4,4,2,0.617,100
m3-textcuda-bs100-r5,5,2,0.623,100

On the ADA 6000 it does not matter, but on the 4050 it is quicker.
The top 38 results are the same as the CPU version, apart from 3 swaps and rounding differences in Fish-z on each row.
The CUDA results now look a lot like the OpenCL results (10 rounding diffs, and rows 211-213 are in a different order).

@sverhoeven sverhoeven requested a review from BSchilperoort May 6, 2026 10:53
@sverhoeven
Collaborator Author

I tried to replace pyvkfft in the build_cuda_ffts_batched function with CUDA FFTs (cupy.). What follows is an LLM caveman summary:

Why pyvkfft faster than cupy.fft and cuFFTDx:

Core reason: VkFFT fuse 3-D FFT → single kernel. cuFFT/cuFFTDx decompose → 3 kernels (one per axis). Each extra kernel = full global-memory roundtrip on batch array.

Attempt 1 — cp.fft with explicit plans:

21,955 kernel launches vs VkFFT's 6
3 launches/FFT × 5 FFTs/batch × 708 batches
Result: 8s vs 5.6s (+43%)
nsys confirmed: kernels alone took 6.8s, memcpy gone after out=dst fix

Attempt 2 — cuFFTDx Block mode with pip install nvmath-python[cu13-dx]:

Same 3-launch decomposition as cuFFT → same ~8s expected
Not worth 150 lines of Numba device code to match cuFFT perf

Why Thread mode (fused) can't work:

Thread mode = entire FFT in registers, single kernel, matches VkFFT
Max practical axis size ~32 (register pressure)
Our axes: 35, 35, 45 → too large

Verdict: pyvkfft stays. No NVIDIA-native Python API does fused batched 3-D FFT for axes >32.
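
To make the "one pass per axis" decomposition in the summary concrete, the NumPy sketch below checks that a 3-D FFT equals three successive 1-D FFTs, one along each axis. It only demonstrates the mathematical equivalence on the CPU; it says nothing about how cuFFT or VkFFT schedule their kernels, and the helper name `fft3d_by_axis` is hypothetical. The array shape is the 45x35x35 case quoted in this thread.

```python
import numpy as np


def fft3d_by_axis(volume: np.ndarray) -> np.ndarray:
    """Compute a 3-D FFT as three separate 1-D passes, one per axis."""
    out = volume.astype(np.complex128)
    for axis in range(3):
        # Each pass reads and writes the whole array once.
        out = np.fft.fft(out, axis=axis)
    return out


rng = np.random.default_rng(0)
volume = rng.standard_normal((45, 35, 35))

fused = np.fft.fftn(volume)         # single call over all three axes
per_axis = fft3d_by_axis(volume)    # three full passes over the data
print(np.allclose(fused, per_axis))
```

Each per-axis pass touches the full array, which is the extra memory traffic the summary attributes to a multi-kernel implementation.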

Used an LLM to implement a batched CPU correlator. What follows is a caveman summary:

Built CPU batched path.
Added separate batched helpers, kept old CPU helpers intact.
Wired batched mode into single CPU and multi CPU paths.
Enabled batch-size on CPU CLI path.
Added CPU max batch guesstimate from host RAM.

Validation

Regression test pass for:
nproc 1, batch-size 100
nproc 6, batch-size 100
serial baseline batch-size 0 also pass
So correctness same. No result drift.

Perf results on real case (70728 rotations, shape 45x35x35)

batch-size 0: 2m34-2m35, fastest
batch-size 1: 2m48
batch-size 2: 2m50
batch-size 5: 2m47
batch-size 100: 3m05
batch-size 5000: 3m23, huge RAM use (~13 GB)

Why slower on CPU

GPU batching win from launch amortization + device parallel kernels.
CPU different: already heavy time inside native FFT kernels.
Batched CPU adds big 4D working sets, more memory traffic, worse cache locality.
Larger batches increase RAM pressure, hurt throughput.
Extra batch reduction work (argmax/max over chunk) adds overhead.
Net: no loop-overhead win big enough to beat memory/throughput penalty.

Profiler supports this

FFT stages still dominant.
rotate_grid3d time almost unchanged.
Added batch reduction costs visible.
Total runtime increases vs serial.

Conclusion

CPU batching is dead end for this workload/hardware.
Best CPU setting: batch-size 0 (serial CPU correlator).
Keep batching for GPU backends only as performance feature.
CPU batch support can stay as functional path, but not recommended for speed
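
The "batch reduction" step mentioned above (argmax/max over a chunk) can be sketched in NumPy as follows. This is a simplified stand-in, not the powerfit correlator: the function names are hypothetical, and the per-rotation score volumes are random numbers rather than real LCC scores. It shows that reducing chunk by chunk gives the same best score and best rotation index as the serial loop, at the cost of holding a 4-D chunk in memory.

```python
import numpy as np


def take_best_serial(scores: np.ndarray):
    """Keep the best score and rotation index per voxel, one rotation at a time."""
    best = np.full(scores.shape[1:], -np.inf)
    best_rot = np.zeros(scores.shape[1:], dtype=np.int64)
    for i, score in enumerate(scores):
        better = score > best
        best[better] = score[better]
        best_rot[better] = i
    return best, best_rot


def take_best_batched(scores: np.ndarray, batch_size: int):
    """Same reduction, but over chunks of `batch_size` rotations (a 4-D working set)."""
    best = np.full(scores.shape[1:], -np.inf)
    best_rot = np.zeros(scores.shape[1:], dtype=np.int64)
    for start in range(0, len(scores), batch_size):
        chunk = scores[start:start + batch_size]
        local = chunk.argmax(axis=0)  # best rotation within this chunk
        chunk_best = np.take_along_axis(chunk, local[None], axis=0)[0]
        better = chunk_best > best
        best[better] = chunk_best[better]
        best_rot[better] = start + local[better]
    return best, best_rot


rng = np.random.default_rng(1)
scores = rng.standard_normal((23, 4, 3, 3))  # 23 rotations of a tiny 4x3x3 volume
serial = take_best_serial(scores)
batched = take_best_batched(scores, batch_size=5)
print(np.array_equal(serial[0], batched[0]), np.array_equal(serial[1], batched[1]))

# Number of chunks for the case quoted above: ceil(70728 / 100).
n_batches = -(-70728 // 100)
print(n_batches)
```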
Collaborator

@BSchilperoort BSchilperoort left a comment

Some small lingering issues. Why wouldn't we remove the serial correlators for GPU altogether? I do understand that it's nice that the OpenCL serial class looks very similar to the CPU implementation, but I'm not sure if they're required seeing as a batch-size of 1 doesn't have a worse performance than 0.

Comment thread src/powerfit_em/correlators/cuda.py Outdated
        self.conj_multiply_kernel(a, b, out)

    def compute_batch_lcc_score_and_take_best(self, batch_start: int, chunk_size: int):
        block = 256
Collaborator

Still this random number here that I commented on in my previous review.

Collaborator Author

Added comment with different blocks tested

Comment thread src/powerfit_em/correlators/opencl.py Outdated


-class OpenCLSerialCorrelator(Correlator):
+class OpenCLSerialCorrelator(SerialCorrelator):
Collaborator

Some comments on remaining duplicated code:

  • OpenCL correlator initializers still share 9 lines of logic (2/3rds of Serial.__init__, about half of Batched.__init__).
  • _set_template_var and _set_mask_var are complete duplicates. Could be inherited with a mix-in class instead...
  • init_vars can be also put in mix-in class. Just do self.batch_size = None in the serial class, and move the empty_lcc_ft=True logic to within the init_correlator_vars function (only set to true if batch-size is not None).

Alternatively we could also just remove the serial correlator here, I do not see what its use is. On my machine batch-size==1 is actually slightly faster than batch-size==0.
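
A minimal sketch of the mix-in idea, under stated assumptions: the class names, the NumPy-backed setters, and the shape of `init_vars` below are hypothetical stand-ins and not the actual powerfit_em classes. Only the names `_set_template_var`, `_set_mask_var`, `init_vars`, `batch_size` and `empty_lcc_ft` come from the comment above.

```python
from typing import Optional

import numpy as np


class SharedVarsMixin:
    """Hypothetical mix-in holding the setters that both correlators duplicate."""

    batch_size: Optional[int]

    def _set_template_var(self, template) -> None:
        self._template = np.asarray(template, dtype=np.float32)

    def _set_mask_var(self, mask) -> None:
        self._mask = np.asarray(mask, dtype=np.float32)

    def init_vars(self) -> None:
        # Only the batched path needs the extra buffer, so derive the flag
        # from batch_size instead of hard-coding it per class.
        self.empty_lcc_ft = self.batch_size is not None


class SerialSketch(SharedVarsMixin):
    def __init__(self, template, mask):
        self.batch_size = None  # serial: no batching
        self._set_template_var(template)
        self._set_mask_var(mask)
        self.init_vars()


class BatchedSketch(SharedVarsMixin):
    def __init__(self, template, mask, batch_size: int):
        self.batch_size = batch_size
        self._set_template_var(template)
        self._set_mask_var(mask)
        self.init_vars()


template = np.ones((2, 2, 2))
mask = np.ones((2, 2, 2))
serial = SerialSketch(template, mask)
batched = BatchedSketch(template, mask, batch_size=100)
print(serial.empty_lcc_ft, batched.empty_lcc_ft)
```

Both classes resolve the setters to the same function objects on the mix-in, so a fix in one place applies to the serial and batched paths alike.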

Collaborator Author

Add batch-size=0 for all machines to performance.md. Also find a combination where it is faster than batch size 1.

Collaborator Author

See #111

Collaborator Author

dedup done

Comment thread docs/performance.md Outdated
Comment thread docs/performance.md
* m2: AMD Ryzen 7 7800X3D and AMD Radeon RX 7900 XTX
* m3: AMD EPYC 9554 and NVIDIA RTX 6000 Ada
* m4: Intel i7-13700H and NVIDIA RTX 4050 Laptop via WSL

Collaborator

Shall I add results from my machine here, or is it not necessary seeing as you already tested the 7900 XTX (even though my GPU is slightly faster at powerfit 😉)?

Collaborator Author

Please do so in a follow-up PR

Comment thread src/powerfit_em/correlators/cuda.py Outdated


-class CUDASerialCorrelator(Correlator):
+class CUDASerialCorrelator(SerialCorrelator):
Collaborator

Similar duplication issue as with the OpenCL classes here.

Collaborator Author

done

@sverhoeven sverhoeven merged commit 9a11eed into master May 12, 2026
11 checks passed
@sverhoeven sverhoeven mentioned this pull request May 12, 2026

4 participants