CUDA backend + batched rotations on gpu by sverhoeven · Pull Request #107 · haddocking/powerfit

sverhoeven · 2026-04-16T13:16:37Z

This PR adds:

CUDA backend
batched rotations in the OpenCL and CUDA backends
progress bar disabled by default; it can be re-enabled by passing --progressbar (CPU only)
tests rewritten from unittest to pytest
a regression test that can also be run on a GPU
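
The opt-in flag behaviour can be sketched with `argparse`. This is a hypothetical stand-alone parser, not powerfit's actual CLI code: only the flag name `--progressbar` comes from the description above, while `build_parser` and the program name are made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser for illustration; not powerfit's real CLI definition.
    parser = argparse.ArgumentParser(prog="powerfit-sketch")
    # store_true means the option is False unless the flag is passed,
    # which matches "disabled by default, re-enabled by --progressbar".
    parser.add_argument(
        "--progressbar",
        action="store_true",
        help="Show a progress bar (CPU only).",
    )
    return parser


default_args = build_parser().parse_args([])
enabled_args = build_parser().parse_args(["--progressbar"])
print(default_args.progressbar, enabled_args.progressbar)
```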

TODO

See https://github.com/haddocking/powerfit/blob/cuda/docs/performance.md for current speed

Performance on master branch -> cuda branch (search time, best batch size)

on AMD 7900 XTX was 13s -> 1.43s
on NVIDIA RTX 3050 was 37s -> 5.75s


…re fetches

Replace manual 8-point trilinear interpolation in all four CUDA rotation kernels (rotate_image3d_linear, rotate_image3d_nearest and their batch variants) with cudaTextureObject_t / tex3D<float> hardware texture fetches, mirroring the existing OpenCL cl.Image3D path.

- kernels.cu: use cudaTextureObject_t tex argument; sample via tex3D with normalised coordinates (divided by shape dims) and wrap addressing
- cuda.py: add CUDATexture dataclass (holds TextureObject + CUDAarray for lifetime management); add make_cuda_texture_linear and make_cuda_texture_nearerst factory functions; correlators use them for template and mask respectively
- cudakernels.py: accept TextureLike protocol (ptr: int); pass np.uint64 texture pointer to kernels; expose nearest flag on rotate_image3d
- shared.py: add NvidiaTexture protocol; constrain TypeVar I to (np.ndarray, Image, NvidiaTexture)
- tests: add parametrised tests for all four combos of linear/nearest × single/batch asserting identity rotation reproduces the source voxel; relax regression row_count_tolerance from 2 to 3 for hardware rounding
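
For reference, the manual 8-point trilinear interpolation with wrap addressing that the texture fetch replaces can be sketched in pure Python. This is an illustration of the technique only, not the kernel code from this PR: `sample_trilinear_wrap` and the nested-list volume layout `volume[z][y][x]` are assumptions, and the sketch ignores the texel-centre offset and the reduced-precision weights that hardware filtering typically uses.

```python
import math


def sample_trilinear_wrap(volume, z, y, x):
    """Manual 8-point trilinear interpolation with wrap-around addressing.

    Hypothetical helper for illustration; `volume` is indexed as volume[z][y][x].
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    z0, y0, x0 = math.floor(z), math.floor(y), math.floor(x)
    dz, dy, dx = z - z0, y - y0, x - x0
    value = 0.0
    # Visit the 8 corners of the cell that contains (z, y, x).
    for oz, wz in ((0, 1.0 - dz), (1, dz)):
        for oy, wy in ((0, 1.0 - dy), (1, dy)):
            for ox, wx in ((0, 1.0 - dx), (1, dx)):
                # Python's % is non-negative for a positive modulus, so
                # negative and out-of-range indices wrap around the volume.
                voxel = volume[(z0 + oz) % nz][(y0 + oy) % ny][(x0 + ox) % nx]
                value += wz * wy * wx * voxel
    return value


# 2x2x2 test volume whose voxel value encodes its own index.
volume = [[[float(4 * z + 2 * y + x) for x in range(2)] for y in range(2)]
          for z in range(2)]
print(sample_trilinear_wrap(volume, 1, 0, 1))        # integer coords: source voxel
print(sample_trilinear_wrap(volume, 0.5, 0.5, 0.5))  # cell centre: mean of 8 voxels
```

Sampling at integer coordinates returns the source voxel unchanged, which is the property the identity-rotation tests listed above assert for the texture path.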

sverhoeven · 2026-05-06T09:10:13Z

Tested new rotation using cuda textures.

Got

m1-notextcuda-bs100-r1,1,5,4.107,100
m1-notextcuda-bs100-r2,2,5,3.952,100
m1-notextcuda-bs100-r3,3,5,4.02,100
m1-notextcuda-bs100-r4,4,5,4.005,100
m1-notextcuda-bs100-r5,5,5,3.963,100
m1-textcuda-bs100-r1,1,5,3.998,100
m1-textcuda-bs100-r2,2,4,3.873,100
m1-textcuda-bs100-r3,3,5,3.885,100
m1-textcuda-bs100-r4,4,5,3.89,100
m1-textcuda-bs100-r5,5,5,3.892,100
m3-notextcuda-bs100-r1,1,2,0.912,100
m3-notextcuda-bs100-r2,2,2,0.632,100
m3-notextcuda-bs100-r3,3,2,0.619,100
m3-notextcuda-bs100-r4,4,2,0.621,100
m3-notextcuda-bs100-r5,5,2,0.62,100
m3-textcuda-bs100-r1,1,2,0.697,100
m3-textcuda-bs100-r2,2,2,0.619,100
m3-textcuda-bs100-r3,3,2,0.623,100
m3-textcuda-bs100-r4,4,2,0.617,100
m3-textcuda-bs100-r5,5,2,0.623,100

On the ADA 6000 it does not matter, but on the 4050 it is quicker.
The top 38 results are the same as the CPU version, apart from 3 swaps and rounding differences in Fish-z on each row.
The CUDA results now look a lot like the OpenCL results (10 rounding diffs and rows 211-213 in a different order).
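
A small sketch for summarising rows like the ones above. The listing has no header row, so this assumes the 4th column is the search time in seconds and that the label follows a `<machine>-<variant>-bs<batch>-r<run>` pattern; `mean_time_by_variant` is a hypothetical helper and only the m1 rows are copied in.

```python
from statistics import mean

# m1 rows copied from the results above; the 4th column is ASSUMED to be
# the search time in seconds (the listing has no header row).
ROWS = """\
m1-notextcuda-bs100-r1,1,5,4.107,100
m1-notextcuda-bs100-r2,2,5,3.952,100
m1-notextcuda-bs100-r3,3,5,4.02,100
m1-notextcuda-bs100-r4,4,5,4.005,100
m1-notextcuda-bs100-r5,5,5,3.963,100
m1-textcuda-bs100-r1,1,5,3.998,100
m1-textcuda-bs100-r2,2,4,3.873,100
m1-textcuda-bs100-r3,3,5,3.885,100
m1-textcuda-bs100-r4,4,5,3.89,100
m1-textcuda-bs100-r5,5,5,3.892,100
"""


def mean_time_by_variant(rows):
    """Group rows by (machine, variant) and average the assumed time column."""
    times = {}
    for line in rows.strip().splitlines():
        fields = line.split(",")
        # Label is assumed to look like "<machine>-<variant>-bs<batch>-r<run>".
        machine, variant = fields[0].split("-")[:2]
        times.setdefault((machine, variant), []).append(float(fields[3]))
    return {key: mean(values) for key, values in times.items()}


means = mean_time_by_variant(ROWS)
for key, value in sorted(means.items()):
    print(key, round(value, 4))
```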


sverhoeven · 2026-05-08T07:29:36Z

I tried to replace pyvkfft in the build_cuda_ffts_batched function with CUDA FFTs (cupy.); what follows is an LLM caveman summary:

Why pyvkfft faster than cupy.fft and cuFFTDx:

Core reason: VkFFT fuse 3-D FFT → single kernel. cuFFT/cuFFTDx decompose → 3 kernels (one per axis). Each extra kernel = full global-memory roundtrip on batch array.

Attempt 1 — cp.fft with explicit plans:

21,955 kernel launches vs VkFFT's 6
3 launches/FFT × 5 FFTs/batch × 708 batches
Result: 8s vs 5.6s (+43%)
nsys confirmed: kernels alone took 6.8s, memcpy gone after out=dst fix
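
The batch count and the relative slowdown quoted above can be checked with a few lines of arithmetic. This assumes the run used the 70728-rotation case mentioned elsewhere in this thread with a batch size of 100; the variable names are illustrative.

```python
import math

# ASSUMPTION: same real case as quoted elsewhere in this thread.
n_rotations = 70728
batch_size = 100

# The last batch is only partially filled, hence the ceiling.
n_batches = math.ceil(n_rotations / batch_size)

# Relative slowdown of the cp.fft attempt versus pyvkfft (8s vs 5.6s).
cupy_seconds, vkfft_seconds = 8.0, 5.6
slowdown_percent = round((cupy_seconds / vkfft_seconds - 1.0) * 100)

print(n_batches, slowdown_percent)
```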

Attempt 2 — cuFFTDx Block mode with pip install nvmath-python[cu13-dx]:

Same 3-launch decomposition as cuFFT → same ~8s expected
Not worth 150 lines of Numba device code to match cuFFT perf

Why Thread mode (fused) can't work:

Thread mode = entire FFT in registers, single kernel, matches VkFFT
Max practical axis size ~32 (register pressure)
Our axes: 35, 35, 45 → too large

Verdict: pyvkfft stays. No NVIDIA-native Python API does fused batched 3-D FFT for axes >32.

Used llm to implement batched cpu correlator, following is a caveman summary:

Built CPU batched path.

Added separate batched helpers, kept old CPU helpers intact.
Wired batched mode into single CPU and multi CPU paths.
Enabled batch-size on CPU CLI path.
Added CPU max batch guesstimate from host RAM.

Validation

Regression test pass for:

nproc 1, batch-size 100
nproc 6, batch-size 100
serial baseline batch-size 0 also pass

So correctness same. No result drift.

Perf results on real case (70728 rotations, shape 45x35x35)

batch-size 0: 2m34-2m35, fastest
batch-size 1: 2m48
batch-size 2: 2m50
batch-size 5: 2m47
batch-size 100: 3m05
batch-size 5000: 3m23, huge RAM use (~13 GB)

Why slower on CPU

GPU batching win from launch amortization + device parallel kernels.
CPU different: already heavy time inside native FFT kernels.
Batched CPU adds big 4D working sets, more memory traffic, worse cache locality.
Larger batches increase RAM pressure, hurt throughput.
Extra batch reduction work (argmax/max over chunk) adds overhead.
Net: no loop-overhead win big enough to beat memory/throughput penalty.

Profiler supports this

FFT stages still dominant.
rotate_grid3d time almost unchanged.
Added batch reduction costs visible.
Total runtime increases vs serial.

Conclusion

CPU batching is dead end for this workload/hardware.
Best CPU setting: batch-size 0 (serial CPU correlator).
Keep batching for GPU backends only as performance feature.
CPU batch support can stay as functional path, but not recommended for speed
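
A rough sketch of why large CPU batches get expensive in RAM, and of how a maximum batch size might be guessed from host memory. Both helpers are hypothetical and are not the functions added in this PR; the item size (8 bytes), the number of working arrays (6) and the usable fraction of RAM (0.5) are assumptions chosen for illustration.

```python
from math import prod


def batch_array_bytes(batch_size, shape, itemsize):
    """Bytes needed for one 4D working array of batch_size volumes."""
    return batch_size * prod(shape) * itemsize


def guess_max_batch(available_bytes, shape, itemsize, n_arrays, fraction=0.5):
    """Hypothetical guess: largest batch whose arrays fit in a RAM fraction."""
    per_rotation = prod(shape) * itemsize * n_arrays
    return max(1, int(available_bytes * fraction) // per_rotation)


shape = (45, 35, 35)  # grid shape quoted in the summary above

# One 8-byte-per-voxel array at batch-size 5000 (itemsize is an ASSUMPTION).
one_array = batch_array_bytes(5000, shape, itemsize=8)
print(one_array / 1e9)  # size in GB

# ASSUMED: 16 GiB host RAM, 6 working arrays, half of RAM usable.
print(guess_max_batch(16 * 1024**3, shape, itemsize=8, n_arrays=6))
```

Under these assumptions a single working array at batch-size 5000 already takes about 2.2 GB, so a handful of such arrays is enough to reach the order of magnitude of the RAM use reported above.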

BSchilperoort

Some small lingering issues. Why wouldn't we remove the serial correlators for GPU altogether? I do understand that it's nice that the OpenCL serial class looks very similar to the CPU implementation, but I'm not sure if they're required seeing as a batch-size of 1 doesn't have a worse performance than 0.

BSchilperoort · 2026-05-11T13:21:44Z

+        self.conj_multiply_kernel(a, b, out)
+
+    def compute_batch_lcc_score_and_take_best(self, batch_start: int, chunk_size: int):
        block = 256


Still this random number here that I commented on in my previous review.

Added comment with different blocks tested
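
For context on what the block size controls, the usual ceil-division launch calculation can be sketched as follows. How the actual kernel maps its work onto blocks is not shown in this thread, so `grid_size` is a generic, hypothetical helper, and using one 45x35x35 volume as the element count is only an example.

```python
def grid_size(n_elements: int, block: int = 256) -> int:
    """Number of blocks needed so that grid * block covers n_elements."""
    if block <= 0:
        raise ValueError("block must be positive")
    # Ceil division: the last block may be only partially used.
    return (n_elements + block - 1) // block


n_voxels = 45 * 35 * 35  # one volume of the shape quoted in this thread
for block in (128, 256, 512):
    print(block, grid_size(n_voxels, block))
```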

BSchilperoort · 2026-05-11T13:38:59Z



-class OpenCLSerialCorrelator(Correlator):
+class OpenCLSerialCorrelator(SerialCorrelator):


Some comments on remaining duplicated code:

OpenCL correlator initializers still share 9 lines of logic (2/3rds of Serial.__init__, about half of Batched.__init__).

_set_template_var and _set_mask_var are complete duplicates. Could be inherited with a mix-in class instead...

init_vars can also be put in a mix-in class. Just do self.batch_size = None in the serial class, and move the empty_lcc_ft=True logic into the init_correlator_vars function (only set to true if batch-size is not None).

Alternatively we could also just remove the serial correlator here, I do not see what its use is. On my machine batch-size==1 is actually slightly faster than batch-size==0.
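
A minimal sketch of the mix-in idea from this comment. The class names and method bodies are placeholders rather than powerfit code; only `_set_template_var`, `_set_mask_var`, `init_correlator_vars`, `batch_size` and `empty_lcc_ft` are names taken from the review text.

```python
class SharedGPUVarsMixin:
    """Hypothetical mix-in; method bodies are placeholders, not powerfit code."""

    batch_size = None  # serial correlators keep this as None

    def _set_template_var(self, template):
        self._template = template

    def _set_mask_var(self, mask):
        self._mask = mask

    def init_correlator_vars(self):
        # The empty LCC buffer is only requested on the batched path.
        self.empty_lcc_ft = self.batch_size is not None


class SerialSketch(SharedGPUVarsMixin):
    def __init__(self, template, mask):
        self.batch_size = None
        self._set_template_var(template)
        self._set_mask_var(mask)
        self.init_correlator_vars()


class BatchedSketch(SharedGPUVarsMixin):
    def __init__(self, template, mask, batch_size):
        self.batch_size = batch_size
        self._set_template_var(template)
        self._set_mask_var(mask)
        self.init_correlator_vars()


serial = SerialSketch(template=[1.0], mask=[1.0])
batched = BatchedSketch(template=[1.0], mask=[1.0], batch_size=100)
print(serial.empty_lcc_ft, batched.empty_lcc_ft)
```

With this layout the two setters live in one place and the serial and batched classes differ only in how they set `batch_size`.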

Add batch-size=0 for all machines to performance.md. Also find combi where it is faster than batch size 1.

BSchilperoort · 2026-05-11T13:41:38Z

 * m2: AMD Ryzen 7 7800X3D and AMD Radeon RX 7900 XTX
+* m3: AMD EPYC 9554 and NVIDIA RTX 6000 Ada
+* m4: Intel i7-13700H and NVIDIA RTX 4050 Laptop via WSL



Shall I add results from my machine here, or is it not necessary seeing as you already tested the 7900 XTX (even though my gpu is slightly faster at powerfit 😉 )

Please do in follow up PR

BSchilperoort · 2026-05-11T13:45:43Z

-
-
-class CUDASerialCorrelator(Correlator):
+class CUDASerialCorrelator(SerialCorrelator):


Similar duplication issue as with the OpenCL classes here.

sverhoeven and others added 30 commits October 7, 2025 11:43

From table select molstar snapshot

2396018

Make card and table use same key/header

b95c46f

Disable button in table if fit is being visualized

a112bbb

Link to fit_X.pdb in table

eeccb58

Pass through multi-proc results directly instead of files

67e7c67

Merge branch 'powerfit-many' of https://github.com/haddocking/powerfit …

5a5abf8

…into powerfit-many

Make fish-z key same as header

ac03a8f

Apply review suggestions

19bfa0a

Merge pull request #86 from haddocking/report-table

81ba112

Report table

Close your files

2c5b94c

Merge pull request #83 from haddocking/powerfit-many

741298f

* Add `powerfit_many` function * removed the need to pass through partial results using the filesystem in the multi-processing CPU code. This reduces I/O and makes the code more simple

Merge remote-tracking branch 'origin/master' into gz

aadfd1f

Merge pull request #85 from haddocking/gz

065a5f1

Add support for gzipped template files

Allow for gzipped .map, .ccp4, .mrc files.

3fba2b7

Allow for passing fname instead of BinaryIO

dee7b5b

Add .gz volume test

dc27d29

Bump version to 3.2.0

f29baec

Merge pull request #88 from haddocking/gzip-maps

deee0a7

Allow for gzipped target files

Invert --laplace and --core-weighted cli arguments.

6d31a5e

Refs #89

Bump to v4.0.0

2716fe6

Remove combination remark

d1cd86a

The laplace and core-weighted are now turned on by default. A user must be explicit to turn them off. Disabling one or both of them is obviously possible, so does not need extra help text.

Merge pull request #90 from haddocking/89-invert-l-cw-flags

2ac1222

Invert --laplace and --core-weighted cli arguments

Fix sigma diff calc

2f0afae

Merge pull request #92 from haddocking/sigma-diff-fix

fdfe2b9

Fix sigma diff calc

Merge branch 'master' into 19-unused-code

5bef01a

Remove unused test file

c85051f

Remove rot_search script (hasn't worked since at least 2016)

0389466

Remove unused rotate_grid3d opencl kernel

c5d39f0

Remove kernel test file, it only tested broken rotate_grid CL kernel

6a086eb

Remove unused code/imports

6614b03

sverhoeven added 9 commits April 30, 2026 14:23

More renames

630a862

Use f strings instead of noqa UP031

d00d1f1

Initialize vars for gpu correlators with same method name

79a1764

Centralize init vars

8bfa662

Dedup batch correlators to super class

be5babc

Extract SerialCorrelator from Correlator

3a5803e

Fix line too long

74018b1

More opencl on nvidia runs

bd1c5e4

sverhoeven added 4 commits May 6, 2026 12:28

Combine plots into single plot with interactive legend

4a164ac

Reran m3 cuda batchsizes <=1000

81f3592

Previous was not using texture based rotate

Format

bea3c4b

Remove duplication of compute_* and their calls

ac78ad2

sverhoeven requested a review from BSchilperoort May 6, 2026 10:53

Do not have public alias for private, make public from start

9d2f12b

sverhoeven mentioned this pull request May 11, 2026

Replace c and cython extension with rust + rust correlator #109

Draft

BSchilperoort reviewed May 11, 2026


sverhoeven mentioned this pull request May 12, 2026

Drop serial gpu correlators? #111

Open

sverhoeven added 6 commits May 12, 2026 10:24

Rerun performance runs on m4 + plot bs=0=serial

b420fd8

Test different block sizes on cuda 4050

20bc894

Test block size on opencl

049e2dc

Include bs=0=serial from m3 in plot

d45c8e0

Add hint for --batch-size=0

d4b5ad0

Deduplicate gpu correlator and opencl kernels

be6035a

BSchilperoort approved these changes May 12, 2026


sverhoeven merged commit 9a11eed into master May 12, 2026
11 checks passed

sverhoeven mentioned this pull request May 12, 2026

make v5 release #112

Closed



