CUDA backend + batched rotations on GPU #107
…into powerfit-many
Report table
* Add `powerfit_many` function
* Removed the need to pass partial results through the filesystem in the multi-processing CPU code. This reduces I/O and makes the code simpler.
Add support for gzipped template files
Allow for gzipped target files
The laplace and core-weighted options are now turned on by default; a user must explicitly turn them off. Disabling one or both of them is obviously possible, so it does not need extra help text.
Invert --laplace and --core-weighted cli arguments
Fix sigma diff calc
…re fetches

Replace manual 8-point trilinear interpolation in all four CUDA rotation kernels (rotate_image3d_linear, rotate_image3d_nearest and their batch variants) with cudaTextureObject_t / tex3D<float> hardware texture fetches, mirroring the existing OpenCL cl.Image3D path.

- kernels.cu: use cudaTextureObject_t tex argument; sample via tex3D with normalised coordinates (divided by shape dims) and wrap addressing
- cuda.py: add CUDATexture dataclass (holds TextureObject + CUDAarray for lifetime management); add make_cuda_texture_linear and make_cuda_texture_nearerst factory functions; correlators use them for template and mask respectively
- cudakernels.py: accept TextureLike protocol (ptr: int); pass np.uint64 texture pointer to kernels; expose nearest flag on rotate_image3d
- shared.py: add NvidiaTexture protocol; constrain TypeVar I to (np.ndarray, Image, NvidiaTexture)
- tests: add parametrised tests for all four combos of linear/nearest × single/batch asserting identity rotation reproduces the source voxel; relax regression row_count_tolerance from 2 to 3 for hardware rounding
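For reference, the manual 8-point trilinear interpolation with wrap addressing that this commit replaces can be sketched on the CPU as below. This is a minimal pure-Python illustration, not the project's kernel: the function name `sample_trilinear_wrap` and the nested-list volume layout are assumptions made for the example. It shows the property the new tests check, namely that sampling at integer voxel coordinates (an identity rotation) returns the source voxel.

```python
import math


def sample_trilinear_wrap(volume, z, y, x):
    """Sample a 3-D nested-list volume at fractional (z, y, x) voxel coordinates.

    Hypothetical helper: blends the 8 surrounding voxels and wraps
    out-of-range indices around the volume edges.
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    z0, y0, x0 = math.floor(z), math.floor(y), math.floor(x)
    dz, dy, dx = z - z0, y - y0, x - x0

    value = 0.0
    for iz, wz in ((z0, 1.0 - dz), (z0 + 1, dz)):
        for iy, wy in ((y0, 1.0 - dy), (y0 + 1, dy)):
            for ix, wx in ((x0, 1.0 - dx), (x0 + 1, dx)):
                # Python's % always yields a non-negative index: wrap addressing.
                value += wz * wy * wx * volume[iz % nz][iy % ny][ix % nx]
    return value


# Small 2x3x4 volume whose voxel values encode their own position.
volume = [[[100 * z + 10 * y + x for x in range(4)] for y in range(3)] for z in range(2)]

# Identity rotation: integer coordinates reproduce the source voxel.
identity_ok = all(
    sample_trilinear_wrap(volume, z, y, x) == volume[z][y][x]
    for z in range(2) for y in range(3) for x in range(4)
)

# Halfway between x=0 and x=1 the result is the mean of the two voxels.
midpoint = sample_trilinear_wrap(volume, 0, 0, 0.5)

# A coordinate one voxel past the edge wraps back to index 0.
wrapped = sample_trilinear_wrap(volume, 0, 0, 4)
```

Hardware texture units typically compute the blend weights at reduced precision, which is consistent with the commit relaxing the regression tolerance for hardware rounding.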
Previous was not using texture based rotate
I tried to replace pyvkfft in the build_cuda_ffts_batched function with CUDA FFTs (cupy.). What follows is an LLM caveman summary:

Why pyvkfft faster than cupy.fft and cuFFTDx:

Core reason: VkFFT fuse 3-D FFT → single kernel. cuFFT/cuFFTDx decompose → 3 kernels (one per axis). Each extra kernel = full global-memory roundtrip on batch array.

- Attempt 1 (cp.fft with explicit plans): 21,955 kernel launches vs VkFFT's 6
- Attempt 2 (cuFFTDx):
  - Block mode: same 3-launch decomposition as cuFFT → same ~8s expected
  - Thread mode: entire FFT in registers, single kernel, matches VkFFT

Verdict: pyvkfft stays. No NVIDIA-native Python API does fused batched 3-D FFT for axes >32.
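The per-axis decomposition mentioned in the summary rests on the separability of the multi-dimensional DFT: a 3-D transform equals three passes of 1-D transforms, one per axis. The sketch below checks this with a naive pure-Python DFT on a tiny array. It is an illustration only and says nothing about how pyvkfft, cuFFT, or cuFFTDx are implemented; all function names are made up for the example.

```python
import cmath
from itertools import product


def dft_1d(values):
    """Naive O(n^2) discrete Fourier transform of a 1-D sequence."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft_along_axis(volume, shape, axis):
    """Apply dft_1d along one axis of a dict-based 3-D array keyed by (z, y, x)."""
    out = {}
    other_axes = [a for a in range(3) if a != axis]
    for fixed in product(*(range(shape[a]) for a in other_axes)):
        def key(i):
            # Build the full (z, y, x) index for position i along `axis`.
            idx = [0, 0, 0]
            idx[axis] = i
            for a, v in zip(other_axes, fixed):
                idx[a] = v
            return tuple(idx)

        line = dft_1d([volume[key(i)] for i in range(shape[axis])])
        for i, value in enumerate(line):
            out[key(i)] = value
    return out


def dft_3d_direct(volume, shape):
    """3-D DFT straight from the definition (one triple sum per output element)."""
    out = {}
    for k in product(*(range(n) for n in shape)):
        total = 0j
        for j in product(*(range(n) for n in shape)):
            phase = sum(j[a] * k[a] / shape[a] for a in range(3))
            total += volume[j] * cmath.exp(-2j * cmath.pi * phase)
        out[k] = total
    return out


shape = (2, 3, 4)
volume = {
    idx: float(idx[0] * 12 + idx[1] * 4 + idx[2])
    for idx in product(*(range(n) for n in shape))
}

# Three passes, one per axis: each pass reads and writes the whole array.
passes = volume
for axis in range(3):
    passes = dft_along_axis(passes, shape, axis)

direct = dft_3d_direct(volume, shape)
max_error = max(abs(passes[k] - direct[k]) for k in direct)
```

Each pass in the loop touches every element once, which is the CPU analogue of the per-axis global-memory round trip described in the summary.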
I used an LLM to implement a batched CPU correlator. What follows is a caveman summary:

Built CPU batched path.

- Added separate batched helpers, kept old CPU helpers intact.
- Wired batched mode into single CPU and multi CPU paths.
- Enabled batch-size on CPU CLI path.
- Added CPU max batch guesstimate from host RAM.

Validation

Regression test pass for:

- nproc 1, batch-size 100
- nproc 6, batch-size 100
- serial baseline batch-size 0 also pass

So correctness same. No result drift.

Perf results on real case (70728 rotations, shape 45x35x35)

- batch-size 0: 2m34-2m35, fastest
- batch-size 1: 2m48
- batch-size 2: 2m50
- batch-size 5: 2m47
- batch-size 100: 3m05
- batch-size 5000: 3m23, huge RAM use (~13 GB)

Why slower on CPU

- GPU batching win from launch amortization + device parallel kernels.
- CPU different: already heavy time inside native FFT kernels.
- Batched CPU adds big 4D working sets, more memory traffic, worse cache locality.
- Larger batches increase RAM pressure, hurt throughput.
- Extra batch reduction work (argmax/max over chunk) adds overhead.
- Net: no loop-overhead win big enough to beat memory/throughput penalty.

Profiler supports this

- FFT stages still dominant.
- rotate_grid3d time almost unchanged.
- Added batch reduction costs visible.
- Total runtime increases vs serial.

Conclusion

- CPU batching is dead end for this workload/hardware.
- Best CPU setting: batch-size 0 (serial CPU correlator).
- Keep batching for GPU backends only as performance feature.
- CPU batch support can stay as functional path, but not recommended for speed.
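The "batch reduction" overhead mentioned in the summary (argmax/max over a chunk) can be pictured with the following sketch. It is a simplified pure-Python stand-in, assuming each rotation in a batch yields a flat list of LCC scores, one per voxel. The function name `take_best` and the data layout are hypothetical and not taken from the powerfit code.

```python
def take_best(best_score, best_rot, batch_scores, batch_start):
    """Fold a batch of per-rotation score grids into running best grids.

    best_score, best_rot: flat lists with one entry per voxel, updated in place.
    batch_scores: list of flat score lists, one per rotation in the batch.
    batch_start: global index of the first rotation in this batch.
    """
    for voxel in range(len(best_score)):
        # max/argmax over the batch axis for this voxel
        local_idx = max(range(len(batch_scores)), key=lambda r: batch_scores[r][voxel])
        local_best = batch_scores[local_idx][voxel]
        if local_best > best_score[voxel]:
            best_score[voxel] = local_best
            best_rot[voxel] = batch_start + local_idx


n_voxels = 4
best_score = [float("-inf")] * n_voxels
best_rot = [-1] * n_voxels

# Two batches of two rotations each (global rotation indices 0-1 and 2-3).
take_best(best_score, best_rot, [[0.1, 0.9, 0.2, 0.0], [0.3, 0.1, 0.2, 0.5]], batch_start=0)
take_best(best_score, best_rot, [[0.2, 0.95, 0.1, 0.4], [0.0, 0.0, 0.7, 0.6]], batch_start=2)
```

In a serial loop the inner max over the batch axis disappears and only the final comparison remains, which is one way to read the "added batch reduction costs" seen in the profile.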
BSchilperoort left a comment
Some small lingering issues. Why wouldn't we remove the serial correlators for GPU altogether? I do understand that it's nice that the OpenCL serial class looks very similar to the CPU implementation, but I'm not sure they're required, seeing as a batch size of 1 doesn't perform worse than 0.
        self.conj_multiply_kernel(a, b, out)

    def compute_batch_lcc_score_and_take_best(self, batch_start: int, chunk_size: int):
        block = 256
Still this random number here that I commented on in my previous review.
Added comment with different blocks tested
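For readers unfamiliar with the value under discussion: in a CUDA-style launch, `block` is the number of threads per block, and the number of blocks is usually derived from it by ceiling division. A minimal sketch follows, with a hypothetical constant name and helper that are not part of the PR; the element count simply reuses the 45x35x35 shape from the benchmark as an example size.

```python
# Hypothetical named constant; the PR keeps the literal 256 and documents it in a comment.
THREADS_PER_BLOCK = 256


def launch_config(n_elements, block=THREADS_PER_BLOCK):
    """Return (grid, block) so that grid * block threads cover n_elements."""
    grid = (n_elements + block - 1) // block  # ceiling division
    return grid, block


# Example size only: one thread per voxel of a 45x35x35 grid.
n_elements = 45 * 35 * 35
grid, block = launch_config(n_elements)
```

Naming the constant and keeping the comment about which block sizes were tested next to it is one way to address the "random number" concern without changing behaviour.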
    -class OpenCLSerialCorrelator(Correlator):
    +class OpenCLSerialCorrelator(SerialCorrelator):
Some comments on the remaining duplicated code:
- The OpenCL correlator initializers still share 9 lines of logic (two-thirds of `Serial.__init__`, about half of `Batched.__init__`).
- `_set_template_var` and `_set_mask_var` are complete duplicates. They could be inherited from a mix-in class instead.
- `init_vars` can also be put in the mix-in class. Just do `self.batch_size = None` in the serial class, and move the `empty_lcc_ft=True` logic into the `init_correlator_vars` function (only set to true if batch-size is not None).

Alternatively, we could also just remove the serial correlator here; I do not see what its use is. On my machine batch-size==1 is actually slightly faster than batch-size==0.
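A rough sketch of the mix-in idea suggested in this comment. The class and method bodies are placeholders: only the names `_set_template_var`, `_set_mask_var`, `init_correlator_vars`, `batch_size` and `empty_lcc_ft` come from the review comment, and everything else (including `CorrelatorVarsMixin`, the `SerialSketch` and `BatchedSketch` classes, and all signatures) is assumed for illustration.

```python
class CorrelatorVarsMixin:
    """Hypothetical mix-in holding logic shared by serial and batched correlators."""

    batch_size = None  # serial default; the batched class sets it per instance

    def _set_template_var(self, template):
        # Placeholder body: the real method would upload the template to the device.
        self.template = template

    def _set_mask_var(self, mask):
        # Placeholder body: the real method would upload the mask to the device.
        self.mask = mask

    def init_correlator_vars(self):
        # Only the batched path (batch_size is not None) sets the flag to True.
        self.empty_lcc_ft = self.batch_size is not None


class SerialSketch(CorrelatorVarsMixin):
    def __init__(self, template, mask):
        self.batch_size = None
        self._set_template_var(template)
        self._set_mask_var(mask)
        self.init_correlator_vars()


class BatchedSketch(CorrelatorVarsMixin):
    def __init__(self, template, mask, batch_size):
        self.batch_size = batch_size
        self._set_template_var(template)
        self._set_mask_var(mask)
        self.init_correlator_vars()


serial = SerialSketch("template", "mask")
batched = BatchedSketch("template", "mask", batch_size=8)
```

With this layout the two setters exist once, and the only branch between the serial and batched variants is the `batch_size is not None` check.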
Add batch-size=0 results for all machines to performance.md. Also find a combination where it is faster than batch size 1.
    * m2: AMD Ryzen 7 7800X3D and AMD Radeon RX 7900 XTX
    * m3: AMD EPYC 9554 and NVIDIA RTX 6000 Ada
    * m4: Intel i7-13700H and NVIDIA RTX 4050 Laptop via WSL
Shall I add results from my machine here, or is it not necessary seeing as you already tested the 7900 XTX (even though my gpu is slightly faster at powerfit 😉 )
Please do so in a follow-up PR.
    -class CUDASerialCorrelator(Correlator):
    +class CUDASerialCorrelator(SerialCorrelator):
Similar duplication issue as with the OpenCL classes here.

This PR adds:
- `--progressbar` (on CPU only)

TODO
See https://github.com/haddocking/powerfit/blob/cuda/docs/performance.md for current speed
Performance on master branch -> cuda branch (search time, best batch size)