
Potential for runtime optimization in CudaSolveFirstCollisionSourceDLR4thOrder #1

@wahln

Description

  1. I see that most sparse matrices here are multiplied from the left while using CSC storage. For left-hand multiplication it should usually be faster to use row-major matrices, i.e. sparse matrices in CSR storage. One could consider changing this, but I don't know what Julia optimizes internally. I also don't know what CuSparseMatrixCSR, for example, would do if initialized from a CSC matrix - Julia/CUDA might just interpret it as a transposed CSR matrix, so merely replacing the calls might not yield an evident runtime benefit because we would still be doing the same thing behind the scenes.
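To illustrate the point about storage formats, here is a minimal sketch in Python/SciPy (as a stand-in for the Julia code, since the CSR-vs-CSC trade-off is language-agnostic): the same sparse matrix in both formats gives identical products, so the conversion itself is safe to try, and CSR's contiguous row layout is the one that usually suits a left-hand product `A @ B`.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# A random sparse matrix stored column-major (CSC), as in the solver.
A_csc = sp.random(2000, 2000, density=0.01, format="csc", random_state=0)
B = rng.standard_normal((2000, 64))

# Converting CSC -> CSR is a one-time cost, cheap relative to
# repeated products inside an energy/time loop.
A_csr = A_csc.tocsr()

# Both formats are numerically identical; only the memory traversal
# pattern of the multiply differs.
assert np.allclose(A_csc @ B, A_csr @ B)
```

Whether the conversion actually pays off depends on what the backend does internally (e.g. whether it silently treats a converted matrix as a transposed view), which is exactly the open question above.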

  2. Could we optimize the code below / move it to the GPU? It seems to take up a significant amount of time in the loop now (I might also have made mistakes while benchmarking, since I am not too familiar with Julia):
    https://github.com/CSMMLab/CSD-DLRA/blob/5b1e1d8a9cb4812904dd3e080349d9b5b98df67f/code/SolverCSD.jl#L2147-L2169

If the q-loop is the slow part, I think one could use broadcasting over beamx/beamy at least within the q-loop, and with some more thought probably over the whole nested loop including q, and then put this on the GPU.
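A hypothetical sketch of the broadcasting idea, in Python/NumPy as a stand-in (the names x, y, beamx, beamy, sigma are illustrative, not the repo's actual variables from SolverCSD.jl): a scalar nested loop over grid points is replaced by one fused array expression built from reshaped 1-D arrays, which is the form that maps well onto a GPU.

```python
import numpy as np

# Illustrative grid and beam parameters (assumed, not from the repo).
x = np.linspace(0.0, 1.0, 50)
y = np.linspace(0.0, 1.0, 60)
beamx, beamy, sigma = 0.5, 0.5, 0.1

# Loop version: one scalar evaluation per (i, j) pair.
loop = np.empty((x.size, y.size))
for i in range(x.size):
    for j in range(y.size):
        loop[i, j] = np.exp(-((x[i] - beamx) ** 2 + (y[j] - beamy) ** 2) / sigma**2)

# Broadcast version: reshape to (nx, 1) and (1, ny) so the subtraction
# and exp run as whole-array operations instead of scalar loops.
bcast = np.exp(-((x[:, None] - beamx) ** 2 + (y[None, :] - beamy) ** 2) / sigma**2)

assert np.allclose(loop, bcast)
```

The same reshaping trick extends to a third axis for the q-loop, at which point the whole nested loop becomes a single array expression that a GPU array type can execute without scalar iteration.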

Also, Dvec seems to be indexed element-wise on the GPU; I also once got a warning about scalar indexing into GPU arrays when using this method. I think this might also hurt runtime.
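To make the scalar-indexing concern concrete, here is a sketch in Python/NumPy as a CPU stand-in for a GPU array (the index values are made up for illustration): accessing one element at a time in a loop is the pattern that triggers the scalar-indexing warning on real GPU arrays, because each access becomes its own device operation, whereas a single vectorized gather is one bulk operation.

```python
import numpy as np

Dvec = np.arange(100.0)          # stand-in for the GPU-resident vector
idx = np.array([3, 7, 42, 99])   # illustrative indices

# Slow pattern on a GPU: element-by-element scalar access in a loop.
slow = np.array([Dvec[i] for i in idx])

# Preferred: one vectorized gather with an index array.
fast = Dvec[idx]

assert np.array_equal(slow, fast)
```

On the GPU the two are not equivalent in cost: the loop version forces per-element synchronization, while the gather stays a single array operation.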

Finally, I haven't checked the other branches for similar optimization opportunities.
