
Potential for runtime optimization in CudaSolveFirstCollisionSourceDLR4thOrder #1

@wahln

Description

  1. I see that most sparse matrices here are multiplied from the left while using CSC storage. For left-hand multiplication it should usually be faster to use row-major matrices, i.e. sparse matrices in CSR storage. One could consider changing this, but I don't know what Julia optimizes internally. I also don't know what CuSparseMatrixCSR, for example, would do if initialized from a CSC matrix - Julia/CUDA might just interpret it as a transposed CSR matrix, so merely replacing the calls might not yield an evident runtime benefit because we would still be doing the same thing behind the scenes.
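To illustrate the point about storage formats, here is a minimal sketch in Python/SciPy (as a stand-in for the Julia code, since the CSR-vs-CSC trade-off is language-agnostic): the same sparse matrix in both formats gives identical products, so the conversion itself is safe to try, and CSR's contiguous row layout is the one that usually suits a left-hand product `A @ B`.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# A random sparse matrix stored column-major (CSC), as in the solver.
A_csc = sp.random(2000, 2000, density=0.01, format="csc", random_state=0)
B = rng.standard_normal((2000, 64))

# Converting CSC -> CSR is a one-time cost, cheap relative to
# repeated products inside an energy/time loop.
A_csr = A_csc.tocsr()

# Both formats are numerically identical; only the memory traversal
# pattern of the multiply differs.
assert np.allclose(A_csc @ B, A_csr @ B)
```

Whether the conversion actually pays off depends on what the backend does internally (e.g. whether it silently treats a converted matrix as a transposed view), which is exactly the open question above.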

  2. Could we optimize the code below / move it to the GPU? It seems to take up a significant amount of time in the loop now (I might also have made mistakes while benchmarking, since I am not too familiar with Julia):
    https://github.com/CSMMLab/CSD-DLRA/blob/5b1e1d8a9cb4812904dd3e080349d9b5b98df67f/code/SolverCSD.jl#L2147-L2169

If the q-loop is the slow part, I think one could use broadcasting over beamx/beamy at least within the q-loop, and with some more thought probably over the whole nested loop including q, and then put this on the GPU.
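A hypothetical sketch of the broadcasting idea, in Python/NumPy as a stand-in (the names x, y, beamx, beamy, sigma are illustrative, not the repo's actual variables from SolverCSD.jl): a scalar nested loop over grid points is replaced by one fused array expression built from reshaped 1-D arrays, which is the form that maps well onto a GPU.

```python
import numpy as np

# Illustrative grid and beam parameters (assumed, not from the repo).
x = np.linspace(0.0, 1.0, 50)
y = np.linspace(0.0, 1.0, 60)
beamx, beamy, sigma = 0.5, 0.5, 0.1

# Loop version: one scalar evaluation per (i, j) pair.
loop = np.empty((x.size, y.size))
for i in range(x.size):
    for j in range(y.size):
        loop[i, j] = np.exp(-((x[i] - beamx) ** 2 + (y[j] - beamy) ** 2) / sigma**2)

# Broadcast version: reshape to (nx, 1) and (1, ny) so the subtraction
# and exp run as whole-array operations instead of scalar loops.
bcast = np.exp(-((x[:, None] - beamx) ** 2 + (y[None, :] - beamy) ** 2) / sigma**2)

assert np.allclose(loop, bcast)
```

The same reshaping trick extends to a third axis for the q-loop, at which point the whole nested loop becomes a single array expression that a GPU array type can execute without scalar iteration.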

Also, Dvec seems to be indexed element-wise on the GPU; I also once got a warning about scalar indexing into GPU arrays when using this method. I think this might also hurt runtime.
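To make the scalar-indexing concern concrete, here is a sketch in Python/NumPy as a CPU stand-in for a GPU array (the index values are made up for illustration): accessing one element at a time in a loop is the pattern that triggers the scalar-indexing warning on real GPU arrays, because each access becomes its own device operation, whereas a single vectorized gather is one bulk operation.

```python
import numpy as np

Dvec = np.arange(100.0)          # stand-in for the GPU-resident vector
idx = np.array([3, 7, 42, 99])   # illustrative indices

# Slow pattern on a GPU: element-by-element scalar access in a loop.
slow = np.array([Dvec[i] for i in idx])

# Preferred: one vectorized gather with an index array.
fast = Dvec[idx]

assert np.array_equal(slow, fast)
```

On the GPU the two are not equivalent in cost: the loop version forces per-element synchronization, while the gather stays a single array operation.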

Finally, I haven't checked the other branches for similar optimization opportunities.
