Hello Maintainer.
In the current code, the matrices are passed to cuBLAS in the order A, B, C to compute C = A × B, which is logically inconsistent with the column-major storage that cuBLAS assumes.
Conceptually, the correct approach is to pass the operands in reverse order (B, then A). cuBLAS reads a row-major buffer as the transpose of the matrix it holds, so the swapped call computes B^T × A^T = (A × B)^T = C^T, and a column-major C^T occupies memory exactly as a row-major C does. Without an explanation of this, new users can easily be confused.
This is not a functional bug, but rather an educational/documentation issue that could lead to misunderstanding.
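To make the point concrete, here is a minimal sketch of the kind of illustration I have in mind. It assumes the matrices are stored row-major. `gemm_col_major` is a hypothetical pure-Python stand-in for a column-major GEMM without transposes; it is not the cuBLAS API, and the sizes and values are made up for the example.

```python
def gemm_col_major(m, n, k, a, lda, b, ldb):
    """Reference column-major GEMM (no transposes): returns the m x n product, ldc = m.

    Hypothetical stand-in for what a column-major BLAS computes; not the cuBLAS API.
    """
    c = [0.0] * (m * n)
    for j in range(n):          # column of the result
        for i in range(m):      # row of the result
            acc = 0.0
            for p in range(k):
                # Column-major indexing: element (row, col) lives at row + col * ld.
                acc += a[i + p * lda] * b[p + j * ldb]
            c[i + j * m] = acc
    return c


# Row-major inputs: A is M x K, B is K x N. We want row-major C = A x B (M x N).
M, K, N = 2, 3, 2
A = [1, 2, 3,
     4, 5, 6]
B = [7, 8,
     9, 10,
     11, 12]

# Read as column-major, the A buffer is A^T (K x M, leading dimension K)
# and the B buffer is B^T (N x K, leading dimension N).
# Swapping the operands computes C^T = B^T x A^T (N x M, column-major),
# which has the same memory layout as row-major C (M x N).
C = gemm_col_major(N, M, K, B, N, A, K)

print(C)  # row-major C = A x B
```

With these inputs the buffer `C` holds 58, 64, 139, 154, which is A × B read row by row, even though the routine only ever worked in column-major terms.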
Would you accept a small PR that adds the explanation above as a code comment, along with a brief clarification in the README?
Thank you.