When using the precompiled PyTorch binaries (on Windows) from the official homepage, which backend is used for linear algebra on the GPU? Is it Magma or something else? As I have understood (or misunderstood) it, one can choose this when compiling.
Background: I want to do batched LU and Cholesky factorization, and it is a bit slow when applied to many small matrices (200 x 200). For instance, the same operation is twice as fast in TensorFlow, and judging from some research, the implementation can matter a lot, up to a factor of 10.
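For reference, here is a minimal sketch of the kind of benchmark I mean (assuming a CUDA device and a recent PyTorch with the batched torch.linalg routines; the batch size is just an example):

```python
import time
import torch

# Minimal benchmark sketch: batched LU and Cholesky on many small
# matrices. Assumes a CUDA device; sizes are illustrative.
device = "cuda"
n, batch = 200, 1000
A = torch.randn(batch, n, n, device=device)
# Shift the symmetrized batch to be positive definite for Cholesky.
spd = A @ A.transpose(-2, -1) + n * torch.eye(n, device=device)

torch.cuda.synchronize()
t0 = time.perf_counter()
LU, pivots = torch.linalg.lu_factor(A)   # batched LU
torch.cuda.synchronize()
print(f"LU:       {time.perf_counter() - t0:.4f} s")

t0 = time.perf_counter()
L = torch.linalg.cholesky(spd)           # batched Cholesky
torch.cuda.synchronize()
print(f"Cholesky: {time.perf_counter() - t0:.4f} s")
```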
Thanks! I had actually hoped not. Magma is, as far as I understand, really good, and still for LU factorization I only get about 30 GFLOP/s on a card that can do 10 TFLOP/s.
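For anyone checking my arithmetic, the flop rate comes from the standard ~(2/3) n^3 operation count for LU; the elapsed time below is hypothetical, back-calculated to reproduce the ~30 GFLOP/s figure above:

```python
# Rough flop-rate estimate for batched LU, using the standard
# ~(2/3) * n^3 operation count (pivoting overhead ignored).
n, batch = 200, 1000
elapsed = 0.18  # hypothetical: seconds that would yield ~30 GFLOP/s
flops = batch * (2 / 3) * n ** 3
print(f"{flops / elapsed / 1e9:.1f} GFLOP/s")  # ~29.6
```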
Note that many small matrices is usually the worst-case scenario for GPU performance, so I wouldn't be surprised that you're nowhere near the card's theoretical maximum throughput.
Is it better with larger matrices?
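A quick way to check would be a size sweep like this rough sketch (again assuming a CUDA device and the torch.linalg API; the sizes are arbitrary):

```python
import time
import torch

# Sweep matrix sizes at a fixed batch size to see whether
# per-flop throughput improves with larger matrices.
batch = 100
for n in (50, 100, 200, 400, 800):
    A = torch.randn(batch, n, n, device="cuda")
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    torch.linalg.lu_factor(A)
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    gflops = batch * (2 / 3) * n ** 3 / dt / 1e9
    print(f"n={n:4d}: {dt:.4f} s, ~{gflops:6.1f} GFLOP/s")
```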