How to make symeig use GPU only given CUDA Tensor?

I have read some previous threads on this issue. It seems that Pytorch is using Magma for GPU linear algebra as mentioned in this thread:
Issues about symeig and svd on GPU

Using Magma on a hybrid system is a good idea when there is only one process running, since it uses both the CPU and the GPU to make the computation as fast as possible for different matrix sizes.

However, I found that when I run several jobs together, symeig quickly makes the CPU a huge bottleneck. The issue is exactly as reported in the thread above: when I have a matrix A as a torch.cuda.FloatTensor of size 1024 x 1024 (I think for sizes like this, Magma chooses to place a major portion of the computation on the CPU), torch.symeig(A) occupies all of the CPU cores. If I then run 4 such processes independently on 4 GPUs, it scales really poorly: the CPU becomes the bottleneck and my GPUs all sit relatively idle.
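One partial mitigation for the oversubscription (not a real fix for the hybrid scheduling, and the thread counts here are illustrative assumptions) is to cap each process's BLAS/OpenMP thread pools so four concurrent jobs don't each grab every core:

```python
import os

# Cap the BLAS/OpenMP thread pools *before* importing torch, so that
# several concurrent jobs do not each claim every CPU core.
# (These env vars are honored by MKL/OpenMP; the value 4 is illustrative.)
os.environ.setdefault("OMP_NUM_THREADS", "4")
os.environ.setdefault("MKL_NUM_THREADS", "4")

import torch

# Also cap torch's own intra-op thread pool at runtime.
torch.set_num_threads(4)
print(torch.get_num_threads())
```

Each job still steals some CPU, but at least the four processes no longer fight over all the cores at once.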

Would it be possible to add a feature that forces symeig to run only on the GPU? Like @smth mentioned here:
switch CUDA svd and qr to using cuSolver

I appreciate Magma’s potential acceleration for a single process on a hybrid system, but on servers it’s usually better to keep the computation on the GPUs, or at least to have that option. Also, can anyone suggest a workaround for now? Would cupy + pytorch be a good option? I have many GPUs and would like to run independent jobs on them while using the CPU only to feed them. Thanks a lot in advance!

My quick benchmark for matrices of size 512x512 ~ 2048x2048 ~ 4096x4096 shows that Magma is about 5x ~ 2.5x ~ 2x faster than cupy for symmetric matrix eigendecomposition. This is consistent with the numbers reported by SsnL. It seems MAGMA is a good feature to keep for the case where only one process is running.


If cupy does what you need then using it as a workaround is good. Just keep in mind autograd won’t work with the operations unless you define a backward function in a custom autograd function.
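For reference, the custom-function pattern mentioned above looks roughly like this. It's a sketch under my own assumptions: `SymEig` is a made-up name, numpy stands in for cupy so it runs anywhere, and the backward uses the standard gradient formula for the symmetric eigendecomposition (which assumes distinct eigenvalues):

```python
import numpy as np
import torch

class SymEig(torch.autograd.Function):
    """Symmetric eigendecomposition routed through an external library
    (numpy here as a CPU stand-in for cupy), with a manual backward."""

    @staticmethod
    def forward(ctx, A):
        # Round-trip: torch tensor -> numpy -> eigh -> back to torch.
        w_np, U_np = np.linalg.eigh(A.detach().cpu().numpy())
        w = torch.from_numpy(w_np).to(A.device)
        U = torch.from_numpy(U_np).to(A.device)
        ctx.save_for_backward(w, U)
        return w, U

    @staticmethod
    def backward(ctx, gw, gU):
        # Standard gradient of eigh for a symmetric input:
        #   dL/dA = U (diag(gw) + F * (U^T gU)) U^T,  F_ij = 1/(w_j - w_i),
        # symmetrized at the end. Assumes distinct eigenvalues.
        w, U = ctx.saved_tensors
        F = w.unsqueeze(0) - w.unsqueeze(1)        # F_ij = w_j - w_i
        F = torch.where(F.abs() > 1e-12, 1.0 / F, torch.zeros_like(F))
        inner = torch.diag(gw) + F * (U.t() @ gU)
        gA = U @ inner @ U.t()
        return 0.5 * (gA + gA.t())                 # project onto symmetric matrices
```

With cupy you would swap `np.linalg.eigh` for `cupy.linalg.eigh` plus the corresponding device transfers; the backward formula is unchanged.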

@richard Autograd is one issue, and I can manually compute the gradients. But another issue is the communication.

Every time I need to do an svd or symeig, I have to copy the matrix from the GPU to the CPU, convert it to a cupy array, and send it back to the GPU. Then, after the actual operation, I copy the result back to the CPU, convert it to a pytorch tensor, and send it to the GPU yet again. Using cupy introduces this double round-trip communication, which is problematic.
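The host round-trip can be avoided entirely by sharing the GPU buffer between torch and cupy through DLPack. A sketch, assuming cupy is installed and the input lives on a CUDA device; `symeig_via_cupy` is my own name, and the exact DLPack helper names vary somewhat between cupy versions:

```python
import torch

def symeig_via_cupy(A):
    """Eigendecomposition of a symmetric CUDA tensor via cupy, sharing
    GPU memory through DLPack so no CPU round-trip is needed.
    (Sketch; assumes cupy is installed and A is a CUDA tensor.)"""
    import cupy as cp
    from torch.utils.dlpack import to_dlpack, from_dlpack

    A_cp = cp.from_dlpack(to_dlpack(A))   # zero-copy view of A on the GPU
    w_cp, U_cp = cp.linalg.eigh(A_cp)     # runs on the GPU via cuSOLVER
    w = from_dlpack(w_cp.toDlpack())      # zero-copy back to torch
    U = from_dlpack(U_cp.toDlpack())
    return w, U
```

This keeps everything on the device: the only costs are the eigh itself and the (free) DLPack handle exchanges.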

A full-GPU implementation of eig and svd operations would be very helpful.