I have read some previous threads on this issue. It seems that PyTorch uses MAGMA for GPU linear algebra, as mentioned in this thread:
Issues about symeig and svd on GPU
Using MAGMA on a hybrid system is a good idea when only one process is running, since it uses both the CPU and the GPU to make the computation as fast as possible across different matrix sizes.
However, I found that when I run several jobs together, the symeig function quickly makes the CPU a huge bottleneck. The issue is exactly as reported in the thread above: with a matrix A as a torch.cuda.FloatTensor of size 1024 x 1024 (for matrices of this size, I believe MAGMA places a major portion of the computation on the CPU), torch.symeig(A) occupies all of the CPU cores. If I then run 4 such processes independently on 4 GPUs, they scale really poorly: the CPU becomes the bottleneck and my GPUs sit mostly idle.
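One mitigation for the CPU oversubscription described above is to cap the CPU threads each job may use, so that several concurrent processes do not fight over the cores. The sketch below is an assumption-laden illustration, not a confirmed fix: it sets the common threading environment variables before the libraries spin up their thread pools, then runs the eigendecomposition (shown here with `torch.linalg.eigh`, the current equivalent of `torch.symeig`, and with a CPU fallback so it runs on machines without a GPU).

```python
# Mitigation sketch (assumption): limit each process to one CPU thread so
# four concurrent jobs don't oversubscribe the cores. The env vars must be
# set before the heavy libraries initialize their thread pools.
import os
os.environ.setdefault("OMP_NUM_THREADS", "1")   # OpenMP threads (MAGMA's CPU side)
os.environ.setdefault("MKL_NUM_THREADS", "1")   # MKL BLAS/LAPACK threads

import torch
torch.set_num_threads(1)  # PyTorch's own intra-op CPU thread pool

# CPU fallback only so the snippet runs anywhere; the issue itself is GPU-side.
device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(1024, 1024, device=device)
A = A + A.t()  # symmetrize so the eigendecomposition is well defined

# Older PyTorch: w, v = torch.symeig(A, eigenvectors=True)
w, v = torch.linalg.eigh(A)
```

This only reduces the contention between processes; it does not stop MAGMA from doing part of the work on the CPU in the first place.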
Is it possible to add a feature that forces symeig to run only on the GPU? Like @smth mentioned here:
switch CUDA svd and qr to using cuSolver
I appreciate MAGMA's potential acceleration for a single process on a hybrid system, but on a server it is usually better to keep the computation on the GPUs, or at least to have that option. Also, can anyone suggest a workaround for now? Would CuPy + PyTorch be a good option? I have many GPUs and would like to run independent jobs on them while using the CPU only to feed them. Thanks a lot in advance!
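For the CuPy + PyTorch idea, one possible shape of the workaround is to route the eigendecomposition through CuPy, whose `eigh` is backed by cuSOLVER and stays entirely on the GPU, sharing memory with PyTorch via DLPack so no copies are made. This is a sketch under the assumption that CuPy is installed alongside a CUDA build of PyTorch; `eigh_gpu_only` is a hypothetical helper name, and it falls back to PyTorch's own routine when CuPy or a GPU is unavailable.

```python
import torch


def eigh_gpu_only(A):
    """Symmetric eigendecomposition that tries to avoid MAGMA's CPU work
    by calling CuPy/cuSOLVER when A lives on a GPU and CuPy is installed;
    otherwise it falls back to PyTorch's built-in routine."""
    if A.is_cuda:
        try:
            import cupy as cp
            from torch.utils.dlpack import from_dlpack, to_dlpack

            # Zero-copy view of the torch CUDA tensor as a CuPy array.
            w, v = cp.linalg.eigh(cp.from_dlpack(to_dlpack(A)))
            # Zero-copy view back to torch tensors.
            return from_dlpack(w.toDlpack()), from_dlpack(v.toDlpack())
        except ImportError:
            pass  # CuPy not available; fall through to PyTorch
    # Older PyTorch releases: torch.symeig(A, eigenvectors=True)
    return torch.linalg.eigh(A)
```

Whether this actually beats MAGMA under multi-process load is exactly what would need benchmarking, since cuSOLVER may be slower for a single job (see the numbers below).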
My quick benchmark on matrices from 512x512 to 2048x2048 to 4096x4096 shows that MAGMA is about 5x, 2.5x, and 2x faster than CuPy, respectively, for symmetric matrix eigendecomposition. This is consistent with the numbers reported by SsnL. So MAGMA seems like a good feature to keep for the case where only one process is running.