Torch.linalg.eigh is significantly slower on GPU

Hi All,

I’ve just noticed that torch.linalg.eigh is significantly slower when run on the GPU than on the CPU, and I was wondering: is this the expected behaviour for such an operation?

For example,

from time import time
import torch

matrices = torch.randn(10000, 200, 200)

t1=time()
torch.linalg.eigh(matrices)
torch.cuda.synchronize()
t2=time()
cpu_time = t2-t1

matrices = matrices.to(torch.device('cuda'))

t1=time()
torch.linalg.eigh(matrices)
torch.cuda.synchronize()
t2=time()
gpu_time = t2-t1

#cpu_time: 12.991785526275635 (s)
#gpu_time: 42.85719561576843 (s)

Is this expected behaviour or a bug?

Any feedback would be greatly appreciated! 🙂

Was this issue resolved? If not, it may help to check the CUDA version you’re running. torch.linalg.eigh has been reported to be a bit buggy on CUDA versions other than 11.5.
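For example, you can print the installed build and the CUDA runtime it was compiled against with:

import torch

print(torch.__version__)   # PyTorch build
print(torch.version.cuda)  # CUDA version the build was compiled against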

I ran it with a few different versions and got similar results, so it seems to still be buggy.

version:  1.12.0a0+git7c2103a
CUDA:  11.6
CPU time:  10.439055442810059
GPU time:  37.6059353351593
version:  1.11.0.dev20220201+cu111
CUDA:  11.1
CPU time:  9.315621852874756
GPU time:  40.59413170814514

Hey, what is the torch version you’re using, and was it built with the updated CUDA version? Please refer here

So I’m currently running PyTorch 1.12 (built from source) with CUDA 11.6 and the problem still persists. I tried with an old install of PyTorch 1.11 with CUDA 11.1 and the problem exists too.

Can you try this: pip3 install --pre torch -f https://download.pytorch.org/whl/nightly/cu115/torch_nightly.html

Hi rwchakra!

I can reproduce Alpha’s eigh() gpu slowness on the latest nightly.

[Edit: I’ve also reproduced the gpu slowness on a cuda-11.6 nightly,
1.13.0.dev20220626+cu116, that @ptrblck pointed me to.]

Here is my slightly-tweaked version of Alpha’s code:

from time import time

import torch

print (torch.__version__)
print (torch.version.cuda)
print (torch.cuda.get_device_name())

_ = torch.manual_seed (2022)

matrices = torch.randn(10000, 200, 200)

t1=time()
torch.linalg.eigh(matrices)
torch.cuda.synchronize()
t2=time()
cpu_time = t2-t1

print ('cpu_time:', cpu_time)

matrices = matrices.to(torch.device('cuda'))

t1=time()
torch.linalg.eigh(matrices)
torch.cuda.synchronize()
t2=time()
gpu_time = t2-t1

print ('gpu_time:', gpu_time)

And here is the output:

1.13.0.dev20220626
11.3
GeForce GTX 1050 Ti
cpu_time: 17.44167733192444
gpu_time: 53.74928617477417

(I see the same timings on pytorch version 1.11.0 / cuda 11.3.)

Best.

K. Frank


Hi KFrank and Alpha,

torch.linalg.eigh assumes symmetric matrices and, by default, reads only the lower triangular portion of the input. I used the code posted here and observed similar results to yours on Colab.
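As a quick illustration of the lower-triangular point (UPLO='L' is the default, so the strictly upper triangle is never read):

import torch

a = torch.randn(5, 5)
a = a @ a.T  # symmetric input

# corrupt only the strictly upper triangle
b = a + torch.triu(torch.randn(5, 5), diagonal=1)

# eigh with the default UPLO='L' never reads the upper triangle,
# so both calls return the same eigenvalues
print(torch.allclose(torch.linalg.eigh(a).eigenvalues,
                     torch.linalg.eigh(b).eigenvalues))  # True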

However, note what happens when I use a single random 10000 × 10000 matrix instead: the GPU is now much faster than the CPU.
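For concreteness, here is roughly the comparison I mean (the same timing pattern as the code above, with one large matrix in place of the batch):

from time import time
import torch

matrix = torch.randn(10000, 10000)  # one large matrix instead of 10000 small ones

t1 = time()
torch.linalg.eigh(matrix)
t2 = time()
print('cpu_time:', t2 - t1)

matrix = matrix.to(torch.device('cuda'))
torch.cuda.synchronize()

t1 = time()
torch.linalg.eigh(matrix)
torch.cuda.synchronize()
t2 = time()
print('gpu_time:', t2 - t1)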

Can you check whether this works for you as well?

It seems that the GPU is much better at dealing with one large matrix than with a large number of smaller matrices. I get similar behaviour!

version:  1.11.0.dev20220201+cu111
CUDA:  11.1
CPU time:  40.16720700263977
GPU time:  5.0263426303863525

version:  1.12.0a0+git7c2103a
CUDA:  11.6
CPU time:  40.515045404434204
GPU time:  5.641661167144775

I observe the same inefficient batching results on GPU for torch.linalg.eig too:

from time import time

import torch

print (torch.__version__)
print (torch.version.cuda)
print (torch.cuda.get_device_name())

_ = torch.manual_seed (2022)

matrices = torch.randn(10000, 200, 200)

t1=time()
torch.linalg.eig(matrices)
torch.cuda.synchronize()
t2=time()
cpu_time = t2-t1

print ('cpu_time:', cpu_time)

matrices = matrices.to(torch.device('cuda'))

t1=time()
torch.linalg.eig(matrices)
torch.cuda.synchronize()
t2=time()
gpu_time = t2-t1

print ('gpu_time:', gpu_time)

Output:

1.11.0+cu113
11.3
Tesla T4
cpu_time: 167.679447889328
gpu_time: 167.89750504493713

Hi Alpha!

I haven’t looked at the gpu implementation of eigh() (and if I had looked at
it I wouldn’t have understood a line of it), so this is speculation:

From the eigh() documentation:

Note

When inputs are on a CUDA device, this function synchronizes that device with the CPU.

This suggests to me that even when running on the gpu, the eigh() algorithm
has some bit of processing performed on the cpu, and that a synchronize()
is required. For a batch containing just a single matrix – or at least a single
large matrix – this doesn’t really matter, but for a batch of many small matrices,
it causes a significant reduction in performance.

For example, if the batch gpu algorithm were something as crude as:

cpu loop over size of batch:
   tell gpu to perform most of eigh() algorithm on current batch element
   torch.cuda.synchronize()
   complete eigh() on current batch element
   update result on gpu

you could imagine how performing that back-and-forth with the cpu 10,000 times,
including the synchronize(), could really hurt performance.

One might speculate that the part of eigh() performed on the cpu could be
implemented on the gpu – even if algorithmically less efficient – for large
batch sizes. Or perhaps that synchronize() / back-and-forth could be
performed on groups of batch elements, rather than on individual elements,
one by one.

Again, just speculation.
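That said, here is a crude way to get a feel, from the Python side, for how much
those per-element round trips and synchronize() calls can cost. (This only imitates
the hypothetical loop above; it is not the actual implementation, and I use a
smaller batch of 1000 to keep the loop version tolerable.)

from time import time

import torch

matrices = torch.randn (1000, 200, 200, device = 'cuda')

def timed (fn, label):
    torch.cuda.synchronize()
    t1 = time()
    fn()
    torch.cuda.synchronize()
    print (label, time() - t1)

# one call on the full batch
timed (lambda: torch.linalg.eigh (matrices), 'batched:')

# ten calls, one per chunk of 100 matrices
timed (lambda: [torch.linalg.eigh (c) for c in matrices.chunk (10)], 'chunked:')

# one call (and an explicit synchronize()) per matrix
def one_by_one():
    for m in matrices:
        torch.linalg.eigh (m)
        torch.cuda.synchronize()

timed (one_by_one, 'one-by-one:')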

Best.

K. Frank

Hi @KFrank!

Thanks for the detailed response! Reading your comments gave me the thought of trying functorch to see whether it affects performance. The problem seems to primarily affect larger matrices, and with functorch the GPU actually becomes faster than the CPU for small matrices. I’ve got some results below, with all times measured in seconds.

version:  1.12.0a0+git7c2103a
CUDA:  11.6 

Size of Tensor:  torch.Size([1000, 32, 32])
CPU time:  0.05937480926513672
GPU time:  0.5319147109985352
FUNC time:  0.004615306854248047 

Size of Tensor:  torch.Size([1000, 64, 64])
CPU time:  0.1272900104522705
GPU time:  0.8110277652740479
FUNC time:  0.7586314678192139 

Size of Tensor:  torch.Size([1000, 128, 128])
CPU time:  0.4362821578979492
GPU time:  1.8971400260925293
FUNC time:  1.9026517868041992 

Size of Tensor:  torch.Size([1000, 256, 256])
CPU time:  1.5289278030395508
GPU time:  5.590537071228027
FUNC time:  5.531857967376709

Size of Tensor:  torch.Size([10000, 32, 32])
CPU time:  0.4745197296142578
GPU time:  0.515204906463623
FUNC time:  0.04160284996032715 

Size of Tensor:  torch.Size([10000, 64, 64])
CPU time:  1.1662471294403076
GPU time:  6.958388805389404
FUNC time:  6.955511808395386 

Size of Tensor:  torch.Size([10000, 128, 128])
CPU time:  3.8899362087249756
GPU time:  18.134103775024414
FUNC time:  18.80874514579773

To reproduce these results, the script is below:

from time import time
import torch

from functorch import vmap
veigh = vmap(torch.linalg.eigh)
# same as vmap(torch.linalg.eigh, in_dims=0)(matrices)

print("version: ",torch.__version__)
print("CUDA: ",torch.version.cuda, "\n")

for B in [1000, 10000]:
  for N in [32, 64, 128, 256]:
    matrices = torch.randn(B, N, N)
    matrices = matrices @ matrices.transpose(-2,-1)

    torch.cuda.synchronize()
    t1=time()
    torch.linalg.eigh(matrices)
    torch.cuda.synchronize()
    t2=time()
    cpu_time = t2-t1

    matrices = matrices.to(torch.device('cuda'))

    torch.cuda.synchronize()
    t1=time()
    torch.linalg.eigh(matrices)
    torch.cuda.synchronize()
    t2=time()
    gpu_time = t2-t1

    torch.cuda.synchronize()
    t1=time()
    veigh(matrices)
    torch.cuda.synchronize()
    t2=time()
    func_time=t2-t1

    print("Size of Tensor: ",matrices.shape)
    print("CPU time: ",cpu_time)
    print("GPU time: ",gpu_time)
    print("FUNC time: ",func_time, "\n")