Solving linear equations is very slow

Hello!
I need to solve a linear system in PyTorch, and I use the “torch.gesv” function to do it. However, I find that this function is very slow on both the GPU and the CPU. Why?
Here is my code:

import pdb
import time
import torch

x = torch.rand([600,600])
x = x.cuda()
y = torch.rand([600,1])
y = y.cuda()

start = time.time()
# Solve the same system ten times and report the average time per call
for i in range(10):
  w = torch.gesv(y,x)
end = time.time()

print((end-start)/10.0)

When the matrix size is 600, a single solve takes 43 ms on the GPU and 28 ms on the CPU, whereas MATLAB needs only 3 ms on the CPU.
When the matrix size is 3600, a single solve takes 65 ms on the GPU and 64 ms on the CPU.
So I wonder why PyTorch is so slow, and why the GPU is slower than the CPU.
Thanks for your answer!

Hi,

These functions come from the BLAS/LAPACK libraries and are not implemented by the PyTorch backend directly.
Which BLAS library do you use with PyTorch? For the GPU, do you use MAGMA?

Thank you for your reply!
I installed PyTorch using the command “conda install pytorch torchvision -c pytorch” and did nothing else. Do I need to install MAGMA and BLAS separately?

I am not sure what is included in the pre-packaged version.
Do you have the mkl package and magma package installed in conda with pytorch?

How can I check that? Sorry, I have not looked up the relevant instructions. @albanD

Now, the key issue is that the GPU is much slower than the CPU when the matrix A is small. How can I fix this? I would like them to take about the same time. Thank you! @albanD

I don’t use conda personally, but something like conda list (you may want to double-check the exact command) should give you the list of installed packages.
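Besides the package list, the installed build can also be asked directly from Python. The sketch below is an illustration rather than something suggested in this thread: the helper name blas_backend_report is made up, it uses torch.backends.mkl.is_available(), and it reads torch.cuda.has_magma through getattr on the assumption that the attribute may be missing in some releases.

```python
def blas_backend_report():
    """Summarize what the installed PyTorch build reports about MKL and MAGMA.

    Hypothetical helper; returns a plain dict so it is easy to print or log.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False}

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # True when this build can use Intel MKL for CPU BLAS/LAPACK calls.
        "mkl_available": torch.backends.mkl.is_available(),
        "cuda_available": torch.cuda.is_available(),
        # Assumption: has_magma may be absent in some releases, hence getattr.
        "magma_available": getattr(torch.cuda, "has_magma", None),
    }


if __name__ == "__main__":
    for key, value in blas_backend_report().items():
        print(key, "=", value)
```

A value of None for magma_available only means the attribute was not found, not that MAGMA is missing.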

For small matrices, it is possible that the GPU runtime is larger than the CPU one. In particular, sending the job to the GPU and asking for it to be done takes some time.
Also, to get proper timings on the GPU, you need to do the following every time you call time.time():

torch.cuda.synchronize()
current_time = time.time()
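To avoid forgetting one of the synchronization points, the pattern can be wrapped in a small helper. This is a minimal sketch, not code from this thread: the name time_call and its parameters are made up for illustration, and the synchronization function is passed in as an argument so that the same helper also works for CPU-only code.

```python
import time


def time_call(fn, sync=None, repeats=10):
    """Return the average wall-clock seconds per call of fn.

    sync is called right before each clock read; for CUDA work, pass
    torch.cuda.synchronize so queued GPU work is finished before timing.
    """
    if sync is None:
        sync = lambda: None  # CPU-only work needs no synchronization

    fn()      # warmup call, excluded from the measurement
    sync()    # make sure the warmup has really finished
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    sync()    # wait for all timed work before reading the clock again
    end = time.perf_counter()
    return (end - start) / repeats


if __name__ == "__main__":
    # CPU-only demonstration so the sketch runs without a GPU.
    avg = time_call(lambda: sum(range(10000)), repeats=5)
    print("average seconds per call:", avg)
```

For the GPU case discussed here, the call would look like time_call(lambda: torch.gesv(y, x), sync=torch.cuda.synchronize).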

Yes, I added it, and here is my code:

import pdb
import time
import torch

x = torch.rand([30,30])
x = x.cuda()
y = torch.rand([30,1])
y = y.cuda()

torch.cuda.synchronize()
start = time.time()
for i in range(10):
  w = torch.gesv(y,x)
end = time.time()

print((end-start)/10.0)

x = torch.rand([30,30])
y = torch.rand([30,1])

w = torch.gesv(y,x)

start = time.time()
for i in range(10):
  w = torch.gesv(y,x)
end = time.time()

print((end-start)/10.0)

The result is 0.00462810993195 s on the GPU and 2.31027603149e-05 s on the CPU.
This has confused me for a long time. @albanD

This code is not correct: you need to add torch.cuda.synchronize() before every call to time.time(), not just the first one.

Thank you for your reply. OK, I have now written the test code like this:

import pdb
import time
import torch

x = torch.rand([30,30])
x = x.cuda()
y = torch.rand([30,1])
y = y.cuda()

torch.cuda.synchronize()
start = time.time()
w = torch.gesv(y,x)
end = time.time()

print (end-start)

x = torch.rand([30,30])
y = torch.rand([30,1])

w = torch.gesv(y,x)

start = time.time()
w = torch.gesv(y,x)
end = time.time()

print (end-start)

The result is still 0.00907206535339 s on the GPU and 5.60283660889e-05 s on the CPU. Why?

Hi,

Here is the code that does proper timings:

import pdb
import time
import torch

x = torch.rand([30,30])
x = x.cuda()
y = torch.rand([30,1])
y = y.cuda()

torch.cuda.synchronize()
start = time.time()
w = torch.gesv(y,x)
torch.cuda.synchronize()
end = time.time()

print("GPU before warmup")
print (end-start)

x = torch.rand([30,30])
x = x.cuda()
y = torch.rand([30,1])
y = y.cuda()

torch.cuda.synchronize()
start = time.time()
w = torch.gesv(y,x)
torch.cuda.synchronize()
end = time.time()

print("GPU after warmup")
print (end-start)

x = torch.rand([30,30])
x = x.cuda()
y = torch.rand([30,1])
y = y.cuda()

torch.cuda.synchronize()
start = time.time()
w = torch.gesv(y,x)
# MISSING synch here, so we only measure the time to launch the kernel
# not the time to execute it
# torch.cuda.synchronize()
end = time.time()

print("GPU just launch time")
print (end-start)

x = torch.rand([30,30])
y = torch.rand([30,1])

w = torch.gesv(y,x)

torch.cuda.synchronize()
start = time.time()
w = torch.gesv(y,x)
torch.cuda.synchronize()
end = time.time()

print("CPU")
print (end-start)

From the results you can see that the time is mostly spent launching the job on the GPU, not actually doing any computation.
For very small workloads, running on the GPU is expected to be slower, because the cost of launching the job is fixed and dominates the total time.
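The argument can be made concrete with a toy cost model: GPU time is a fixed launch overhead plus work divided by throughput, while CPU time has no launch overhead but a lower throughput. Every constant below is hypothetical and chosen only to show the shape of the trade-off, and the n**3 operation count is a rough stand-in for a dense solve. Nothing here is measured.

```python
LAUNCH_OVERHEAD_S = 2e-3     # hypothetical fixed cost to launch a GPU job
GPU_FLOPS = 1e12             # hypothetical sustained GPU throughput
CPU_FLOPS = 2e10             # hypothetical sustained CPU throughput


def solve_work(n):
    """Rough operation count for a dense n x n solve (order n**3)."""
    return (2.0 / 3.0) * n ** 3


def gpu_time(n):
    """Modeled GPU time: fixed launch cost plus compute time."""
    return LAUNCH_OVERHEAD_S + solve_work(n) / GPU_FLOPS


def cpu_time(n):
    """Modeled CPU time: compute time only."""
    return solve_work(n) / CPU_FLOPS


def crossover_size(max_n=100000):
    """Smallest n at which the modeled GPU time beats the modeled CPU time."""
    for n in range(1, max_n + 1):
        if gpu_time(n) < cpu_time(n):
            return n
    return None


if __name__ == "__main__":
    for n in (30, 600, 3600):
        print(n, "gpu:", gpu_time(n), "cpu:", cpu_time(n))
    print("modeled crossover size:", crossover_size())
```

Below the crossover size the fixed launch cost dominates and the CPU wins in this model; above it the higher throughput of the GPU takes over. Real crossover points depend on the hardware and the libraries in use.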

Thank you very much for your attention!
Do you mean that most of the 0.009 s I measured is GPU startup time?
If so, how can I measure the real execution time of w = torch.gesv(y,x) on the GPU? @albanD

The timing "GPU after warmup" is the actual execution time. You can’t avoid paying this startup cost.

Thank you for your patience! Briefly, in my actual task I forward an image through a CNN model (on the GPU) and then need to execute torch.gesv(y,x) 1000 times (also on the GPU), where x has size 30*30. In that case, is the execution time of a single torch.gesv(y,x) call 0.009 s or 5.6e-05 s? @albanD

I wonder whether every GPU operation in PyTorch needs a warmup, or whether torch.gesv is special.
Intuitively, after the first execution of torch.gesv the GPU should already be warmed up, so why does every call to torch.gesv seem to need the warmup again? @albanD

There is a warmup, but it happens only once, when the GPU is initialized.
After that, what you see is the time to launch the job: literally the time to send the instruction to the GPU to perform a given operation. Because of the hardware, this has a non-zero cost, especially compared to executing something on the CPU itself.
The fact that the GPU is slower than the CPU for very small operations is completely expected.

So, as you say, when I execute

w = torch.gesv(y,x)
w = torch.gesv(y,x)
w = torch.gesv(y,x)
w = torch.gesv(y,x)
w = torch.gesv(y,x)

does the warmup happen only at the first line, or at every line?

The results of the code on my machine are:

GPU before warmup
0.269862890244
GPU after warmup
0.00235605239868
GPU just launch time
0.00227999687195
CPU
2.8133392334e-05

This means:

  • The first GPU call of your program performs the initialization of your GPU (warmup) and thus takes a few hundred ms (or more if you have many GPUs).
  • Then the actual execution time whenever you run this line of code is about 2 ms in this case.
  • The third timing does not wait for the device to finish its computation, so it just measures how long it takes to ask the GPU to perform the operation.
  • The last one is the CPU time.

In this case, for the gesv operation, solving such a small system has a high overhead on the GPU.
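The difference between a launch-only timing and a synchronized timing can be reproduced without a GPU. The sketch below is only an analogy: FakeDevice is a made-up class that runs queued jobs on a worker thread, standing in for asynchronous GPU execution, and the 50 ms job duration is arbitrary.

```python
import queue
import threading
import time


class FakeDevice:
    """Toy stand-in for a GPU: jobs are queued and run on a worker thread."""

    def __init__(self):
        self._jobs = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def _run(self):
        while True:
            job = self._jobs.get()
            job()                    # "execute the kernel"
            self._jobs.task_done()

    def launch(self, job):
        """Enqueue a job and return immediately, like an async kernel launch."""
        self._jobs.put(job)

    def synchronize(self):
        """Block until every queued job has finished."""
        self._jobs.join()


def measure(device, job, wait):
    """Time one launch; if wait is True, synchronize before stopping the clock."""
    device.synchronize()
    start = time.perf_counter()
    device.launch(job)
    if wait:
        device.synchronize()
    elapsed = time.perf_counter() - start
    device.synchronize()             # drain the queue before the next measurement
    return elapsed


if __name__ == "__main__":
    device = FakeDevice()
    job = lambda: time.sleep(0.05)   # pretend the kernel takes 50 ms
    print("launch only :", measure(device, job, wait=False))
    print("synchronized:", measure(device, job, wait=True))
```

In this toy setup the job is long and the launch is cheap, so the two timings differ a lot. In the measurements above they are almost equal (about 2.3 ms each), which is what one would expect when the launch cost, not the computation, dominates.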

I ran code like this:

import pdb
import time
import torch

x = torch.rand([30,30])
x = x.cuda()
y = torch.rand([30,1])
y = y.cuda()

# warmup
w = torch.gesv(y,x)

start = time.time()
w = torch.gesv(y,x)
end = time.time()

print("do once gesv:")
print(end-start)

start = time.time()
for i in range(100):
  w = torch.gesv(y,x)
end = time.time()

print("do hundred gesv:")
print(end-start)

# warmup
w = torch.add(y,x)

start = time.time()
w = torch.add(y,x)
end = time.time()

print("do once add:")
print(end-start)

start = time.time()
for i in range(100):
  w = torch.add(y,x)
end = time.time()

print("do hundred add:")
print(end-start)

We can see that the time for “do ten gesv” is about ten times that of “do once gesv”, whereas the times for “do once add” and “do ten add” are almost the same. Why is that?

Here again, you’re missing all the torch.cuda.synchronize() calls needed to get proper timings.
If you add them, you’ll see that in both cases, doing it a hundred times is two orders of magnitude slower.