Solving linear equations is very slow

linyu · May 9, 2018, 3:35am

Hello!
I have to solve a linear system in PyTorch, and I use the “torch.gesv” function to do it. However, this function seems too slow on both GPU and CPU. Why?
Here is my code:

import pdb
import time
import torch

x = torch.rand([600, 600])
x = x.cuda()
y = torch.rand([600, 1])
y = y.cuda()

start = time.time()
# Solve the same system ten times and report the average.
for i in range(10):
    w = torch.gesv(y, x)
end = time.time()

print((end - start) / 10.0)

When the matrix size is 600, a single solve takes 43 ms on GPU and 28 ms on CPU, but in MATLAB it takes only 3 ms on CPU.
When the matrix size is 3600, a single solve takes 65 ms on GPU and 64 ms on CPU.
So I wonder why PyTorch is so slow, and why the GPU is slower than the CPU?
Thanks for your answer!

albanD · May 9, 2018, 9:04am

Hi,

These functions come from the BLAS/LAPACK libraries and are not implemented by the PyTorch backend directly.
Which BLAS library do you use with PyTorch? For the GPU, do you use MAGMA?

linyu · May 9, 2018, 12:06pm

Thank you for your reply!
I installed PyTorch using the command “conda install pytorch torchvision -c pytorch” and did nothing else. Do I need to install MAGMA and BLAS separately?

albanD · May 9, 2018, 3:44pm

I am not sure what is included in the pre-packaged version.
Do you have the mkl package and magma package installed in conda with pytorch?

linyu · May 10, 2018, 12:33am

How can I check it? I am sorry, I could not find the corresponding instructions. @albanD

linyu · May 10, 2018, 7:44am

Now, the key issue is that the GPU is much slower than the CPU when the matrix A is small. How can I solve this? I would like them to take about the same time. Thank you! @albanD

albanD · May 10, 2018, 8:46am

I don’t use conda personally. But something like conda list (or something similar, you may want to double check on Google) should give you the list of installed packages.

For small matrices, it is possible that the GPU runtime is larger than the CPU one. In particular, sending the job to the gpu and asking for the job to be done takes some time.
Also to get proper timings on gpu, you need to do the following every time you call time.time():

torch.cuda.synchronize()
current_time = time.time()
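
The pattern above can be wrapped in a small helper so that the synchronization is not forgotten on either side of the measurement. This is a minimal sketch, not part of PyTorch: the name `timed` and its `sync` parameter are hypothetical, and `sync` is assumed to be a callable such as torch.cuda.synchronize that blocks until pending GPU work has finished.

```python
import time


def timed(fn, sync=None, repeats=10):
    """Return the average wall-clock time of fn() in seconds.

    sync is an optional callable that blocks until outstanding
    asynchronous work is done (assumed here: torch.cuda.synchronize).
    """
    if sync is not None:
        sync()  # make sure earlier work does not leak into the measurement
    start = time.time()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()  # wait for the timed work itself to finish
    end = time.time()
    return (end - start) / repeats


# CPU-only demonstration that needs no GPU: time a small pure-Python workload.
avg = timed(lambda: sum(range(1000)), repeats=5)
print(avg)
```

On a GPU this would presumably be called as timed(lambda: torch.gesv(y, x), sync=torch.cuda.synchronize), with x and y already moved to the device.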

linyu · May 10, 2018, 8:50am

Yes, I added it, and here is my code

import pdb
import time
import torch

x = torch.rand([30,30])
x = x.cuda()
y = torch.rand([30,1])
y = y.cuda()

torch.cuda.synchronize()
start = time.time()
for i in range(10):
  w = torch.gesv(y,x)
end = time.time()

print((end - start) / 10.0)

x = torch.rand([30,30])
y = torch.rand([30,1])

w = torch.gesv(y,x)

start = time.time()
for i in range(10):
  w = torch.gesv(y,x)
end = time.time()

print((end - start) / 10.0)

The test result is 0.00462810993195 s on GPU and 2.31027603149e-05 s on CPU.
This has confused me for a long time. @albanD

albanD · May 10, 2018, 9:09am

This code is not correct: you need to add torch.cuda.synchronize() before every call to time.time(), not just the first one.
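
Why the synchronization before the second time.time() matters can be illustrated without a GPU. The sketch below is only an analogy, not PyTorch code: a background thread stands in for the GPU, `submit` for the asynchronous kernel launch, and `result()` for torch.cuda.synchronize(). The 50 ms sleep is an arbitrary stand-in workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_kernel():
    # Stand-in for work running on the device (arbitrary 50 ms).
    time.sleep(0.05)


with ThreadPoolExecutor(max_workers=1) as device:
    # Timing without waiting: only the cost of queueing the job is measured.
    start = time.perf_counter()
    job = device.submit(fake_kernel)
    launch_only = time.perf_counter() - start
    job.result()  # drain the queue before the next measurement

    # Timing with waiting: the job has actually finished when the clock stops.
    start = time.perf_counter()
    job = device.submit(fake_kernel)
    job.result()  # plays the role of torch.cuda.synchronize()
    launch_and_run = time.perf_counter() - start

print(launch_only, launch_and_run)
```

The first number reflects only how long it takes to hand the job over, while the second includes the work itself, which is the quantity a benchmark is usually after.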

linyu · May 10, 2018, 10:50am

Thank you for your reply. OK, now I have written the test code like this:

import pdb
import time
import torch

x = torch.rand([30,30])
x = x.cuda()
y = torch.rand([30,1])
y = y.cuda()

torch.cuda.synchronize()
start = time.time()
w = torch.gesv(y,x)
end = time.time()

print (end-start)

x = torch.rand([30,30])
y = torch.rand([30,1])

w = torch.gesv(y,x)

start = time.time()
w = torch.gesv(y,x)
end = time.time()

print (end-start)

The result is still 0.00907206535339 s on GPU and 5.60283660889e-05 s on CPU. Why?

albanD · May 10, 2018, 11:02am

Hi,

Here is the code that does proper timings:

import pdb
import time
import torch

x = torch.rand([30,30])
x = x.cuda()
y = torch.rand([30,1])
y = y.cuda()

torch.cuda.synchronize()
start = time.time()
w = torch.gesv(y,x)
torch.cuda.synchronize()
end = time.time()

print("GPU before warmup")
print(end - start)

x = torch.rand([30,30])
x = x.cuda()
y = torch.rand([30,1])
y = y.cuda()

torch.cuda.synchronize()
start = time.time()
w = torch.gesv(y,x)
torch.cuda.synchronize()
end = time.time()

print("GPU after warmup")
print(end - start)

x = torch.rand([30,30])
x = x.cuda()
y = torch.rand([30,1])
y = y.cuda()

torch.cuda.synchronize()
start = time.time()
w = torch.gesv(y,x)
# MISSING synch here, so we only measure the time to launch the kernel
# not the time to execute it
# torch.cuda.synchronize()
end = time.time()

print("GPU just launch time")
print(end - start)

x = torch.rand([30,30])
y = torch.rand([30,1])

w = torch.gesv(y,x)

torch.cuda.synchronize()
start = time.time()
w = torch.gesv(y,x)
torch.cuda.synchronize()
end = time.time()

print("CPU")
print(end - start)

From the results you can see that the time is mostly spent launching the job on the GPU, not actually doing any computation.
For very small workloads, it is expected that running on the GPU is slower, because the cost of launching the job is fixed and becomes dominant.
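
The effect of a fixed launch cost can be sketched with a toy model. Everything numeric below is an assumption chosen for illustration, not a measurement: the launch cost of 2 ms and the per-operation constants `k_cpu` and `k_gpu` are made up, and the only grounded part is that a dense linear solve scales roughly with n**3.

```python
def cpu_time(n, k_cpu=1e-9):
    # Assumed model: CPU time grows with the n**3 cost of a dense solve.
    return k_cpu * n ** 3


def gpu_time(n, launch=2e-3, k_gpu=1e-10):
    # Assumed model: fixed launch cost plus a faster n**3 term.
    return launch + k_gpu * n ** 3


for n in (30, 100, 300, 1000):
    faster = "GPU" if gpu_time(n) < cpu_time(n) else "CPU"
    print(n, cpu_time(n), gpu_time(n), faster)
```

Under these assumed constants the CPU wins for small n because the GPU never gets below its launch cost, and the GPU only pulls ahead once the n**3 term outweighs that fixed overhead. Real crossover points depend on the hardware and the libraries involved.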

linyu · May 10, 2018, 11:14am

Thank you very much for your attention!
Do you mean that most of the 0.009 s I measured is the startup time of the GPU?
So how can I measure the real execution time of w = torch.gesv(y,x) on GPU? @albanD

albanD · May 10, 2018, 11:22am

The timing "GPU after warmup" is the actual execution time. You can’t avoid paying this startup cost.

linyu · May 10, 2018, 11:26am

Thank you for your patience! Briefly, in my actual task, when I forward an image through a CNN model (on GPU), I then need to execute torch.gesv(y,x) 1000 times (also on GPU), where the size of x is 30*30. So is the execution time of a single torch.gesv(y,x) in my actual task 0.009 s or 5.6e-05 s? @albanD

linyu · May 10, 2018, 11:43am

I wonder whether every GPU operation in PyTorch needs a warmup, or whether torch.gesv is special?
Intuitively, after the first execution of torch.gesv the GPU should already be warmed up. So why does every call to torch.gesv need the warmup process? @albanD

albanD · May 10, 2018, 12:13pm

There is a warmup, but it only happens once, when the GPU is initialized.
After that, what you see is the time to launch the job: literally the time to send the instruction to the GPU to perform a given operation. Because of the hardware, this has a non-zero cost, especially compared to executing something on the CPU itself.
The fact that the GPU is slower than the CPU for very small operations is completely expected.

linyu · May 10, 2018, 12:17pm

So, as you say, when I execute

w = torch.gesv(y,x)
w = torch.gesv(y,x)
w = torch.gesv(y,x)
w = torch.gesv(y,x)
w = torch.gesv(y,x)

does the warmup happen only at the first line, or at every line?

albanD · May 10, 2018, 12:34pm

The results from the code on my machine are:

GPU before warmup
0.269862890244
GPU after warmup
0.00235605239868
GPU just launch time
0.00227999687195
CPU
2.8133392334e-05

This means:

The first GPU call of your program performs the initialization of your GPU (warmup) and thus takes a few hundred ms (or more if you have many GPUs).
Then the actual execution time whenever you run this line of code is about 2 ms in this case.
The third timing does not wait for the device to finish its computation, so it just measures how long it takes to ask the GPU to perform the operation.
The last one is the CPU time.

In this case, for the gesv operation, performing the operation on such a small matrix has a high overhead on the GPU.
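
For the scenario raised earlier in the thread (1000 solves of a 30x30 system per image), the timings above allow a rough estimate. This is a back-of-the-envelope sketch only: it assumes the solves run one after another, that each costs the same as the single call measured above, and it ignores any cost of moving data between GPU and CPU.

```python
# Per-call timings reported above (seconds, 30x30 system).
gpu_per_call = 0.00235605239868  # "GPU after warmup"
cpu_per_call = 2.8133392334e-05  # "CPU"

n_solves = 1000  # number of solves per image mentioned earlier in the thread

# Assumption: the solves run sequentially and each costs the same.
gpu_total = n_solves * gpu_per_call
cpu_total = n_solves * cpu_per_call

print("GPU total (s):", gpu_total)
print("CPU total (s):", cpu_total)
print("GPU / CPU ratio:", gpu_per_call / cpu_per_call)
```

With these numbers the GPU path comes to a bit under 2.4 s per image and the CPU path to roughly 0.03 s, so the per-call launch overhead, not the one-time warmup, is what dominates when many tiny solves are issued individually.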

linyu · May 10, 2018, 12:38pm

I ran code like this

import pdb
import time
import torch

x = torch.rand([30,30])
x = x.cuda()
y = torch.rand([30,1])
y = y.cuda()

# warmup
w = torch.gesv(y,x)

start = time.time()
w = torch.gesv(y,x)
end = time.time()

print("do once gesv:")
print(end - start)

start = time.time()
for i in range(100):
  w = torch.gesv(y,x)
end = time.time()

print("do hundred gesv:")
print(end - start)

# warmup
w = torch.add(y,x)

start = time.time()
w = torch.add(y,x)
end = time.time()

print("do once add:")
print(end - start)

start = time.time()
for i in range(100):
  w = torch.add(y,x)
end = time.time()

print("do hundred add:")
print(end - start)

We can see that the time of ‘do ten gesv’ is about ten times that of ‘do once gesv’, whereas the times of “do once add” and “do ten add” are almost the same. Why is that?

albanD · May 10, 2018, 12:43pm

Here again, you’re missing all the torch.cuda.synchronize() calls needed to get proper timings.
If you add them, you’ll see that in both cases, doing it a hundred times is two orders of magnitude slower.