GPU sometimes runs slower than CPU

I was writing a module and found that it runs slower in cuda() mode than in CPU mode. My test code below shows the same behavior.

'''Test code here'''
import torch
from torch.autograd import Variable
import torch.nn.functional as F
import functools
import time

#test variables
t1 = 44
t2 = 44
kernel_size = 3
embedding_size = 64
batch_size = 32

weight = Variable(torch.Tensor( 1, embedding_size*2, kernel_size))
bias = Variable(torch.Tensor(1))
matrix =  Variable(torch.abs(torch.randn(batch_size, embedding_size*2, 42*3)))

# test conv1d function
def times(fun):
    # Print the wall-clock time of each call to fun.
    @functools.wraps(fun)
    def wrapper(*args, **kwargs):
        start = time.time()
        ret = fun(*args, **kwargs)
        print(time.time() - start)
        return ret
    return wrapper

@times
def conv(matrix, weight, bias):
    return F.conv1d(matrix, weight, bias, stride=3)

# main test codes
_ = conv(matrix, weight, bias) #0.004507303237915039

# move all inputs to the GPU (input and weight must be on the same device)
matrix = matrix.cuda()
weight = weight.cuda()
bias = bias.cuda()
_ = conv(matrix, weight, bias) #0.599564790725708
_ = conv(matrix, weight, bias) #0.00015163421630859375
_ = conv(matrix, weight, bias) #9.274482727050781e-05

I tested it first on the CPU and got 0.0045s, then ran it three times on the GPU and got 0.5996s, 0.00015s and 0.00009s.

Why is the GPU slower than the CPU on the first call, and how can I avoid this problem in code like the following?

# conv by row (stride_x2 is built elsewhere in the module, not shown here)
matrix = []
for i in range(0, L1):
    sub_x1 = x1[:, i: i+self.kernel_size] # [batch, kernel_size, embedding]
    stride_x1 = sub_x1.repeat(1,L2,1) # [batch, kernel_size*L2, embedding]
    conved = torch.cat([stride_x2, stride_x1], dim=-1).transpose(1,2)
    matrix.append(F.conv1d(conved, self.weight, bias=self.bias, stride=self.kernel_size).squeeze())

Best wishes.

When PyTorch makes its first CUDA call, it needs to initialize CUDA, which is probably why the first GPU timing you see has additional overhead.

If you want to benchmark it more accurately, you have to make sure that CUDA is initialized before you start timing, for example with x = torch.randn(1).cuda() or even just torch.cuda.init().
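As a rough sketch of that warm-up step (the helper name warm_up_cuda is made up for illustration, and the CPU fallback is an assumption so that the snippet also runs on a machine without a GPU), the one-time initialization cost can be paid and measured separately before any benchmark starts:

```python
import time

import torch


def warm_up_cuda():
    """Force CUDA initialization. Returns True if a GPU was initialized."""
    if not torch.cuda.is_available():
        return False  # nothing to warm up on a CPU-only machine
    torch.cuda.init()          # initialize PyTorch's CUDA state
    torch.randn(1).cuda()      # a tiny transfer, as suggested above
    torch.cuda.synchronize()   # wait until the GPU has finished the work
    return True


start = time.perf_counter()
initialized = warm_up_cuda()
warm_up_seconds = time.perf_counter() - start
print(f"CUDA initialized: {initialized}, warm-up took {warm_up_seconds:.4f} s")
```

Anything timed after this point should no longer include the startup overhead.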

Also, it is worth noting that CUDA operations run asynchronously, which means that you won't get correct timings unless you synchronize the device, e.g. by calling torch.cuda.synchronize() before you read the clock.
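Putting the two points together, a timing helper could look roughly like the sketch below. The name time_conv, the repeats parameter and the CPU fallback are assumptions for illustration. The tensor shapes follow the test code in the question.

```python
import time

import torch
import torch.nn.functional as F


def time_conv(matrix, weight, bias, repeats=10):
    """Return (average seconds per call, last output) for F.conv1d with stride 3."""
    on_gpu = matrix.is_cuda
    # Untimed warm-up call so that one-time setup cost is excluded.
    out = F.conv1d(matrix, weight, bias, stride=3)
    if on_gpu:
        torch.cuda.synchronize()  # wait for the warm-up kernel to finish
    start = time.perf_counter()
    for _ in range(repeats):
        out = F.conv1d(matrix, weight, bias, stride=3)
    if on_gpu:
        torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) / repeats, out


# Assumption: fall back to the CPU when no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
matrix = torch.randn(32, 128, 126, device=device).abs()
weight = torch.randn(1, 128, 3, device=device)
bias = torch.randn(1, device=device)

seconds, out = time_conv(matrix, weight, bias)
print(f"{device}: {seconds:.6f} s per call, output shape {tuple(out.shape)}")
```

Averaging over several calls also smooths out the noise you get when timing a single very short kernel.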

Thanks for your replies @richard @SimonW!
Yes. During model training I timed the convolution function and found that the computation is actually as fast as the last two GPU test cases shown above.
Maybe something else increases the computing time during the backward pass. I will try to find it and post it here later :smiley: Thank you.