I am writing a module and found that it runs slower in cuda() mode than in CPU mode. My test code below shows the same behavior.

```
'''Test code here'''
import torch
from torch.autograd import Variable
import torch.nn.functional as F
import functools
import time

# test variables
t1 = 44
t2 = 44
kernel_size = 3
embedding_size = 64
batch_size = 32
weight = Variable(torch.Tensor(1, embedding_size*2, kernel_size))
bias = Variable(torch.Tensor(1))
matrix = Variable(torch.abs(torch.randn(batch_size, embedding_size*2, 42*3)))

# timing decorator: prints the wall-clock time of each call
def times(fun):
    @functools.wraps(fun)
    def wrapper(*args, **kwargs):
        start = time.time()
        ret = fun(*args, **kwargs)
        print(time.time() - start)
        return ret
    return wrapper

# test conv1d function
@times
def conv(matrix, weight, bias):
    return F.conv1d(matrix, weight, bias, stride=3)

# main test code
## CPU
_ = conv(matrix, weight, bias)  # 0.004507303237915039
matrix = matrix.cuda()
weight = weight.cuda()
bias = bias.cuda()
## GPU
_ = conv(matrix, weight, bias)  # 0.599564790725708
_ = conv(matrix, weight, bias)  # 0.00015163421630859375
_ = conv(matrix, weight, bias)  # 9.274482727050781e-05
```

I ran it first on the CPU and got 0.0045 s, then three times on the GPU and got 0.5996 s, 0.00015 s, and 0.00009 s.
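I am not sure these numbers are a fair comparison: CUDA kernels launch asynchronously, and the first CUDA call also pays one-time initialization costs. Below is a minimal sketch of a more careful measurement, assuming a recent PyTorch where plain tensors work without `Variable`. The helper name `timed_conv` and the repeat count are hypothetical and not part of my module; the script falls back to the CPU when no GPU is available.

```python
import time

import torch
import torch.nn.functional as F


def timed_conv(matrix, weight, bias, repeats=10):
    """Hypothetical helper: average seconds per F.conv1d call after one warm-up call."""
    use_cuda = matrix.is_cuda
    # Warm-up call: absorbs one-time costs such as CUDA context and cuDNN setup.
    F.conv1d(matrix, weight, bias, stride=3)
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        out = F.conv1d(matrix, weight, bias, stride=3)
    if use_cuda:
        torch.cuda.synchronize()  # kernels run asynchronously; wait until they finish
    return (time.perf_counter() - start) / repeats, out


# Same shapes as in the test code above: batch 32, 128 channels, length 126.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
matrix = torch.abs(torch.randn(32, 128, 126, device=device))
weight = torch.randn(1, 128, 3, device=device)
bias = torch.randn(1, device=device)

seconds, out = timed_conv(matrix, weight, bias)
print(device.type, seconds, tuple(out.shape))
```

Timing this way separates the one-time startup cost from the steady-state cost per call, which is probably what matters inside a training loop.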

Why is the GPU slower than the CPU on the first call, and how can I avoid this problem in code like the following?

```
# conv by row
matrix = []
for i in range(0, L1):
    sub_x1 = x1[:, i: i+self.kernel_size]  # [batch, kernel_size, embedding]
    stride_x1 = sub_x1.repeat(1, L2, 1)  # [batch, kernel_size*L2, embedding]
    conved = torch.cat([stride_x2, stride_x1], dim=-1).transpose(1, 2)
    matrix.append(F.conv1d(conved, self.weight, bias=self.bias, stride=self.kernel_size).squeeze())
```

Best wishes.