Conv2d is slow in PyTorch built from source

I built PyTorch from source (v1.0.0), but Conv2d is slower than the version installed with the conda install command.

conv2d (source build): cpu time - 709909.681us, cuda time - 709897.758us
conv2d (conda install): cpu time - 3431.992us, cuda time - 3430.720us

source code


import torch
import torch.nn as nn

class convT(nn.Module):

    def __init__(self):
        super(convT, self).__init__()
        self.conv1 = nn.Conv2d(1, 1, 5)

    def forward(self, input):
        x = self.conv1(input)
        return x

if __name__ == '__main__':
    device = torch.device('cuda')

    x = torch.randn(1, 1, 200, 200)
    net = convT()
    net.to(device)
    x = x.cuda()

    torch.cuda.synchronize()
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        out = net(x)
    torch.cuda.synchronize()

    print(prof.key_averages().table(sort_by='cuda_time_total'))
    print(out.size())

environment config

PyTorch version: 1.0.0a0+bb15580
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.14.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: GeForce GTX TITAN X
GPU 1: GeForce GTX TITAN X
GPU 2: GeForce GTX TITAN X
GPU 3: GeForce GTX TITAN X
GPU 4: GeForce GTX TITAN X
GPU 5: GeForce GTX TITAN X
GPU 6: GeForce GTX TITAN X
GPU 7: GeForce GTX TITAN X

Nvidia driver version: 384.81
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so
/usr/lib/x86_64-linux-gnu/libcudnn.so.7
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_static.a
/usr/lib/x86_64-linux-gnu/tmp/libcudnn.so.7.1.3
/usr/local/cuda-9.0/lib64/libcudnn.so
/usr/local/cuda-9.0/lib64/libcudnn.so.7
/usr/local/cuda-9.0/lib64/libcudnn.so.7.4.1
/usr/local/cuda-9.0/lib64/libcudnn_static.a

Versions of relevant libraries:
[pip] Could not collect
[conda] magma-cuda90 2.5.0 1 pytorch
[conda] torch 1.0.0a0+bb15580 pypi_0 pypi
[conda] torchfile 0.1.0 pypi_0 pypi


I know that in general, for Intel CPUs, conda uses MKL (Math Kernel Library), a highly tuned BLAS implementation for Intel CPUs. That could be one of the reasons for the CPU performance difference (you typically see similar performance gaps with NumPy when comparing conda and source/pip installations). However, I think on a GPU you won't notice that much of a difference.
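If you want to verify which BLAS backend your build picked up, one quick check (a minimal sketch, assuming a build where torch.backends.mkl is available) is:

import torch

# True if this build was compiled against Intel MKL for CPU BLAS ops
print(torch.backends.mkl.is_available())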

I'm facing a similar issue; I only compared the GPU numbers. Were you able to locate the source of the high latency?

Another observation from my end.

If you do a dummy convolution operation with the input first and then carry on with the network flow, you will get correct timings for the network:
def forward(self, x):
    # first (dummy) call: absorbs the one-time startup cost
    start0 = time.time()
    x1 = self.conv2d(x)
    end0 = time.time()
    time0 = (end0 - start0) * 1000
    print('Dummy conv2d-depth - %f' % time0)

    # second call: measures the steady-state latency
    start0 = time.time()
    x1 = self.conv2d(x)
    end0 = time.time()
    time0 = (end0 - start0) * 1000
    print('conv2d-depth - %f' % time0)

I suspect the latency spike is related to GPU initialization in native PyTorch code.
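One rough way to check how much of the spike is plain context creation (a sketch, assuming torch.cuda.init() is available in this build; it just eagerly initializes PyTorch's CUDA state):

import time
import torch

start = time.time()
torch.cuda.init()             # force CUDA context creation up front
torch.cuda.synchronize()
print('CUDA init - %f ms' % ((time.time() - start) * 1000))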

@smth, I need some pointers to locate the origin of the latency increase.

Thanks in advance

Note that CUDA operations are executed asynchronously, so you should add manual synchronization points before starting and stopping the timer using torch.cuda.synchronize(); otherwise you might just be timing the kernel launch.
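For reference, a minimal sketch of that pattern (reusing the net and x defined in the first snippet):

import time
import torch

torch.cuda.synchronize()      # wait for any pending work before starting the timer
start = time.time()
out = net(x)
torch.cuda.synchronize()      # wait for the conv kernel to finish before stopping it
print('conv2d - %f ms' % ((time.time() - start) * 1000))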

I made the changes… the problem remains.

The first call takes 420 ms and the second call (the same operations) takes 1.96 ms…

Could you post the code you are using to profile these operations?

Hi ptrblck,

PyTorch is installed from source.
During the first call, cuBLAS handle creation takes time; the same handle is reused in the second call. I believe this is the reason for the higher latency of the first call.
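If that is the case, a warm-up pass before profiling should hide the one-time cost and make the source and conda builds comparable. A minimal sketch of the idea:

import torch
import torch.nn as nn

device = torch.device('cuda')
net = nn.Conv2d(1, 1, 5).to(device)
x = torch.randn(1, 1, 200, 200, device=device)

# Warm-up: the first CUDA call pays one-time costs (context creation,
# cuBLAS/cuDNN handle creation), so run one pass and discard it.
_ = net(x)
torch.cuda.synchronize()

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    out = net(x)
torch.cuda.synchronize()

print(prof.key_averages().table(sort_by='cuda_time_total'))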