Conv2d is slow in PyTorch built from source

I built PyTorch from source (v1.0.0), but Conv2d is slower than the version installed with the conda install command.

conv2d (source build): cpu time - 709909.681us, cuda time - 709897.758us
conv2d (conda install): cpu time - 3431.992us, cuda time - 3430.720us

source code


import torch
import torch.nn as nn

class convT(nn.Module):

    def __init__(self):
        super(convT, self).__init__()
        self.conv1 = nn.Conv2d(1, 1, 5)

    def forward(self, input):
        x = self.conv1(input)
        return x

if __name__ == '__main__':
    device = torch.device('cuda')

    x = torch.randn(1, 1, 200, 200)
    net = convT()
    net.to(device)
    x = x.cuda()

    torch.cuda.synchronize()
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        out = net(x)
    torch.cuda.synchronize()

    print(prof.key_averages().table(sort_by='cuda_time_total'))
    print(out.size())

environment config

PyTorch version: 1.0.0a0+bb15580
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.14.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: GeForce GTX TITAN X
GPU 1: GeForce GTX TITAN X
GPU 2: GeForce GTX TITAN X
GPU 3: GeForce GTX TITAN X
GPU 4: GeForce GTX TITAN X
GPU 5: GeForce GTX TITAN X
GPU 6: GeForce GTX TITAN X
GPU 7: GeForce GTX TITAN X

Nvidia driver version: 384.81
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so
/usr/lib/x86_64-linux-gnu/libcudnn.so.7
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_static.a
/usr/lib/x86_64-linux-gnu/tmp/libcudnn.so.7.1.3
/usr/local/cuda-9.0/lib64/libcudnn.so
/usr/local/cuda-9.0/lib64/libcudnn.so.7
/usr/local/cuda-9.0/lib64/libcudnn.so.7.4.1
/usr/local/cuda-9.0/lib64/libcudnn_static.a

Versions of relevant libraries:
[pip] Could not collect
[conda] magma-cuda90 2.5.0 1 pytorch
[conda] torch 1.0.0a0+bb15580 pypi_0 pypi
[conda] torchfile 0.1.0 pypi_0 pypi


I know that in general, for Intel CPUs, conda uses MKL (Math Kernel Library), a highly tuned BLAS implementation for Intel CPUs. That could be one of the reasons for the CPU performance difference (you typically see similar performance gaps with NumPy when comparing conda and source/pip installations). However, I think on a GPU you won't notice that much of a difference.
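If you want to verify which BLAS backend your build picked up, one quick check (a minimal sketch, assuming a build where torch.backends.mkl is available) is:

import torch

# True if this build was compiled against Intel MKL for CPU BLAS ops
print(torch.backends.mkl.is_available())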

I'm facing a similar issue; I only compared the GPU numbers. Were you able to locate the source of the high latency?

Another observation from my end.

If you do a dummy convolution operation with the input first and then carry on with the network flow, you will get correct timings for the network:
def forward(self, x):
    # first (dummy) call: absorbs the one-time startup cost
    start0 = time.time()
    x1 = self.conv2d(x)
    end0 = time.time()
    time0 = (end0 - start0) * 1000
    print('Dummy conv2d-depth - %f' % time0)

    # second call: measures the steady-state latency
    start0 = time.time()
    x1 = self.conv2d(x)
    end0 = time.time()
    time0 = (end0 - start0) * 1000
    print('conv2d-depth - %f' % time0)

I suspect the latency spike is related to GPU initialization in native PyTorch code.
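One rough way to check how much of the spike is plain context creation (a sketch, assuming torch.cuda.init() is available in this build; it just eagerly initializes PyTorch's CUDA state):

import time
import torch

start = time.time()
torch.cuda.init()             # force CUDA context creation up front
torch.cuda.synchronize()
print('CUDA init - %f ms' % ((time.time() - start) * 1000))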

@smth, I need some pointers to locate the origin of the latency increase.

Thanks in advance

Note that CUDA operations are executed asynchronously, so you should add manual synchronization points before starting and stopping the timer using torch.cuda.synchronize(); otherwise you might just be timing the kernel launch.
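For reference, a minimal sketch of that pattern (reusing the net and x defined in the first snippet):

import time
import torch

torch.cuda.synchronize()      # wait for any pending work before starting the timer
start = time.time()
out = net(x)
torch.cuda.synchronize()      # wait for the conv kernel to finish before stopping it
print('conv2d - %f ms' % ((time.time() - start) * 1000))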

I made the changes… the problem remains.

The first call takes 420 ms and the second call (the same operations) takes 1.96 ms…

Could you post the code you are using to profile these operations?

Hi ptrblck,

PyTorch is installed from source.
During the first call, cuBLAS handle creation takes time; the same handle is reused in the second call. I believe this is the reason for the higher latency of the first call.
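If that is the case, a warm-up pass before profiling should hide the one-time cost and make the source and conda builds comparable. A minimal sketch of the idea:

import torch
import torch.nn as nn

device = torch.device('cuda')
net = nn.Conv2d(1, 1, 5).to(device)
x = torch.randn(1, 1, 200, 200, device=device)

# Warm-up: the first CUDA call pays one-time costs (context creation,
# cuBLAS/cuDNN handle creation), so run one pass and discard it.
_ = net(x)
torch.cuda.synchronize()

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    out = net(x)
torch.cuda.synchronize()

print(prof.key_averages().table(sort_by='cuda_time_total'))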