PyTorch 1.6 (and 1.5) is slower than 1.4 using torch.autograd.profiler

My profiling code is as follows:

    net = torchvision.models.resnext50_32x4d(pretrained=True).cuda()
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        predict = net(input)

Code and results can be found at:

Here are excerpts of the results for PyTorch 1.6 (first) and 1.4 (second):

is_leaf                      0.00%            1.850us          0.00%            1.850us          1.850us          0.00%            1.024us          1.024us          1                []                                   
is_leaf                      0.00%            2.119us          0.00%            2.119us          2.119us          0.00%            2.048us          2.048us          1                []                                   
max_pool2d_with_indices      0.05%            479.919us        0.05%            479.919us        479.919us        0.19%            2.061ms          2.061ms          1                []                                   
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 1.034s
CUDA time total: 1.097s

accuracy 0.687500

	torch.__version__              = 1.6.0.dev20200516
	torch.version.cuda             = 10.2
	torch.backends.cudnn.version() = 7605
		torch.backends.cudnn.benchmark      = False
		torch.backends.cudnn.enabled        = True
		torch.backends.cudnn.deterministic  = False
	torch.cuda.device_count()      = 1
		torch.cuda.get_device_properties()  = _CudaDeviceProperties(name='TITAN X (Pascal)', major=6, minor=1, total_memory=12192MB, multi_processor_count=28)
		torch.cuda.memory_allocated()       = 0 GB
		torch.cuda.memory_reserved()        = 8 GB

cudnn_convolution            0.12%            533.067us        0.12%            533.067us        533.067us        0.63%            4.750ms          4.750ms          1                []                                   
add                          0.00%            21.091us         0.00%            21.091us         21.091us         0.00%            9.215us          9.215us          1                []                                   
batch_norm                   0.07%            324.928us        0.07%            324.928us        324.928us        0.23%            1.723ms          1.723ms          1                []                                   
_batch_norm_impl_index       0.07%            318.668us        0.07%            318.668us        318.668us        0.23%            1.721ms          1.721ms          1                []                                   
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 457.948ms
CUDA time total: 750.666ms

accuracy 0.687500

	torch.__version__              = 1.4.0
	torch.version.cuda             = 10.1
	torch.backends.cudnn.version() = 7603
		torch.backends.cudnn.benchmark      = False
		torch.backends.cudnn.enabled        = True
		torch.backends.cudnn.deterministic  = False
	torch.cuda.device_count()      = 1
		torch.cuda.get_device_properties()  = _CudaDeviceProperties(name='TITAN X (Pascal)', major=6, minor=1, total_memory=12192MB, multi_processor_count=28)
		torch.cuda.memory_allocated()       = 0 GB
		torch.cuda.memory_reserved()        = 8 GB

I see similar results on another machine with a GTX 1080 Ti.
Why is there such a large difference in the results?

Could you please post the slower layers here?
To isolate the issue further, I would also recommend using the same CUDA and cuDNN versions for the different PyTorch versions. Otherwise you are facing a lot of variables that might affect the performance.

I have now switched to the same CUDA and cuDNN versions, and the results are similar to before. The network is the standard torchvision resnet50 model. I have also included the profiler's Chrome trace.

Here are the results:

Using PyTorch 1.6:

output_nr                    0.00%            1.705us          0.00%            1.705us          1.705us          0.00%            2.048us          2.048us          1                []                                   
is_leaf                      0.00%            1.823us          0.00%            1.823us          1.823us          0.00%            1.023us          1.023us          1                []                                   
is_leaf                      0.00%            1.717us          0.00%            1.717us          1.717us          0.00%            1.024us          1.024us          1                []                                   
max_pool2d_with_indices      0.05%            486.468us        0.05%            486.468us        486.468us        0.17%            1.756ms          1.756ms          1                []                                   
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 987.386ms
CUDA time total: 1.046s

	torch.__version__              = 1.6.0.dev20200516+cu101
	torch.version.cuda             = 10.1
	torch.backends.cudnn.version() = 7603
		torch.backends.cudnn.benchmark      = False
		torch.backends.cudnn.enabled        = True
		torch.backends.cudnn.deterministic  = False
	torch.cuda.device_count()      = 1
		torch.cuda.get_device_properties()  = _CudaDeviceProperties(name='TITAN X (Pascal)', major=6, minor=1, total_memory=12192MB, multi_processor_count=28)
		torch.cuda.memory_allocated()       = 0 GB
		torch.cuda.memory_reserved()        = 8 GB

Using PyTorch 1.4:

cudnn_convolution            0.16%            842.643us        0.16%            842.643us        842.643us        0.98%            8.491ms          8.491ms          1                []                                   
add                          0.00%            24.037us         0.00%            24.037us         24.037us         0.00%            9.215us          9.215us          1                []                                   
batch_norm                   0.06%            335.818us        0.06%            335.818us        335.818us        0.15%            1.270ms          1.270ms          1                []                                   
_batch_norm_impl_index       0.06%            329.220us        0.06%            329.220us        329.220us        0.15%            1.268ms          1.268ms          1                []                                   
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 534.585ms
CUDA time total: 863.694ms

accuracy 0.687500

	torch.__version__              = 1.4.0
	torch.version.cuda             = 10.1
	torch.backends.cudnn.version() = 7603
		torch.backends.cudnn.benchmark      = False
		torch.backends.cudnn.enabled        = True
		torch.backends.cudnn.deterministic  = False
	torch.cuda.device_count()      = 1
		torch.cuda.get_device_properties()  = _CudaDeviceProperties(name='TITAN X (Pascal)', major=6, minor=1, total_memory=12192MB, multi_processor_count=28)
		torch.cuda.memory_allocated()       = 0 GB
		torch.cuda.memory_reserved()        = 8 GB

Again, all code and results can be found at: pytorch_profile - Google Drive

Thanks for the additional run with the fixed libs!
We’ll triage this issue and check where the difference is coming from.

@albanD do you have any quick idea where the performance regression might be coming from?

Not really.
Note that with matching versions, the difference is not as big.
Not sure why the conv does not show up any more in the profile…
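
To put rough numbers on this, the totals quoted earlier in the thread can be turned into slowdown factors. This is only an arithmetic sketch: the helper names (`to_seconds`, `slowdown`) are made up for illustration, and the values are copied from the profiler summaries above.

```python
# Totals copied from the profiler summaries in this thread.
# Helper names are hypothetical and exist only for this illustration.

def to_seconds(text):
    """Convert a profiler total such as '457.948ms' or '1.034s' to seconds."""
    # Check the two-letter suffixes before the bare 's'.
    for suffix, scale in (("us", 1e-6), ("ms", 1e-3), ("s", 1.0)):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * scale
    raise ValueError(f"unrecognised time value: {text!r}")


def slowdown(new, old):
    """Ratio of the newer version's total to the older version's total."""
    return to_seconds(new) / to_seconds(old)


# Each entry is (PyTorch 1.6 total, PyTorch 1.4 total).
runs = {
    "different CUDA/cuDNN": {"cpu": ("1.034s", "457.948ms"),
                             "cuda": ("1.097s", "750.666ms")},
    "matching CUDA/cuDNN": {"cpu": ("987.386ms", "534.585ms"),
                            "cuda": ("1.046s", "863.694ms")},
}

for label, totals in runs.items():
    for kind, (new, old) in totals.items():
        print(f"{label:22s} {kind:5s} 1.6 / 1.4 = {slowdown(new, old):.2f}x")
```

With these totals the CPU ratio goes from about 2.26x to about 1.85x and the CUDA ratio from about 1.46x to about 1.21x, so the gap narrows with matching libraries but does not disappear.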

I’ve encountered a similar problem and discovered that it comes down to the contiguity of the input.

Try this:

    prediction = net(input.contiguous())

You can read about the breaking changes in the release notes for version 1.5.0 (look for “contiguous”):

Script for reproducing the slower runtime in PyTorch >= 1.5:

    import numpy as np
    import torch
    import torchvision

    # Assumption: the model is moved to the GPU so that it matches the input device.
    net = torchvision.models.resnet50().to("cuda")

    # Build NHWC data, then view it as NCHW without copying (non-contiguous).
    images_array = np.random.randn(4, 224, 224, 3).astype(np.float32)
    images_array = np.rollaxis(images_array, 3, 1)
    images_tensor = torch.from_numpy(images_array)
    images_tensor = images_tensor.to("cuda")
    print(images_tensor.is_contiguous())  # False

    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        prediction = net(images_tensor)  # slower
        # prediction = net(images_tensor.contiguous())  # faster
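
The `False` printed by `is_contiguous()` follows from the strides: `np.rollaxis` only permutes the axes of the NHWC array, so the NCHW-shaped view keeps the NHWC memory layout. The sketch below recomputes the strides in plain Python (in elements, not bytes). The helper names are made up for illustration, and it does not call PyTorch or NumPy.

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) strides, in elements, for the given shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def permute(values, order):
    """Reorder a tuple of per-axis values according to an axis order."""
    return tuple(values[i] for i in order)


nhwc_shape = (4, 224, 224, 3)
nhwc_strides = contiguous_strides(nhwc_shape)  # (150528, 672, 3, 1)

# np.rollaxis(a, 3, 1) moves axis 3 to position 1, i.e. axis order (0, 3, 1, 2).
order = (0, 3, 1, 2)
nchw_shape = permute(nhwc_shape, order)      # (4, 3, 224, 224)
view_strides = permute(nhwc_strides, order)  # (150528, 1, 672, 3)

# A contiguous NCHW tensor would need strides (150528, 50176, 224, 1).
print(view_strides == contiguous_strides(nchw_shape))  # False
```

The channel axis ends up with stride 1, which is the layout PyTorch refers to as channels-last.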

Thanks for the details !

cc @VitalyFedyunin who worked on this.

Thanks @arnowaczynski for the accurate description of the case. Indeed, you can end up with channels-last inputs on hardware that is not optimized for them.

The only thing I can add here is that images_tensor = images_tensor.to("cuda", memory_format=torch.contiguous_format) will be faster than images_tensor.contiguous()
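
A minimal sketch of the two options side by side, assuming a recent PyTorch and NumPy install. It falls back to the CPU when no GPU is present so it can be run anywhere, the variable names are illustrative, and it only checks the resulting layout, not the speed.

```python
import numpy as np
import torch

# Assumption: use the GPU when available, otherwise run the same calls on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Build a non-contiguous NCHW view of NHWC data, as in the script above.
images_array = np.random.randn(4, 224, 224, 3).astype(np.float32)
images_tensor = torch.from_numpy(np.rollaxis(images_array, 3, 1))
print(images_tensor.is_contiguous())  # False

# Option 1: move to the device first, then make the result contiguous.
a = images_tensor.to(device).contiguous()

# Option 2: request the contiguous layout as part of the transfer.
b = images_tensor.to(device, memory_format=torch.contiguous_format)

print(a.is_contiguous(), b.is_contiguous())  # True True
print(torch.equal(a, b))                     # True
```

Both options give the network the same values in the same layout; the second simply avoids a separate re-layout step after the transfer.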

I also observed a slowdown; for me, it was a factor of 2.5 on CPU on an Intel Mac.
In my case, it was for a matrix multiplication-based application.
(Just as an additional input to whom it may concern.)