Pytorch 1.6 (and 1.5) is slower than 1.4 using torch.autograd.profiler

Hengck · May 17, 2020, 2:58am

My profiling code is as follows:

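    import torch
    import torchvision

    net = torchvision.models.resnext50_32x4d(pretrained=True).cuda()
    net.train()
    # `input` must be a batch of images on the GPU; its construction is not shown here
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        predict = net(input)
    print(prof)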

The code and results can be found at: https://drive.google.com/open?id=1vyTkqBpQwUvCSnGjjJPdzIzLTEfJF8Sr

Here are snippets of the results for PyTorch 1.6 (first) and 1.4 (second); each is followed by its environment info:

...
is_leaf                      0.00%            1.850us          0.00%            1.850us          1.850us          0.00%            1.024us          1.024us          1                []                                   
is_leaf                      0.00%            2.119us          0.00%            2.119us          2.119us          0.00%            2.048us          2.048us          1                []                                   
max_pool2d_with_indices      0.05%            479.919us        0.05%            479.919us        479.919us        0.19%            2.061ms          2.061ms          1                []                                   
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 1.034s
CUDA time total: 1.097s

accuracy 0.687500

pytorch
	torch.__version__              = 1.6.0.dev20200516
	torch.version.cuda             = 10.2
	torch.backends.cudnn.version() = 7605
		torch.backends.cudnn.benchmark      = False
		torch.backends.cudnn.enabled        = True
		torch.backends.cudnn.deterministic  = False
	torch.cuda.device_count()      = 1
		torch.cuda.get_device_properties()  = _CudaDeviceProperties(name='TITAN X (Pascal)', major=6, minor=1, total_memory=12192MB, multi_processor_count=28)
		torch.cuda.memory_allocated()       = 0 GB
		torch.cuda.memory_reserved()        = 8 GB

cudnn_convolution            0.12%            533.067us        0.12%            533.067us        533.067us        0.63%            4.750ms          4.750ms          1                []                                   
add                          0.00%            21.091us         0.00%            21.091us         21.091us         0.00%            9.215us          9.215us          1                []                                   
batch_norm                   0.07%            324.928us        0.07%            324.928us        324.928us        0.23%            1.723ms          1.723ms          1                []                                   
_batch_norm_impl_index       0.07%            318.668us        0.07%            318.668us        318.668us        0.23%            1.721ms          1.721ms          1                []                                   
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 457.948ms
CUDA time total: 750.666ms

accuracy 0.687500

pytorch
	torch.__version__              = 1.4.0
	torch.version.cuda             = 10.1
	torch.backends.cudnn.version() = 7603
		torch.backends.cudnn.benchmark      = False
		torch.backends.cudnn.enabled        = True
		torch.backends.cudnn.deterministic  = False
	torch.cuda.device_count()      = 1
		torch.cuda.get_device_properties()  = _CudaDeviceProperties(name='TITAN X (Pascal)', major=6, minor=1, total_memory=12192MB, multi_processor_count=28)
		torch.cuda.memory_allocated()       = 0 GB
		torch.cuda.memory_reserved()        = 8 GB

I made similar observations on another machine with a GTX 1080Ti.
Why is there such a large difference in the results?

ptrblck · May 17, 2020, 7:09am

Could you please post the slower layers here?
To isolate the issue further, I would also recommend using the same CUDA and cuDNN versions for the different PyTorch versions. Otherwise you are dealing with a lot of variables that might affect the performance.
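One way to surface the slower layers is to aggregate the profiler events per operator instead of printing every event. The following is only a minimal sketch: the small `nn.Sequential` network and the random `batch` are stand-ins chosen for illustration (the thread profiles a torchvision model), and the fallback to CPU when no GPU is present is an assumption so that the snippet runs anywhere.

```python
import torch
import torch.nn as nn

# Small stand-in network (an assumption for illustration, not the model from the thread).
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)
net.train()

# Record CUDA timings only when a GPU is actually available.
use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
net.to(device)
batch = torch.randn(2, 3, 32, 32, device=device)

with torch.autograd.profiler.profile(use_cuda=use_cuda) as prof:
    output = net(batch)

# key_averages() merges all events that share an operator name,
# so each operator appears once with its accumulated time.
averages = prof.key_averages()
table = averages.table(sort_by="self_cpu_time_total", row_limit=10)
print(table)
```

Running the same script under both PyTorch versions and comparing the top rows should make it easier to see which operators account for the difference.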

Hengck · May 18, 2020, 10:16am

I have now switched to the same CUDA and cuDNN versions, and the results are similar to before. The network is the standard torchvision resnet50 model. I have also included the profiler's Chrome trace.

Here are the results:

Using PyTorch 1.6:

output_nr                    0.00%            1.705us          0.00%            1.705us          1.705us          0.00%            2.048us          2.048us          1                []                                   
is_leaf                      0.00%            1.823us          0.00%            1.823us          1.823us          0.00%            1.023us          1.023us          1                []                                   
is_leaf                      0.00%            1.717us          0.00%            1.717us          1.717us          0.00%            1.024us          1.024us          1                []                                   
max_pool2d_with_indices      0.05%            486.468us        0.05%            486.468us        486.468us        0.17%            1.756ms          1.756ms          1                []                                   
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 987.386ms
CUDA time total: 1.046s

pytorch
	torch.__version__              = 1.6.0.dev20200516+cu101
	torch.version.cuda             = 10.1
	torch.backends.cudnn.version() = 7603
		torch.backends.cudnn.benchmark      = False
		torch.backends.cudnn.enabled        = True
		torch.backends.cudnn.deterministic  = False
	torch.cuda.device_count()      = 1
		torch.cuda.get_device_properties()  = _CudaDeviceProperties(name='TITAN X (Pascal)', major=6, minor=1, total_memory=12192MB, multi_processor_count=28)
		torch.cuda.memory_allocated()       = 0 GB
		torch.cuda.memory_reserved()        = 8 GB

Using PyTorch 1.4:

cudnn_convolution            0.16%            842.643us        0.16%            842.643us        842.643us        0.98%            8.491ms          8.491ms          1                []                                   
add                          0.00%            24.037us         0.00%            24.037us         24.037us         0.00%            9.215us          9.215us          1                []                                   
batch_norm                   0.06%            335.818us        0.06%            335.818us        335.818us        0.15%            1.270ms          1.270ms          1                []                                   
_batch_norm_impl_index       0.06%            329.220us        0.06%            329.220us        329.220us        0.15%            1.268ms          1.268ms          1                []                                   
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 534.585ms
CUDA time total: 863.694ms

accuracy 0.687500

pytorch
	torch.__version__              = 1.4.0
	torch.version.cuda             = 10.1
	torch.backends.cudnn.version() = 7603
		torch.backends.cudnn.benchmark      = False
		torch.backends.cudnn.enabled        = True
		torch.backends.cudnn.deterministic  = False
	torch.cuda.device_count()      = 1
		torch.cuda.get_device_properties()  = _CudaDeviceProperties(name='TITAN X (Pascal)', major=6, minor=1, total_memory=12192MB, multi_processor_count=28)
		torch.cuda.memory_allocated()       = 0 GB
		torch.cuda.memory_reserved()        = 8 GB

Again, all code and results can be found at: pytorch_profile - Google Drive

ptrblck · May 18, 2020, 7:34pm

Thanks for the additional run with the fixed libs!
We’ll triage this issue and check where the difference is coming from.

@albanD do you have any quick idea where the performance regression might be coming from?

albanD · May 18, 2020, 7:39pm

Not really.
Note that with matching versions, the difference is not as big.
Not sure why the conv does not show up any more on the profiling…

arnowaczynski · July 10, 2020, 2:53pm

@Hengck
I’ve encountered a similar problem and discovered that it is caused by the contiguity of the input.

Try this:

prediction = net(input.contiguous())

You can read about the breaking changes in the release notes for version 1.5.0 (look for “contiguous”): https://github.com/pytorch/pytorch/releases/tag/v1.5.0

Script for reproducing the slower runtime in PyTorch >= 1.5:

import numpy as np
import torch
import torchvision

net = torchvision.models.resnet50()
net.train()
net.cuda()

images_array = np.random.randn(4, 224, 224, 3).astype(np.float32)
images_array = np.rollaxis(images_array, 3, 1)
images_tensor = torch.from_numpy(images_array)
images_tensor = images_tensor.to(device="cuda")
print(images_tensor.is_contiguous())  # False

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    prediction = net(images_tensor)  # slower
    # prediction = net(images_tensor.contiguous())  # faster
print(prof)
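To see why the tensor built this way is not contiguous, it can help to look at its strides. This is a minimal CPU-only sketch that reuses the array construction from the script; the names `original_strides`, `packed` and `packed_strides` are introduced here only for illustration.

```python
import numpy as np
import torch

images_array = np.random.randn(4, 224, 224, 3).astype(np.float32)  # NHWC layout in memory
images_array = np.rollaxis(images_array, 3, 1)  # NCHW view of the same memory, no copy
images_tensor = torch.from_numpy(images_array)

# The shape says NCHW, but the strides still describe the NHWC memory order:
# the channel dimension has stride 1.
original_strides = images_tensor.stride()  # (150528, 1, 672, 3)

# .contiguous() copies the data into a dense NCHW layout.
packed = images_tensor.contiguous()
packed_strides = packed.stride()  # (150528, 50176, 224, 1)

print(tuple(images_tensor.shape), images_tensor.is_contiguous(), original_strides)
print(tuple(packed.shape), packed.is_contiguous(), packed_strides)
```

Both tensors hold the same values; only the memory order differs, which is what the network sees when it picks its kernels.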

albanD · July 10, 2020, 5:42pm

Thanks for the details!

cc @VitalyFedyunin who worked on this.

VitalyFedyunin · July 15, 2020, 12:27am

Thanks @arnowaczynski for the accurate description of the case. Indeed, you can end up with channels-last inputs on hardware that is not optimized for them.

The only thing I can add here is that images_tensor = images_tensor.to(device="cuda", memory_format=torch.contiguous_format) will work faster than images_tensor.contiguous().
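As a rough sketch of the two variants mentioned here, the snippet below checks that the rolled array is seen as channels-last and that both conversions give the same dense result. The fallback to CPU when no GPU is present is an assumption so that the snippet runs anywhere, and the names `two_step` and `one_step` are illustrative. It only compares results and does not measure speed.

```python
import numpy as np
import torch

# Fall back to CPU if no GPU is present (an assumption for portability).
device = "cuda" if torch.cuda.is_available() else "cpu"

images_array = np.random.randn(4, 224, 224, 3).astype(np.float32)
images_array = np.rollaxis(images_array, 3, 1)  # NCHW view over NHWC memory
images_tensor = torch.from_numpy(images_array)

# The strided view matches PyTorch's channels-last memory format.
is_channels_last = images_tensor.is_contiguous(memory_format=torch.channels_last)

# Variant 1: move to the device first, then make the tensor contiguous.
two_step = images_tensor.to(device=device).contiguous()

# Variant 2: request the contiguous layout as part of the same .to() call.
one_step = images_tensor.to(device=device, memory_format=torch.contiguous_format)

print(is_channels_last, two_step.is_contiguous(), one_step.is_contiguous())
```

Either result can then be passed to the network in place of the original strided tensor.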

FelixPetersen · June 15, 2021, 6:49am

Hi,
I also observed a slowdown. For me, it was a factor of 2.5 on CPU on an Intel Mac.
In my case, it was for an application based on matrix multiplication.
(Just as additional input for whomever it may concern.)