PyTorch 1.2 CUDA 10.0 vs. PyTorch 1.9 CUDA 11.1 significant slowdown

Hello,
I recently upgraded PyTorch from 1.2 (CUDA 10.0) to the latest 1.9 (CUDA 11.1).

However, using the same RTX 2080 Ti I observe a massive slowdown in inference (the same code is used). Using the PyTorch autograd profiler I see:

PyTorch 1.2, CUDA 10.0
Name          Self CPU %  Self CPU   CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg
convolution   0.74%       386.450us  8.93%        4.690ms    72.161us      9.91%         9.902ms     152.342us
_convolution  1.57%       824.964us  8.20%        4.304ms    66.216us      9.60%         9.592ms     147.563us
conv2d        0.72%       377.362us  8.82%        4.632ms    75.928us      9.10%         9.086ms     148.953us

PyTorch 1.9, CUDA 11.1
Name          Self CPU %  Self CPU   CPU total %  CPU total  CPU time avg  Self CUDA  Self CUDA %  CUDA total  CUDA time avg
convolution   1.33%       850.084us  17.73%       11.327ms   174.265us     270.592us  0.34%        26.337ms    405.189us
_convolution  2.03%       1.295ms    16.40%       10.477ms   161.187us     321.695us  0.40%        26.067ms    401.026us
conv2d        1.22%       781.662us  17.38%       11.105ms   182.043us     247.232us  0.31%        25.043ms    410.548us

As can be clearly seen, there is almost a 3x slowdown in CUDA total time (roughly 9.9ms vs. 26.3ms) with the new CUDA version.

Looking for suggestions and answers,

Thank you,
Alex

Can you share some more details of the model, or a reproducible snippet that isolates the performance difference? It might also be useful to check an nvprof run to see if the same CUDA kernels are being used in 1.2 and 1.9.
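
On the 1.9 side, the torch.profiler API that ships with 1.9 can also list the actual CUDA kernels behind each op. Here is a minimal sketch with a stand-in model and input shape (placeholders; substitute the real ones):

import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Stand-in model and input; replace with the real ones.
model = nn.Conv2d(3, 64, kernel_size=3).cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")

# Warm-up so one-time setup (context init, cuDNN algorithm selection)
# does not pollute the measurement.
for _ in range(3):
    model(x)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)

# With CUDA activity enabled, the table includes rows for the individual
# CUDA kernels, which can be compared against an nvprof trace from 1.2.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))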

Hi eqy,
Thank you for the reply.
I managed to narrow the problem down and found that the main slowdown comes from the ConvTranspose2d layer.
I reproduced the problem with the following simple network:
import torch
import torch.nn as nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=0)
        self.uplayer = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2, padding=0, bias=False)

    def forward(self, x):
        x = self.conv(x)
        logits = self.uplayer(x)
        return logits

input_tensor = torch.randn(1, 3, 1520, 256, requires_grad=False)
input_tensor = input_tensor.cuda()

model = NeuralNetwork().cuda()
# Profile several iterations; the later ones are free of one-time setup costs
# such as cuDNN algorithm selection, so look at the last table printed.
for _ in range(5):
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        model(input_tensor)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
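
For reference, the exact CUDA/cuDNN build of each environment can be confirmed like this (these attributes exist in both versions):

import torch
print(torch.__version__)               # e.g. 1.2.0 vs. 1.9.0
print(torch.version.cuda)              # CUDA toolkit PyTorch was built against
print(torch.backends.cudnn.version())  # cuDNN version, e.g. 7xxx vs. 8xxx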

1. Running on PyTorch 1.2, CUDA 10.0 gives the following profiling results:
Name                         Self CPU %  Self CPU   CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  # of Calls
convolution                  7.43%       27.945us   92.53%       348.177us  174.089us     25.18%        1.679ms     839.296us      2
_convolution                 15.89%      59.809us   85.10%       320.232us  160.116us     24.96%        1.664ms     832.048us      2
conv_transpose2d             3.54%       13.329us   39.45%       148.457us  148.457us     19.99%        1.332ms     1.332ms        1
cudnn_convolution_transpose  23.62%      88.879us   23.62%       88.879us   88.879us      19.74%        1.316ms     1.316ms        1
conv2d                       3.93%       14.784us   60.55%       227.833us  227.833us     5.42%         361.120us   361.120us      1
cudnn_convolution            41.87%      157.546us  41.87%       157.546us  157.546us     4.58%         305.568us   305.568us      1
contiguous                   3.72%       13.998us   3.72%        13.998us   6.999us       0.14%         9.568us     4.784us        2

2. Running on PyTorch 1.9, CUDA 11.1 gives the following profiling results:
Name                               Self CPU %  Self CPU   CPU total %  CPU total  CPU time avg  Self CUDA  Self CUDA %  CUDA total  CUDA time avg  # of Calls
aten::convolution                  0.35%       13.937us   99.62%       4.016ms    2.008ms       9.824us    0.04%        22.325ms    11.162ms       2
aten::_convolution                 0.83%       33.469us   99.27%       4.002ms    2.001ms       15.232us   0.07%        22.315ms    11.157ms       2
aten::conv_transpose2d             0.16%       6.562us    96.00%       3.870ms    3.870ms       3.840us    0.02%        22.044ms    22.044ms       1
aten::cudnn_convolution_transpose  95.22%      3.839ms    95.30%       3.842ms    3.842ms       22.034ms   98.65%       22.034ms    22.034ms       1
aten::conv2d                       0.22%       8.838us    4.00%        161.430us  161.430us     5.984us    0.03%        290.208us   290.208us      1
aten::cudnn_convolution            2.34%       94.312us   2.45%        98.887us   98.887us      167.872us  0.75%        167.872us   167.872us      1
aten::add_                         0.44%       17.545us   0.44%        17.545us   17.545us      96.256us   0.43%        96.256us    96.256us       1
aten::reshape                      0.11%       4.483us    0.26%        10.396us   10.396us      1.952us    0.01%        1.952us     1.952us        1
aten::resize_                      0.08%       3.043us    0.08%        3.043us    0.761us       0.000us    0.00%        0.000us     0.000us        4
aten::empty                        0.12%       4.716us    0.12%        4.716us    2.358us       0.000us    0.00%        0.000us     0.000us        2

It can be clearly seen that the same transposed convolution takes more than 10 times longer with the latest version of PyTorch (cudnn_convolution_transpose: 1.316ms vs. 22.034ms of CUDA time).
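
One thing still worth trying on this repro is cuDNN autotuning, which benchmarks the available convolution algorithms per input shape; a minimal sketch (I have not verified that it recovers the lost performance here):

import torch

# cuDNN autotuning: on the first call for each input shape, cuDNN times the
# available convolution algorithms and caches the fastest one. The first
# iteration gets slower, later ones may speed up; it helps most when input
# shapes are fixed, as in the repro above.
torch.backends.cudnn.benchmark = True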

Thank you,
Alex.