PyTorch 1.7 slower on CUDA 11.0 than CUDA 10.2?


I am experiencing slower distributed training with the new PyTorch 1.7 built with CUDA 11.0 than with the CUDA 10.2 build. Has anyone benchmarked this yet? I use the same script in two environments, one with CUDA 11.0 and the other with CUDA 10.2. The same script that takes 21 hours for one epoch on CUDA 10.2 takes 24 hours on CUDA 11.0.

I guess the potential slowdown is not coming from distributed training (and thus NCCL), nor from CUDA 11 itself, but might be coming from e.g. cudnn (which also depends on the device you are using).

Are you only seeing the slowdown using DDP or also using a single device? The latter case would point towards my assumption.

Could you give more information about your setup (GPU, model architecture) and also profile the training on a single device?
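For the single-device profiling part, something like the autograd profiler can narrow down which operators regressed. A minimal sketch (the tensor shapes here are arbitrary placeholders, not taken from the report above; substitute your model's forward pass):

```python
import torch
import torch.nn.functional as F
from torch.autograd import profiler

# Hypothetical stand-in workload -- replace with your actual model's
# forward pass. The shapes below are arbitrary illustrations.
x = torch.randn(8, 32, 64, 64)
w = torch.randn(32, 32, 3, 3)
use_cuda = torch.cuda.is_available()
if use_cuda:
    x, w = x.cuda(), w.cuda()

with profiler.profile(use_cuda=use_cuda) as prof:
    for _ in range(10):
        F.conv2d(x, w)

# Per-operator totals; comparing this table between the CUDA 10.2 and
# CUDA 11.0 environments should point at the slow kernels directly.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```

Running this in both environments and diffing the top rows would show whether the regression is concentrated in the conv kernels.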


I’m having a similar issue with PyTorch 1.7 w/ CUDA 11.0 compared to CUDA 10.1. I’m using an RTX 2080 Ti GPU.

A simple example that demonstrates this (a single conv2d):

import torch
import torch.nn.functional as F

x = torch.randn(10, 64, 128, 128).cuda()
w = torch.randn(64, 64, 5, 5).cuda()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

# warmup
y = []
for _ in range(10):
    y.append(F.conv2d(torch.randn_like(x), w))

# measure
y = []
start.record()
for _ in range(10):
    y.append(F.conv2d(torch.randn_like(x), w))
end.record()
torch.cuda.synchronize()  # wait for the kernels to finish before reading the timer
print('time = %.2f' % (start.elapsed_time(end),))


# pytorch 1.7 w/ cuda 10.1
# time = 21.05 +/- 0.05
# pytorch 1.7 w/ cuda 11.0
# time = 25.40 +/- 0.05
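One thing worth trying (this is a guess, not confirmed for this repro): the CUDA 11.0 binaries ship cudnn 8, whose default algorithm-selection heuristic can pick different conv kernels than cudnn 7. Enabling benchmark mode makes cudnn time every available algorithm for each input shape and cache the fastest:

```python
import torch

# Benchmark mode: cudnn profiles all available conv algorithms the first
# time it sees a given input shape and reuses the fastest one afterwards.
# This only helps when input shapes are static; varying shapes re-trigger
# the (expensive) search each time.
torch.backends.cudnn.benchmark = True
```

If the gap disappears with this flag set, the regression is in cudnn's heuristic choice rather than in the kernels themselves.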

Could you update to PyTorch 1.7.1 with CUDA 11.0 and cudnn 8.0.5 and recheck the performance, please?
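To confirm which versions a given environment actually uses before re-benchmarking, the runtime build info is a quick sanity check:

```python
import torch

# torch.version.cuda is the CUDA toolkit the binary was compiled against,
# and cudnn.version() returns an encoded number, e.g. 8005 for cudnn 8.0.5.
# Both can be None on a CPU-only build.
print('PyTorch:', torch.__version__)
print('CUDA:   ', torch.version.cuda)
print('cudnn:  ', torch.backends.cudnn.version())
```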

Thanks for the quick response!
I used 1.7.1 in my tests. Sorry for not providing the patch version in my previous comment.
I’ve uploaded my collect-env logs to the relevant GitHub issue:

Interesting. Is the problem still present with CUDA 11.2? And have others been able to reproduce it?