I am experiencing slower distributed training with the new PyTorch 1.7 built with CUDA 11.0 compared to CUDA 10.2. Has anyone benchmarked this yet? I run the same script in two different environments, one with CUDA 11.0 and the other with CUDA 10.2. The same script that takes 21 hours per epoch on CUDA 10.2 takes 24 hours on CUDA 11.0.
My guess is that the potential slowdown is not coming from distributed training (and thus NCCL) nor from CUDA 11 itself, but might be coming from e.g. cudnn (which also depends on the device you are using).
Are you seeing the slowdown only with DDP, or also on a single device? The latter case would point towards my assumption.
Could you give more information about your setup (GPU, model architecture) and also profile the training on a single device?
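For the single-device profiling, something like the built-in autograd profiler should be enough to see where the time goes. A minimal sketch (the shapes here are placeholders, not your actual model; it falls back to CPU if no GPU is visible):

```python
import torch
import torch.nn.functional as F

# Profile a few conv2d calls on a single device and print the top ops.
use_cuda = torch.cuda.is_available()
device = 'cuda' if use_cuda else 'cpu'
x = torch.randn(10, 64, 128, 128, device=device)
w = torch.randn(64, 64, 5, 5, device=device)

with torch.autograd.profiler.profile(use_cuda=use_cuda) as prof:
    for _ in range(10):
        F.conv2d(x, w)

# Sort by self CPU time so the table also works without a GPU
print(prof.key_averages().table(sort_by='self_cpu_time_total', row_limit=10))
```

Running this in both environments and comparing the per-kernel times should show whether a specific cudnn kernel got slower.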
I’m having a similar issue with PyTorch 1.7 w/ CUDA 11.0 compared to CUDA 10.1. I’m using a 2080 Ti as the GPU.
Simple example that demonstrates this (only conv2d):
import torch
import torch.nn.functional as F

x = torch.randn(10, 64, 128, 128).cuda()
w = torch.randn(64, 64, 5, 5).cuda()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

# warmup
y = []
for _ in range(10):
    y.append(F.conv2d(torch.randn_like(x), w))

# measure (elapsed_time reports milliseconds)
start.record()
y = []
for _ in range(10):
    y.append(F.conv2d(torch.randn_like(x), w))
end.record()
torch.cuda.synchronize()
print('time = %.2f ms' % start.elapsed_time(end))
Results:
# pytorch 1.7 w/ cuda 10.1
# time = 21.05 +/- 0.05
# pytorch 1.7 w/ cuda 11.0
# time = 25.40 +/- 0.05
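One thing worth ruling out is cudnn's default algorithm choice: if the CUDA 11 build picks a slower conv algorithm by default, enabling `torch.backends.cudnn.benchmark = True` lets cudnn autotune and may close the gap. A sketch re-timing the same shapes with benchmark mode on (this is a diagnostic guess, not a confirmed cause; it only runs the timed part if a GPU is present):

```python
import torch
import torch.nn.functional as F

# Let cudnn autotune the conv algorithm for these fixed shapes, then re-time.
torch.backends.cudnn.benchmark = True

if torch.cuda.is_available():
    x = torch.randn(10, 64, 128, 128, device='cuda')
    w = torch.randn(64, 64, 5, 5, device='cuda')
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    # warmup also triggers the autotuning pass
    for _ in range(10):
        F.conv2d(x, w)
    torch.cuda.synchronize()
    start.record()
    for _ in range(10):
        F.conv2d(x, w)
    end.record()
    torch.cuda.synchronize()
    print('time = %.2f ms' % start.elapsed_time(end))
```

If benchmark mode brings the CUDA 11.0 build back in line with 10.1, that would point to a cudnn heuristic regression rather than a general CUDA 11 slowdown. Note that benchmark mode only helps when input shapes are constant across iterations.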