I'm having trouble getting DistributedDataParallel to perform well: 2 GPUs on the same host reach only ~85-90% of linear scaling, and it gets worse as GPUs or hosts are added. From Slack, it seems other users are able to get much closer to 99% of linear with small numbers of nodes/GPUs.
I'm seeing this 85-90% scaling behavior both on the (shared) work cluster and on a 2 GPU system I have at home. I haven't tested the full cross product, but I've seen the same behavior on Ubuntu 14.04 and 18.04; CUDA 9.1, 10.0, and 10.2; stock PyTorch 1.4 DDP and NVIDIA Apex DDP; ResNet-50, ResNet-152, and some toy models. All runs used fake data from torchvision, with batch sizes large enough to use up the majority of GPU RAM.
The training script is here (with light edits to remove comments, etc.): https://gist.github.com/elistevens/7edacdafdb45747a22da2ef0c6ce1af3
OMP_NUM_THREADS=4 EPOCHS=2 EPOCH_SIZE=3840 BATCH_SIZE=64 NODES=1 GPUS=2 ~/v/bin/python min_ddp.py
etc.
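For context, the script follows the usual multiprocess DDP pattern. Here is a stripped-down sketch of that pattern (this is illustrative only, not the gist itself: it uses the gloo backend so it runs without GPUs, and all names are mine, while the real runs use nccl and one process per GPU):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # gloo so the sketch runs on CPU; the actual training uses nccl
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = nn.Linear(16, 4)
    ddp = nn.parallel.DistributedDataParallel(model)
    opt = torch.optim.Adam(ddp.parameters())
    for _ in range(3):
        loss = ddp(torch.randn(8, 16)).sum()
        opt.zero_grad()
        loss.backward()  # DDP allreduces gradients here, overlapped with backward
        opt.step()       # every rank applies the same averaged gradients
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```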
The numbers here are from my 18.04 home system with 2x 1080 Tis. There's roughly a three-second slowdown for the 2 GPU case, with training going from 22 seconds (1 GPU, 1 epoch) to 25 seconds (2 GPUs, 2 epochs). About a second and a half of that is {method 'acquire' of '_thread.lock' objects}, and the rest seems to be the mul_, add_, etc. methods of torch._C._TensorBase objects.
Is this expected? Am I missing something that would cause performance to be poor like this?
Thanks for any help. More detailed data is below.
1 GPU
308413 function calls (297131 primitive calls) in 22.053 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
60 5.305 0.088 5.305 0.088 {method 'run_backward' of 'torch._C._EngineBase' objects}
19320 3.509 0.000 3.509 0.000 {method 'mul_' of 'torch._C._TensorBase' objects}
19320 3.372 0.000 3.372 0.000 {method 'add_' of 'torch._C._TensorBase' objects}
9660 2.298 0.000 2.298 0.000 {method 'addcdiv_' of 'torch._C._TensorBase' objects}
9660 2.124 0.000 2.124 0.000 {method 'sqrt' of 'torch._C._TensorBase' objects}
60 1.741 0.029 14.598 0.243 /home/elis/v/lib/python3.6/site-packages/torch/optim/adam.py:49(step)
9660 1.499 0.000 1.499 0.000 {method 'addcmul_' of 'torch._C._TensorBase' objects}
224 0.671 0.003 0.671 0.003 {method 'acquire' of '_thread.lock' objects}
120 0.548 0.005 0.548 0.005 {method 'to' of 'torch._C._TensorBase' objects}
3180 0.141 0.000 0.141 0.000 {built-in method conv2d}
...
2 GPUs
312342 function calls (301058 primitive calls) in 25.171 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
60 5.355 0.089 5.355 0.089 {method 'run_backward' of 'torch._C._EngineBase' objects}
19320 4.015 0.000 4.015 0.000 {method 'mul_' of 'torch._C._TensorBase' objects}
19320 3.668 0.000 3.668 0.000 {method 'add_' of 'torch._C._TensorBase' objects}
9660 2.407 0.000 2.407 0.000 {method 'sqrt' of 'torch._C._TensorBase' objects}
9660 2.339 0.000 2.339 0.000 {method 'addcdiv_' of 'torch._C._TensorBase' objects}
264 2.089 0.008 2.089 0.008 {method 'acquire' of '_thread.lock' objects}
60 1.800 0.030 15.833 0.264 /home/elis/v/lib/python3.6/site-packages/torch/optim/adam.py:49(step)
9660 1.566 0.000 1.566 0.000 {method 'addcmul_' of 'torch._C._TensorBase' objects}
120 0.561 0.005 0.561 0.005 {method 'to' of 'torch._C._TensorBase' objects}
105 0.275 0.003 0.275 0.003 {built-in method posix.waitpid}
3180 0.252 0.000 0.252 0.000 {built-in method conv2d}
...
Per-function tottime delta (2 GPUs minus 1 GPU):
1.418, {method 'acquire' of '_thread.lock' objects}
0.506, {method 'mul_' of 'torch._C._TensorBase' objects}
0.296, {method 'add_' of 'torch._C._TensorBase' objects}
0.283, {method 'sqrt' of 'torch._C._TensorBase' objects}
0.184, {built-in method posix.waitpid}
0.111, {built-in method conv2d}
0.067, {method 'addcmul_' of 'torch._C._TensorBase' objects}
0.059, /home/elis/v/lib/python3.6/site-packages/torch/optim/adam.py:49(step)
0.05, {method 'run_backward' of 'torch._C._EngineBase' objects}
0.049, {built-in method _posixsubprocess.fork_exec}
0.041, {method 'addcdiv_' of 'torch._C._TensorBase' objects}
0.037, {built-in method relu_}
0.023, {built-in method batch_norm}
0.015, {built-in method max_pool2d}
0.013, {method 'to' of 'torch._C._TensorBase' objects}
0.008, {built-in method torch.distributed._broadcast_coalesced}