I was a bit confused about how DDP (with NCCL) reduces gradients across processes and what effect this has on the learning rate that needs to be set.
Would the example below be a correct way to interpret this → that DDP and DP should use the same learning rate when scaled out to the same effective batch size?
Assume the dataset contains 80 samples
Single-gpu LR = 0.1
Total-grad-distance = LR * g * (samples/batch-size)
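As a sanity check on the arithmetic, the formula can be written as a small (hypothetical) Python helper, taking the per-batch gradient magnitude g as 1.0 for concreteness:

```python
# Hypothetical helper expressing the total-grad-distance formula above.
# g is the mean gradient per optimizer step, taken here as 1.0.
def total_grad_distance(lr, g, samples, batch_size):
    steps = samples / batch_size  # optimizer steps per epoch
    return lr * g * steps

g = 1.0
print(total_grad_distance(0.1, g, 80, 8))   # single GPU, batch 8  -> 1.0 (i.e. g)
print(total_grad_distance(0.1, g, 80, 16))  # DP, global batch 16 -> 0.5 (i.e. 0.5g)
```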
-
Single-gpu
batch = 8
gradient = 8g/8 = g
total-grad-distance = 0.1 * g * 10 = g
-
DP (2-gpu, 1 node)
batch = 16
gradient = 16g/16 = g
total-grad-distance = 0.1 * g * 5 = 0.5g
→ thus scale LR by 2
-
DDP (2-gpu, 1 node OR 1-gpu, 2 nodes)
batch-per-process = 8
gradient = ((8g/8) + (8g/8)) / 2 = g
total-grad-distance = 0.1 * g * 5 = 0.5g
→ thus scale LR by 2?
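The averaging interpretation above can be checked numerically with plain Python (no torch needed); g is again taken as 1.0:

```python
# Simulating the "allreduce averages" hypothesis with plain numbers.
g = 1.0
lr = 0.1
world_size = 2
local_grads = [g, g]                      # each rank's mean gradient over its batch of 8
avg_grad = sum(local_grads) / world_size  # allreduce-sum, then divide by world size
steps = 80 // (8 * world_size)            # 5 optimizer steps per epoch
total_distance = lr * avg_grad * steps
print(total_distance)                     # 0.5, i.e. 0.5g -> half the single-GPU distance
```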
Or does allreduce just sum the gradients — i.e., DDP uses ProcessGroup::allreduce() to sum gradients — in which case:
-
DDP (2-gpu, 1 node OR 1-gpu, 2 nodes)
batch-per-process = 8
gradient = (8g/8) + (8g/8) = 2g
total-grad-distance = 0.1 * 2g * 5 = g
→ thus leave LR the same as single-GPU
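The two interpretations can be put side by side in a small plain-Python sketch (g = 1.0) to see which one reproduces the single-GPU total-grad-distance:

```python
# Side-by-side check of the two allreduce hypotheses for 2 processes.
g, lr, world_size = 1.0, 0.1, 2
steps = 80 // (8 * world_size)           # 5 steps per epoch with batch 8 per process

grad_if_averaged = (g + g) / world_size  # allreduce sums, then divide by world size
grad_if_summed = g + g                   # allreduce sums only

print(lr * grad_if_averaged * steps)     # 0.5 -> would need LR scaled by 2
print(lr * grad_if_summed * steps)       # 1.0 -> matches single-GPU, LR unchanged
```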