# DDP Learning-Rate

I was a bit confused about how DDP (with the NCCL backend) reduces gradients and the effect this has on the learning rate that needs to be set.

Would the example below be a correct way to interpret this, i.e. that DDP and DP should use the same learning rate when scaled out to the same effective batch size?

Assume the dataset contains 80 samples
Single-GPU LR = 0.1
Total-grad-distance = LR * g * (samples / batch-size)

1. Single-gpu
batch = 8
total-grad-distance = 0.1 * g * 10 = g

2. DP (2-gpu, 1 node)
batch = 16
total-grad-distance = 0.1 * g * 5 = 0.5g
-> thus scale LR by 2

3. DDP (2-gpu, 1 node OR 1-gpu, 2 nodes)
batch-per-process = 8
gradient = ((8g/8) + (8g/8)) / 2 = g
total-grad-distance = 0.1 * g * 5 = 0.5g
-> thus scale LR by 2?
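Under the averaging assumption, the arithmetic in the three scenarios above can be sanity-checked with a few lines of plain Python (g is treated as 1.0; the helper name `total_grad_distance` is just for illustration):

```python
# Total-grad-distance = LR * g * (samples / batch-size), with g treated as 1.0.
SAMPLES = 80
g = 1.0

def total_grad_distance(lr, effective_batch):
    steps = SAMPLES // effective_batch  # optimizer steps per epoch
    return lr * g * steps

single = total_grad_distance(0.1, 8)   # 10 steps
dp = total_grad_distance(0.1, 16)      # 5 steps, gradients averaged
ddp = total_grad_distance(0.1, 16)     # same as DP if allreduce averages
```

With averaging, DP and DDP both cover half the distance of the single-GPU run, which is where the "scale LR by 2" step comes from.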

Or does allreduce just sum the gradients (the DDP design notes mention using `ProcessGroup::allreduce()` to sum gradients), in which case:

1. DDP (2-gpu, 1 node OR 1-gpu, 2 nodes)
batch-per-process = 8
gradient = (8g/8) + (8g/8) = 2g
total-grad-distance = 0.1 * 2g * 5 = g
-> thus leave LR the same as single-GPU
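The two readings differ only in whether the reduced gradient gets divided by the world size; a minimal sketch of that distinction (again treating g as 1.0, names illustrative):

```python
g = 1.0
per_rank = [g, g]  # each rank's mean gradient over its 8 local samples
world_size = len(per_rank)

summed = sum(per_rank)                 # 2g: keep the single-GPU LR
averaged = sum(per_rank) / world_size  # g:  scale the LR by world_size
```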

If you maintain the same batch size between single-GPU and DP/DDP, then according to your calculations you do not need to adjust the LR?

During the backward pass, gradients from each node are averaged.
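This averaging is what makes DDP with k processes match a single process running the same effective batch: averaging each rank's mean gradient over its shard gives the mean gradient over the whole batch (assuming equal-size shards). A toy check with a quadratic loss, all names illustrative:

```python
# Toy per-sample loss: loss_i(w) = 0.5 * (w - x_i)^2, so grad_i(w) = w - x_i.
w = 0.0
batch = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
shards = [batch[:4], batch[4:]]  # what 2 DDP ranks would each see

def mean_grad(samples):
    return sum(w - x for x in samples) / len(samples)

# Average of per-rank mean gradients == mean gradient over the full batch.
ddp_grad = sum(mean_grad(s) for s in shards) / len(shards)
full_grad = mean_grad(batch)
```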

Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.
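In code, the Linear Scaling Rule is just a proportional bump of the base LR (a hypothetical helper for illustration, not a PyTorch API):

```python
def scaled_lr(base_lr, base_batch, new_batch):
    # Multiply the LR by k when the minibatch size is multiplied by k.
    return base_lr * (new_batch / base_batch)

# e.g. going from batch 8 on one GPU to batch 16 across two DDP ranks:
# scaled_lr(0.1, 8, 16) -> 0.2
```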


More discussion can be found in the thread "Should we split batch_size according to ngpu_per_node when DistributedDataparallel".
