DDP Learning-Rate

I was a bit confused about how DDP (with NCCL) reduces gradients and about the effect this has on the learning rate that needs to be set.

Would the example below be a correct way to interpret this → that DDP and DP should use the same learning rate if scaled out to the same effective batch size?

Assume the dataset contains 80 samples
Single-gpu LR = 0.1
Total-grad-distance = LR * g * (samples/batch-size)

  1. Single-gpu
    batch = 8
    gradient = 8g/8 = g
    total-grad-distance = 0.1 * g * 10 = g

  2. DP (2-gpu, 1 node)
    batch = 16
    gradient = 16g/16 = g
    total-grad-distance = 0.1 * g * 5 = 0.5g
    → thus scale LR by 2

  3. DDP (2-gpu, 1 node OR 1-gpu, 2 nodes)
    batch-per-process = 8
    gradient = ((8g/8) + (8g/8)) / 2 = g
    total-grad-distance = 0.1 * g * 5 = 0.5g
    → thus scale LR by 2?

Or does allreduce just sum the gradients (the internals docs mention using ProcessGroup::allreduce() to sum gradients), in which case:

  1. DDP (2-gpu, 1 node OR 1-gpu, 2 nodes)
    batch-per-process = 8
    gradient = (8g/8) + (8g/8) = 2g
    total-grad-distance = 0.1 * 2g * 5 = g
    → thus leave LR the same as single-GPU
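
Writing the two interpretations out as a quick sketch (plain arithmetic, with g treated as a unit coefficient and the variable names my own):

```python
# Same numbers as above: 80 samples, LR 0.1, batch of 8 per process, 2 processes.
samples, lr, per_proc_batch, world_size = 80, 0.1, 8, 2

steps_single = samples / per_proc_batch              # 10 optimizer steps
steps_ddp = samples / (per_proc_batch * world_size)  # 5 optimizer steps

grad_if_averaged = (1 + 1) / world_size  # all-reduce then divide -> 1 * g
grad_if_summed = 1 + 1                   # plain sum              -> 2 * g

print(lr * 1 * steps_single)              # single GPU:      1.0 * g
print(lr * grad_if_averaged * steps_ddp)  # DDP if averaged: 0.5 * g
print(lr * grad_if_summed * steps_ddp)    # DDP if summed:   1.0 * g
```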

If you maintain the same batch size between single GPU and DP/DDP, then according to your calculations you do not need to adjust the LR?

PS:
In DDP grads are averaged (see DistributedDataParallel — PyTorch master documentation):

During the backwards pass, gradients from each node are averaged.
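
Here is a minimal toy sketch that shows the averaging (assumptions of mine: two CPU processes on the gloo backend, a one-weight linear model, and 127.0.0.1:29500 as a free rendezvous address). Each rank's local gradient would be 8 and 16, but after backward() both ranks hold the average, 12:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # Assumption: 127.0.0.1:29500 is a free rendezvous address on this machine.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # One-weight linear model, identical on both ranks (DDP also broadcasts
    # rank 0's parameters at construction time).
    model = torch.nn.Linear(1, 1, bias=False)
    with torch.no_grad():
        model.weight.fill_(1.0)
    ddp = DDP(model)

    # Give each rank a different local batch of 8 samples.
    x = torch.full((8, 1), float(rank + 1))
    loss = ddp(x).sum()
    loss.backward()

    # d(sum(w * x))/dw = sum(x): locally 8.0 on rank 0 and 16.0 on rank 1.
    # After DDP's all-reduce the stored gradient is the average: 12.0 on both.
    print(f"rank {rank}: grad = {ddp.module.weight.grad.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```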

PPS: https://arxiv.org/pdf/1706.02677.pdf

Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.
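
In code the rule is just a multiplication when constructing the optimizer; a sketch (the model and base_lr are placeholders, and it assumes init_process_group() has already been called):

```python
import torch
import torch.distributed as dist

model = torch.nn.Linear(10, 1)  # placeholder model
base_lr = 0.1                   # LR tuned for the single-process batch size

# With k processes the effective batch size is k times larger, so the
# linear scaling rule multiplies the LR by k (the paper pairs this with
# a gradual warmup phase).
k = dist.get_world_size()
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * k)
```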


More discussion can be found at "Should we split batch_size according to ngpu_per_node when DistributedDataparallel".
