I was a bit confused about how DDP (with NCCL) reduces gradients across processes and what effect this has on the learning rate that needs to be set.
Would the example below be a correct way to interpret this → that DDP and DP should use the same learning rate when scaled out to the same effective batch size?
Assume the dataset contains 80 samples
Single-gpu LR = 0.1
Total-grad-distance = LR * g * (samples/batch-size)
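As a sanity check on the arithmetic, the formula can be written as a small (hypothetical) Python helper, taking the per-batch gradient magnitude g as 1.0 for concreteness:

```python
# Hypothetical helper expressing the total-grad-distance formula above.
# g is the mean gradient per optimizer step, taken here as 1.0.
def total_grad_distance(lr, g, samples, batch_size):
    steps = samples / batch_size  # optimizer steps per epoch
    return lr * g * steps

g = 1.0
print(total_grad_distance(0.1, g, 80, 8))   # single GPU, batch 8  -> 1.0 (i.e. g)
print(total_grad_distance(0.1, g, 80, 16))  # DP, global batch 16 -> 0.5 (i.e. 0.5g)
```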
-
Single-gpu
batch = 8
gradient = 8g/8 = g
total-grad-distance = 0.1 * g * 10 = g
-
DP (2-gpu, 1 node)
batch = 16
gradient = 16g/16 = g
total-grad-distance = 0.1 * g * 5 = 0.5g
→ thus scale LR by 2
-
DDP (2-gpu, 1 node OR 1-gpu, 2 nodes)
batch-per-process = 8
gradient = ((8g/8) + (8g/8)) / 2 = g
total-grad-distance = 0.1 * g * 5 = 0.5g
→ thus scale LR by 2?
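The averaging interpretation above can be checked numerically with plain Python (no torch needed); g is again taken as 1.0:

```python
# Simulating the "allreduce averages" hypothesis with plain numbers.
g = 1.0
lr = 0.1
world_size = 2
local_grads = [g, g]                      # each rank's mean gradient over its batch of 8
avg_grad = sum(local_grads) / world_size  # allreduce-sum, then divide by world size
steps = 80 // (8 * world_size)            # 5 optimizer steps per epoch
total_distance = lr * avg_grad * steps
print(total_distance)                     # 0.5, i.e. 0.5g -> half the single-GPU distance
```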
Or does allreduce just sum the gradients — i.e., DDP uses ProcessGroup::allreduce() to sum gradients — in which case:
-
DDP (2-gpu, 1 node OR 1-gpu, 2 nodes)
batch-per-process = 8
gradient = (8g/8) + (8g/8) = 2g
total-grad-distance = 0.1 * 2g * 5 = g
→ thus leave LR the same as single-GPU
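The two interpretations can be put side by side in a small plain-Python sketch (g = 1.0) to see which one reproduces the single-GPU total-grad-distance:

```python
# Side-by-side check of the two allreduce hypotheses for 2 processes.
g, lr, world_size = 1.0, 0.1, 2
steps = 80 // (8 * world_size)           # 5 steps per epoch with batch 8 per process

grad_if_averaged = (g + g) / world_size  # allreduce sums, then divide by world size
grad_if_summed = g + g                   # allreduce sums only

print(lr * grad_if_averaged * steps)     # 0.5 -> would need LR scaled by 2
print(lr * grad_if_summed * steps)       # 1.0 -> matches single-GPU, LR unchanged
```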