I was a bit confused about how DDP (with NCCL) reduces gradients and what effect this has on the learning rate that needs to be set.
Would the example below be a correct way to interpret this > that DDP and DP should use the same learning rate once scaled out to the same effective batch size?
Assume the dataset contains 80 samples
Single-GPU LR = 0.1
total grad distance per epoch = LR * g * (samples / batch size)
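The heuristic above is easy to sanity-check in code (a minimal sketch; total_grad_distance is my own helper name, and the per-sample gradient is treated as a constant scalar g = 1):

```python
def total_grad_distance(lr, grad, samples, batch_size):
    """LR * g * (optimizer steps per epoch), with steps = samples / batch_size."""
    return lr * grad * (samples / batch_size)

# Single GPU: 80 samples, batch 8 -> 10 steps of size 0.1 * g
print(total_grad_distance(0.1, 1.0, 80, 8))   # 1.0 (i.e. g)

# Doubled batch, same LR: only 5 steps -> half the distance per epoch
print(total_grad_distance(0.1, 1.0, 80, 16))  # 0.5
```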

Single GPU
batch = 8
gradient = 8g/8 = g
total grad distance = 0.1 * g * 10 = g
DP (2 GPUs, 1 node)
batch = 16
gradient = 16g/16 = g
total grad distance = 0.1 * g * 5 = 0.5g
> thus scale LR by 2 
DDP (2 GPUs on 1 node, OR 1 GPU on each of 2 nodes)
batch per process = 8
gradient = ((8g/8) + (8g/8)) / 2 = g
total grad distance = 0.1 * g * 5 = 0.5g
> thus scale LR by 2?
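Whether that division by 2 actually happens is the crux. The averaging interpretation can be mimicked with plain numbers (a simulation only, not real torch.distributed code; the two list entries stand in for the two ranks, with per-sample gradient g = 1):

```python
world_size = 2
per_rank_batch = 8

# Each rank computes the mean gradient over its local batch of 8 samples (g = 1)
local_grads = [sum([1.0] * per_rank_batch) / per_rank_batch
               for _ in range(world_size)]

# allreduce-style sum, then divide by world size -> averaged gradient
avg_grad = sum(local_grads) / world_size
print(avg_grad)  # 1.0: same per-step magnitude as single GPU, but only 5 steps/epoch
```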
Or does allreduce just sum the gradients (the docs mention that DDP uses ProcessGroup::allreduce() to sum gradients), in which case:
DDP (2 GPUs on 1 node, OR 1 GPU on each of 2 nodes)
batch per process = 8
gradient = (8g/8) + (8g/8) = 2g
total grad distance = 0.1 * 2g * 5 = g
> thus leave LR the same as single GPU
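The two interpretations can be put side by side numerically (a sketch; ddp_step is a made-up helper that only mimics the reduction arithmetic, not the real DDP gradient hook):

```python
def ddp_step(local_means, average=True):
    """Sum the per-rank mean gradients; optionally divide by world size."""
    total = sum(local_means)
    return total / len(local_means) if average else total

lr = 0.1
steps_per_epoch = 80 // 16   # 5 steps at an effective batch of 16
ranks = [1.0, 1.0]           # each rank's local mean gradient (g = 1)

for average in (True, False):
    g = ddp_step(ranks, average=average)
    print(average, g, lr * g * steps_per_epoch)
# True  1.0 0.5  -> half the single-GPU distance, so LR would need scaling by 2
# False 2.0 1.0  -> already matches the single-GPU distance
```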