Several papers on scaling training to large batch sizes discuss how, as the batch size increases, the learning rate should be increased by the same factor (I can find links if needed, but I believe it's a known phenomenon, sometimes called the linear scaling rule).
As far as I understand, this assumes the final loss is computed as the sum of the per-example losses in the batch.
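To make my understanding concrete, here is a toy NumPy sketch (my own illustration, not from any of those papers) of a linear model with squared-error loss, showing that the sum-reduced gradient is exactly the batch size times the mean-reduced one, which is why the sum reduction interacts with the learning rate:

```python
import numpy as np

# Toy example: gradients of a linear model under sum- vs mean-reduced
# squared-error loss over one batch.
rng = np.random.default_rng(0)
batch_size = 8
X = rng.normal(size=(batch_size, 3))   # batch of inputs
y = rng.normal(size=batch_size)        # targets
w = rng.normal(size=3)                 # model weights

residual = X @ w - y
grad_sum = 2 * X.T @ residual                # d/dw of sum_i (x_i.w - y_i)^2
grad_mean = 2 * X.T @ residual / batch_size  # d/dw of mean_i (x_i.w - y_i)^2

# The sum-reduced gradient is batch_size times the mean-reduced one, so
# growing the batch grows the sum-loss gradient magnitude proportionally,
# while the mean-loss gradient stays on the same scale.
assert np.allclose(grad_sum, batch_size * grad_mean)
```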
Lately, I've noticed that when using DDP with multiple workers (say, on a single node), gradients are averaged across workers, which to my understanding requires no learning-rate adjustment, and seems simpler.
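Here is a small NumPy simulation of what I mean (a sketch of the averaging step, not actual `torch.distributed` code): averaging per-worker gradients of mean-reduced losses over equal-sized shards reproduces the gradient of the mean-reduced loss over the combined global batch, so the effective gradient scale doesn't change with the number of workers:

```python
import numpy as np

# Simulated DDP: each "worker" computes the gradient of a mean-reduced
# squared-error loss on its own shard; averaging those gradients (as DDP's
# all-reduce does) matches the single-process gradient on the global batch.
rng = np.random.default_rng(1)
n_workers, per_worker = 4, 4
X = rng.normal(size=(n_workers * per_worker, 3))
y = rng.normal(size=n_workers * per_worker)
w = rng.normal(size=3)

def mean_loss_grad(Xb, yb, w):
    # gradient of mean_i (x_i.w - y_i)^2 with respect to w
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

shard_grads = [
    mean_loss_grad(X[r * per_worker:(r + 1) * per_worker],
                   y[r * per_worker:(r + 1) * per_worker], w)
    for r in range(n_workers)
]
ddp_grad = np.mean(shard_grads, axis=0)  # what gradient averaging yields
global_grad = mean_loss_grad(X, y, w)    # one process, whole global batch

assert np.allclose(ddp_grad, global_grad)
```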
Should it be considered best practice, then, to always compute the loss as the average over the entries in the batch, which would remove the need to scale the learning rate? Or am I missing something?