Hi,
I’m trying to understand whether variations in local loss values across GPUs during DDP training are a problem.
For instance, if two GPUs run with the same per-GPU batch size but report substantially different local losses, does DDP’s gradient averaging effectively give the GPU with the smaller local loss an inflated gradient?
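To make the situation concrete, here is a toy repro sketch of what I mean (all names and shapes are my own illustration; I use the gloo backend on CPU with two processes standing in for two GPUs, and seed each rank differently so the local losses diverge):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Different seed per rank -> different data -> different local losses.
    # DDP broadcasts rank 0's parameters at wrap time, so weights start in sync.
    torch.manual_seed(rank)
    model = DDP(torch.nn.Linear(10, 1))
    x, y = torch.randn(8, 10), torch.randn(8, 1)

    loss = F.mse_loss(model(x), y)
    loss.backward()  # DDP all-reduces and averages gradients across ranks here

    # Every rank should end up with the same averaged gradient,
    # even though the local losses differ.
    print(f"rank {rank}: local loss = {loss.item():.4f}, "
          f"grad[0,0] = {model.module.weight.grad[0, 0].item():.6f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

When I run something like this, both ranks print the same gradient value despite unequal local losses, which is what prompts my question about whether the lower-loss rank is effectively being pushed by a gradient larger than its own data would produce.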