General question about distributed training. To my knowledge, each round begins by synchronizing parameters across all workers, with each worker saving a copy. Each worker then trains locally for n epochs and computes the difference between its new parameters and the parameters it started with at synchronization. All workers send this difference to the parameter server, which maintains the global model, averages the differences, and applies the average to the global parameters: w = w - l * mean(diffs). Is this the same as averaging the local gradients instead? I was told that PyTorch's distributed module does the latter. Could someone point me to any papers about averaging local gradients that are computed during the backward pass of each training step?
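To make the comparison concrete, here is a minimal sketch in plain Python of the two schemes as I understand them. Everything in it is an assumption for illustration: each worker has a toy scalar quadratic loss, local training is n plain SGD steps rather than full epochs, the server learning rate is 1.0, and the names `delta_averaging_round` and `gradient_averaging` are hypothetical. It is not how PyTorch implements anything.

```python
import math

# Hypothetical toy setup: worker k has loss 0.5 * a * (w - b)**2 on a scalar w.
WORKERS = [(1.0, 2.0), (3.0, -1.0)]  # (a, b) per worker


def local_grad(w, a, b):
    """Gradient of the toy quadratic loss for one worker."""
    return a * (w - b)


def delta_averaging_round(w_global, workers, lr, n_steps, server_lr=1.0):
    """Each worker runs n_steps of local SGD, then sends (start - new)."""
    diffs = []
    for a, b in workers:
        w = w_global  # synchronized starting copy
        for _ in range(n_steps):
            w -= lr * local_grad(w, a, b)
        diffs.append(w_global - w)  # parameter difference, not a true gradient
    # Server update: w = w - l * mean(diffs)
    return w_global - server_lr * sum(diffs) / len(diffs)


def gradient_averaging(w_global, workers, lr, n_steps):
    """Workers average their gradients at every step and stay in sync."""
    w = w_global
    for _ in range(n_steps):
        avg_grad = sum(local_grad(w, a, b) for a, b in workers) / len(workers)
        w -= lr * avg_grad
    return w


if __name__ == "__main__":
    for n in (1, 2):
        d = delta_averaging_round(0.0, WORKERS, lr=0.1, n_steps=n)
        g = gradient_averaging(0.0, WORKERS, lr=0.1, n_steps=n)
        print(n, d, g, math.isclose(d, g, abs_tol=1e-12))
```

In this toy case the two schemes agree when each worker takes a single local SGD step, and they diverge once workers take more than one local step, because each worker's later gradients are evaluated at its own drifted parameters rather than at the shared ones.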