I’m wondering how parallel training (Distributed Data Parallel) works. I’ve been reading a couple of blog posts, and here is my understanding; I’d appreciate it if you could correct me where I’m wrong.
- When the model is copied onto multiple GPUs, the weights should all be the same. Is this correct?
- After each forward pass, each GPU computes the loss and its gradient individually. Then all of these gradients are aggregated, averaged, and sent back to each GPU to update the weights. Is this correct?
- After averaging the gradients and updating the weights, all GPUs should have the same weights. Is this correct?
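To make my understanding concrete, here is a toy NumPy sketch of what I think the averaging step does (plain Python, no real GPUs or `torch.distributed` calls; the replica count and learning rate are made up for illustration):

```python
import numpy as np

# Toy simulation of DDP-style gradient averaging (no actual GPUs).
# Each "replica" holds a copy of the weights but sees its own data shard.

rng = np.random.default_rng(0)
n_replicas, dim, lr = 4, 3, 0.1

# Step 1: all replicas start from identical weights.
weights = [np.ones(dim) for _ in range(n_replicas)]

# Step 2: each replica computes its own local gradient
# (random numbers stand in for per-shard gradients here).
local_grads = [rng.normal(size=dim) for _ in range(n_replicas)]

# Step 3: the all-reduce step — sum gradients across replicas, then average.
avg_grad = sum(local_grads) / n_replicas

# Step 4: every replica applies the SAME averaged gradient.
weights = [w - lr * avg_grad for w in weights]

# Since they started equal and applied identical updates,
# the replicas should remain exactly in sync.
for w in weights[1:]:
    assert np.array_equal(w, weights[0])
```

If that picture is right, the replicas never diverge because the update each one applies is identical.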
Also, is there any method that doesn’t average the gradients, which could cause the model replicas to diverge?
Are there any good blog posts I can read about the detailed theory behind it?