Parallel training on multiple GPUs

I’m wondering how parallel training works with Distributed Data Parallel. I’ve been reading a couple of blog posts, and here is my understanding; I’d appreciate it if you could correct me where I’m wrong.

  1. When the model is copied onto multiple GPUs, the weights should all be the same. Is this correct?
  2. After each forward and backward pass, each GPU computes the loss and its gradients individually. Then all of these gradients are aggregated and averaged, and the average is passed back to each GPU to update the weights (see the sketch after this list). Is this correct?
  3. After averaging the gradients and updating the weights, all GPUs should have the same weights. Is this correct?

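To check my understanding, here is a rough, hand-rolled sketch of steps 1-3. It is not what DDP actually does internally (as far as I know the real implementation buckets gradients and overlaps communication with the backward pass); `model`, `loss_fn`, `optimizer`, and `batch` are just placeholders, and it assumes the process group is already initialized with one process per GPU.

```python
# Hand-rolled sketch of steps 1-3 above, for illustration only.
# Assumes torch.distributed is already initialized (one process per GPU)
# and that every rank built the model identically (step 1).
# model, loss_fn, optimizer, and batch are placeholders.
import torch.distributed as dist

def train_step(model, loss_fn, optimizer, batch):
    optimizer.zero_grad()
    loss = loss_fn(model(batch["x"]), batch["y"])   # each GPU computes its own loss
    loss.backward()                                 # ...and its own gradients (step 2)

    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients across GPUs
            p.grad /= world_size                           # ...then average them (step 2)

    # identical starting weights + identical averaged gradients
    # => identical updated weights on every GPU (step 3)
    optimizer.step()
```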
Also, is there any other method that does not do this averaging, which may cause the model copies on different GPUs to differ?
Are there any good blog posts I can read about the detailed theory behind it?

Yes, that sounds like the correct understanding.

I believe this kind of gradient all-reduce (conceptually a reduce followed by a broadcast) with averaging is standard in the data-parallel regime. However, other ways of utilizing multiple GPUs exist (section 2.1 of a recent paper provides an overview: https://arxiv.org/pdf/2201.12023.pdf), especially when the model or its activations become too large and need to be split across multiple devices.
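If it helps, here is a minimal sketch of how this usually looks in PyTorch: wrapping the model in `DistributedDataParallel` broadcasts the initial weights from rank 0 and performs the gradient averaging automatically during `backward()`. The toy model, random data, and hyperparameters below are placeholders, and it assumes one process per GPU launched with `torchrun`.

```python
# Minimal DDP sketch: one process per GPU, launched e.g. with
#   torchrun --nproc_per_node=NUM_GPUS this_script.py
# The model, data, and hyperparameters are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                   # reads rank/world size from torchrun's env vars
    local_rank = int(os.environ["LOCAL_RANK"])        # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda(local_rank)   # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])   # weights broadcast from rank 0,
                                                      # so every replica starts identical
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for _ in range(10):                               # placeholder training loop
        x = torch.randn(32, 10).cuda(local_rank)      # each rank sees its own slice of data
        y = torch.randn(32, 1).cuda(local_rank)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()                               # gradients are all-reduced (averaged)
                                                      # across ranks during backward()
        optimizer.step()                              # identical update on every rank

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```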
