When to use dist.all_gather()

Hi,

Thanks for reading this post. Recently I got access to multiple GPUs, so I started using torch.distributed. I have a few questions about the all_gather() function:

  1. If I do not use it, does that mean my models will be updated separately on different GPUs?

  2. If I do need to use it, can I simply apply it to the computed loss? That way I would get a combined loss, which would let me update the models on all GPUs in the same way?

  3. What about the BN layer? If I only use it on the loss, will the BN layers or other normalization layers be updated correctly?

Thanks for your time!

It seems you want to implement a distributed workflow manually using collective operations.
I’m unsure why that would be needed and would recommend starting with e.g. DistributedDataParallel first, as it takes care of the gradient reduction etc. for you.
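
Here is a minimal sketch of that approach, assuming a single-node job launched with torchrun; the model, dataset, and hyperparameters are just placeholders:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and data; replace with your own
    model = nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(128, 10), torch.randn(128, 1))
    sampler = DistributedSampler(dataset)  # shards the data across ranks
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)

    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            # backward() triggers the all-reduce of gradients across ranks,
            # so every replica applies the same update; no manual all_gather needed
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

You would launch it with something like `torchrun --nproc_per_node=2 train.py`.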