When to use dist.all_gather()

Hi,

Thanks for reading this post. Recently I got access to multiple GPUs, so I started using torch.distributed. I have a few questions about the all_gather() function:

  1. If I do not use it, does that mean my models will be updated separately on different GPUs?

  2. If I do need to use it, can I simply apply it to the computed loss? That way I would get a combined loss, which would let me update the models on all GPUs in the same way?

  3. What about the BN layer? If I only use it on the loss, will the BN layers or other normalization layers be updated correctly?

Thanks for your time!

It seems you want to implement a distributed workflow manually using collective operations.
I’m unsure why that would be needed and would recommend starting with e.g. DistributedDataParallel first, as it takes care of the gradient reduction etc. for you.
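
Here is a minimal sketch of that approach, assuming a single-node job launched with torchrun; the model, dataset, and hyperparameters are just placeholders:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and data; replace with your own
    model = nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(128, 10), torch.randn(128, 1))
    sampler = DistributedSampler(dataset)  # shards the data across ranks
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)

    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            # backward() triggers the all-reduce of gradients across ranks,
            # so every replica applies the same update; no manual all_gather needed
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

You would launch it with something like `torchrun --nproc_per_node=2 train.py`.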