On a similar topic to this question: do optimizers work transparently in multiprocess runs, or do I need to average the gradients of each process manually?
The imagenet example in the pytorch/examples repo does not do explicit gradient averaging between processes, but the example on distributed training in PyTorch's tutorials does.
Thanks a lot!
I have a similar question here. I also opened a query in pytorch/fairseq#779, and the response there was that averaging is built in.
How about trying some black-box experiments to figure it out?
If you use vanilla multiprocessing you'll have to do this yourself. If you use it in combination with torch.nn.parallel.DistributedDataParallel then gradient synchronization and averaging is done for you. Also see the documentation on torch.distributed for more information.
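For the "do this yourself" case, here is a minimal sketch of manual gradient averaging with torch.distributed. The helper name average_gradients is only illustrative, and the sketch assumes the default process group has already been initialized and that every rank holds gradients for the same set of parameters (otherwise the collective calls would not line up across ranks).

```python
import torch
import torch.distributed as dist


def average_gradients(model: torch.nn.Module) -> None:
    """Average each parameter's .grad across all ranks of the default group.

    Assumption: every rank calls this with the same parameters populated,
    so the all_reduce calls match up one-to-one across processes.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue  # parameter was not used in this backward pass
        # Sum the gradient over all ranks in place ...
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        # ... then divide by the number of ranks to turn the sum into a mean.
        param.grad.div_(world_size)
```

In a plain multiprocessing training loop this would be called between loss.backward() and optimizer.step(). When the model is wrapped in DistributedDataParallel, no such call should be needed.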
@pietern can you show me in the source where the average is done? for the life of me i’ve been all over the codebase and i can’t find it.
i’m looking here
@makslevental It’s done not in the autograd code but in the DDP reducer code:
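The linked reducer source is not reproduced here. As a rough, framework-free sketch of the arithmetic that "sum across ranks, then divide by the world size" performs, the snippet below simulates two ranks with plain Python lists. The scalar linear model, the data, and the helper names (mse_grad, all_reduce_mean) are all made up for illustration and are not taken from the DDP code. It assumes equal-sized shards and a mean-reduced loss on each rank, in which case the averaged gradient matches the gradient of the mean loss over the combined batch.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def mse_grad(w: float, batch: List[Sample]) -> float:
    """Gradient w.r.t. w of mean((w * x - y) ** 2) over the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values: List[float]) -> float:
    """Stand-in for an all-reduce: sum one value per rank, divide by rank count."""
    return sum(values) / len(values)


w = 0.5
full_batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]
shards = [full_batch[:2], full_batch[2:]]  # two simulated ranks, equal shard sizes

# Each simulated rank computes the gradient of its own mean loss.
local_grads = [mse_grad(w, shard) for shard in shards]

# Averaging the per-rank gradients ...
averaged = all_reduce_mean(local_grads)

# ... should agree with the single-process gradient over the whole batch.
reference = mse_grad(w, full_batch)
```

With unequal shard sizes the two values would generally differ, which is one reason to keep per-process batch sizes the same.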