Distributed training


I am training a network on two machines, each with 2 GPUs, and I use the torch.distributed.launch utility to launch the job. I followed the tutorial at https://pytorch.org/docs/stable/distributed.html. My question is the following.

When I perform an optimizer step with optimizer.step(), do I have to average the gradients across all the other machines myself, with something like

    for param in self.net.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= size

Or do I not need to do this?

In this ImageNet example, they do not gather the gradients from the other devices, so it looks like the averaging is built into optimizer.step(). But in this other example, they define an average_gradients function:

    def average_gradients(self):
        world_size = distributed.get_world_size()
        group = distributed.new_group(ranks=list(range(world_size)))

        for p in self.net.parameters():
            tensor = p.grad.data.cpu()
            distributed.all_reduce(tensor, op=distributed.ReduceOp.SUM, group=group)
            tensor /= float(world_size)
            p.grad.data = tensor.to(self.device)

So I do not know whether I need to do this manually or not.
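For reference, what both snippets compute (whether done manually or by the framework) is the element-wise mean of the per-rank gradients: all_reduce with SUM leaves every rank holding the sum, and dividing by world_size turns it into the mean. A minimal pure-Python sketch with simulated ranks (no torch; the function name and data are illustrative only):

```python
# Simulate dist.all_reduce(op=SUM) followed by division by world_size.
# Each "rank" holds a local gradient for the same parameter; after the
# collective, every rank ends up with the element-wise mean.

def all_reduce_mean(local_grads):
    world_size = len(local_grads)
    summed = [sum(vals) for vals in zip(*local_grads)]  # all_reduce with SUM
    mean = [s / world_size for s in summed]             # divide by world_size
    return [list(mean) for _ in range(world_size)]      # every rank gets a copy

# 4 ranks (2 machines x 2 GPUs each), one parameter with 2 gradient entries
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(all_reduce_mean(grads)[0])  # [4.0, 5.0]
```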