I am training a network on two machines, each with 2 GPUs. I use the torch.distributed.launch utility to launch the job, following the tutorial at https://pytorch.org/docs/stable/distributed.html. My question is the following.
When my code takes an optimizer step with optimizer.step(), do I have to average the gradients from the other machines myself, with something like
```python
for param in self.net.parameters():
    dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM)
    param.grad.data /= size
```
or is that unnecessary?
In this ImageNet example, they do not gather the gradients from the other devices, and it looks like the averaging is built into optimizer.step(). But in this other example, they define a function `average_gradients`:
```python
def average_gradients(self):
    world_size = distributed.get_world_size()
    for p in self.net.parameters():
        group = distributed.new_group(ranks=list(range(world_size)))
        tensor = p.grad.data.cpu()
        distributed.all_reduce(tensor, op=distributed.reduce_op.SUM, group=group)
        tensor /= float(world_size)
        p.grad.data = tensor.to(self.device)
```
So I am not sure whether I need to do this averaging manually or not.
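For context, my current understanding is that the ImageNet example works without manual averaging because the model is wrapped in `torch.nn.parallel.DistributedDataParallel`, which registers hooks that all-reduce (average) the gradients during `backward()`, before `optimizer.step()` is ever called. Here is a minimal, runnable single-process sketch of that wrapping (the single-process `gloo` setup and the tiny `Linear` model are just placeholders so the snippet runs standalone; `torch.distributed.launch` would normally set the environment variables and spawn multiple ranks):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup so the sketch runs standalone;
# torch.distributed.launch normally provides these.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

net = torch.nn.Linear(4, 2)  # placeholder model
# DDP registers backward hooks that all-reduce and average
# the gradients across all processes during backward().
ddp_net = DDP(net)

opt = torch.optim.SGD(ddp_net.parameters(), lr=0.1)
loss = ddp_net(torch.randn(8, 4)).sum()
loss.backward()  # gradients are already averaged here
opt.step()       # so step() needs no extra all_reduce

dist.destroy_process_group()
```

If that understanding is right, the manual `average_gradients` pattern would only be needed when using the raw `torch.distributed` primitives without the DDP wrapper.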