If you’re trying to average gradients across processes, do this instead (it will be faster than a reduce followed by a broadcast):
import torch.distributed as dist

grad = param.grad
grad.div_(num_processes)   # scale the local gradient first
dist.all_reduce(grad)      # in-place sum across processes -> the average
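In a full training loop this runs right after backward() and before the optimizer step; a minimal sketch, assuming model, optimizer, loader, criterion, and num_processes are already set up:

import torch.distributed as dist

for data, target in loader:
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    for param in model.parameters():
        if param.grad is not None:
            param.grad.div_(num_processes)   # scale locally
            dist.all_reduce(param.grad)      # sum in place -> every rank holds the average
    optimizer.step()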
Also, you might want to use torch.nn.parallel.DistributedDataParallel instead. It automatically optimizes the transfers, giving you better performance.
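For reference, a minimal DistributedDataParallel setup might look like the sketch below (the gloo backend, env:// init, and stand-in model are placeholder choices; launch one process per rank with the usual rank/world-size environment variables set):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend='gloo', init_method='env://')  # rank/world size from env vars
model = torch.nn.Linear(10, 10)             # stand-in model for illustration
ddp_model = DistributedDataParallel(model)
# Train ddp_model as usual; gradients are averaged across processes during backward().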
I want to implement the parameter server architecture, so I need to send all gradients/parameters to a parameter server, average them there, and then send the result back to the workers. The reduce operation looks similar to what I want, but it isn't quite the same. And all_reduce looks like an operation where the workers communicate among themselves; does that setting incur more communication cost? http://pytorch.org/tutorials/intermediate/dist_tuto.html
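To make the flow concrete, here is a sketch of what I mean, with rank 0 standing in for the parameter server (assuming the process group is already initialized, model exists, and num_processes = dist.get_world_size()):

import torch.distributed as dist

server_rank = 0
for param in model.parameters():
    # Sum every worker's gradient onto the "server" rank.
    dist.reduce(param.grad, dst=server_rank)
    if dist.get_rank() == server_rank:
        param.grad.div_(num_processes)          # average on the server
    # Send the averaged gradient back to all workers.
    dist.broadcast(param.grad, src=server_rank)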
Does PyTorch support the parameter server architecture, i.e. a parameter server plus workers, as in Mu Li's paper?
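(A toy sketch of that pattern using PyTorch's point-to-point primitives, just to illustrate that the building blocks exist; rank 0 plays the server, and the tensor shape is arbitrary:)

import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called.
if dist.get_rank() == 0:                     # rank 0 acts as the parameter server
    total = torch.zeros(10)
    buf = torch.zeros(10)
    for worker in range(1, dist.get_world_size()):
        dist.recv(buf, src=worker)           # collect a worker's gradient
        total += buf
    total /= dist.get_world_size() - 1       # average over the workers
    for worker in range(1, dist.get_world_size()):
        dist.send(total, dst=worker)         # ship the average back
else:
    grad = torch.randn(10)                   # this worker's local gradient (toy data)
    dist.send(grad, dst=0)
    dist.recv(grad, src=0)                   # grad now holds the average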