I’m trying to send model parameters from one machine to another using torch.distributed, and it boils down to this:
import torch
import torch.distributed as dist

# Sending from Machine 0 (rank 0)
for param in model.parameters():
    req = dist.isend(param, dst=1)
    req.wait()  # isend is asynchronous; wait before reusing the buffer

# Receiving on Machine 1 (rank 1)
for param in model.parameters():
    other_param = torch.zeros_like(param)
    req = dist.irecv(other_param, src=0)
    req.wait()  # block until the tensor has actually arrived
    assert torch.equal(other_param, param)  # holds if the two models are equal
Will I receive the parameters I expect from this? In my tests they’re equal, but I’m unsure whether this holds in general in other scenarios.
This boils down to sending a tag along with each tensor to identify which tensor you are sending. In mpi4py, this is possible with their tag interface:
http://mpi4py.readthedocs.io/en/stable/overview.html?highlight=tag#point-to-point-communications
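torch.distributed exposes the same idea: isend/irecv (and send/recv) accept a tag argument, and a receive only matches a send carrying the same tag. Below is a minimal sketch of how that could look, assuming a two-rank process group is already initialized and both ranks hold models with identical architectures; exchange_parameters is just an illustrative helper name, and as far as I know tags are not supported by the NCCL backend.

import torch
import torch.distributed as dist

def exchange_parameters(model, rank):
    """Tag each parameter with its index so the receiver knows which tensor is which."""
    if rank == 0:
        # Sender: issue one tagged isend per parameter, then wait on all of them
        reqs = [dist.isend(p, dst=1, tag=i) for i, p in enumerate(model.parameters())]
        for req in reqs:
            req.wait()
    else:
        # Receiver: the tag guarantees the i-th buffer matches the i-th parameter sent
        for i, p in enumerate(model.parameters()):
            buf = torch.zeros_like(p)
            dist.irecv(buf, src=0, tag=i).wait()
            assert torch.equal(buf, p)  # holds if both ranks hold identical models

With matching tags, the pairing of sends and receives no longer depends on the order in which the two loops happen to progress, which is the uncertainty the original question is about.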