I’m trying to send model parameters from one machine to another using torch.distributed, and it boils down to this:
import torch
import torch.distributed as dist

# Sending from Machine 0 (rank 0)
for param in model.parameters():
    req = dist.isend(param, dst=1)
    req.wait()  # isend is asynchronous; wait before reusing the buffer

# Receiving on Machine 1 (rank 1)
for param in model.parameters():
    other_param = torch.zeros_like(param)
    req = dist.irecv(other_param, src=0)
    req.wait()  # block until the tensor has actually arrived
    assert torch.equal(other_param, param)  # holds if the two models are equal
Will I receive the parameters I expect from this? In my tests they’re equal, but I’m unsure whether this holds in general in other scenarios.
This boils down to sending a tag along with each tensor to identify which tensor you are sending. In mpi4py, this is possible with their tag interface:
http://mpi4py.readthedocs.io/en/stable/overview.html?highlight=tag#point-to-point-communications
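torch.distributed exposes the same idea: isend/irecv (and send/recv) accept a tag argument, and a receive only matches a send carrying the same tag. Below is a minimal sketch of how that could look, assuming a two-rank process group is already initialized and both ranks hold models with identical architectures; exchange_parameters is just an illustrative helper name, and as far as I know tags are not supported by the NCCL backend.

import torch
import torch.distributed as dist

def exchange_parameters(model, rank):
    """Tag each parameter with its index so the receiver knows which tensor is which."""
    if rank == 0:
        # Sender: issue one tagged isend per parameter, then wait on all of them
        reqs = [dist.isend(p, dst=1, tag=i) for i, p in enumerate(model.parameters())]
        for req in reqs:
            req.wait()
    else:
        # Receiver: the tag guarantees the i-th buffer matches the i-th parameter sent
        for i, p in enumerate(model.parameters()):
            buf = torch.zeros_like(p)
            dist.irecv(buf, src=0, tag=i).wait()
            assert torch.equal(buf, p)  # holds if both ranks hold identical models

With matching tags, the pairing of sends and receives no longer depends on the order in which the two loops happen to progress, which is the uncertainty the original question is about.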