Connection reset by peer from torch.distributed.recv


I am using torch.distributed with the gloo backend (because I need peer-to-peer communication). While running my test script, I got a ‘Connection reset by peer’ error when dist.recv is called. Any clue as to what causes this?

I am using MPI to launch 2 processes. My script is pasted below:

import os
import socket

import torch
import torch.distributed as dist

def run(rank, size):
    tensor = torch.zeros(1).cuda()
    if rank == 0:
        tensor += 1
        # Send the tensor to process 1
        dist.send(tensor=tensor, dst=1)
    else:
        # Receive tensor from process 0
        dist.recv(tensor=tensor, src=0)
    print('Rank ', rank, ' has data ', tensor[0])

def init_processes(rank, size, addr, fn, backend):
    os.environ['MASTER_ADDR'] = addr
    os.environ['MASTER_PORT'] = '12345'
    my_rank = os.environ['OMPI_COMM_WORLD_RANK']
    dist.init_process_group(backend, rank=my_rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    hostname = socket.gethostname()
    addr = socket.gethostbyname(hostname)
    size = 2
    my_rank = os.environ['OMPI_COMM_WORLD_RANK']
    init_processes(my_rank, size, addr, run, 'gloo')

The error I got is:

  File "/home/xzhu1900/anaconda3/envs/test_py37/lib/python3.7/site-packages/torch/distributed/", line 712, in recv
    pg.recv([tensor], src, tag).wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/] Read error []:48028: Connection reset by peer
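
One detail in the script above that may be worth checking: values read from os.environ are always strings, so the rank passed to run() is '0' or '1' rather than 0 or 1, and the comparison rank == 0 is False on every process. If that is what happens here, no process ever calls dist.send, and rank 0 ends up calling dist.recv with src=0, which might explain the reset. This is only a guess and is not confirmed as the cause. The sketch below uses a hypothetical helper, read_rank, and a fake environment dict (both are assumptions for illustration, not part of the original script) to show the difference:

```python
import os


def read_rank(env=os.environ, key="OMPI_COMM_WORLD_RANK"):
    """Hypothetical helper: return the rank set by the MPI launcher as an int."""
    return int(env[key])


# Fake environment standing in for what the launcher would export (assumption).
fake_env = {"OMPI_COMM_WORLD_RANK": "0"}

raw_rank = fake_env["OMPI_COMM_WORLD_RANK"]
print(raw_rank == 0)             # False: a str never equals an int
print(read_rank(fake_env) == 0)  # True: compared as integers
```

In the script, this would amount to wrapping the os.environ lookup in int(...) before passing the rank to init_processes and run.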