It’s possible that NCCL takes longer to initialize in your setup. Could you try bumping the timeout passed to init_process_group and see whether that resolves the issue?
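For example, a minimal sketch of bumping the timeout, assuming the NCCL backend and an env:// rendezvous (the one-hour value is arbitrary):

from datetime import timedelta

import torch.distributed as dist

# Raise the timeout from the 30-minute default so slow NCCL initialization
# on some ranks does not trip the limit.
dist.init_process_group(
    backend="nccl",
    init_method="env://",  # assumes MASTER_ADDR / MASTER_PORT are set
    timeout=timedelta(hours=1),
)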
In addition, it might be useful to add a dist.barrier() call after init_process_group: it synchronizes all ranks once they have completed initialization successfully, which can help with debugging. In this particular case, it is also possible that machine “A” exits the script early, which tears down the store hosted on rank 0 and causes rank 1’s error.
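A minimal sketch of that setup, assuming the script is launched with torchrun so LOCAL_RANK and the rendezvous variables are set (run_training is a placeholder for the real per-rank work):

import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)           # NCCL needs a device per rank

dist.init_process_group(backend="nccl")
dist.barrier()  # every rank reaches here only if all ranks initialized

run_training()  # placeholder for the actual per-rank work

dist.barrier()  # keep rank 0 (which hosts the store) alive until all ranks finish
dist.destroy_process_group()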
I read the source code of init_process_group. It already performs a barrier at the end automatically:
def init_process_group(
    backend,
    init_method=None,
    timeout=default_pg_timeout,
    world_size=-1,
    rank=-1,
    store=None,
    group_name="",
    pg_options=None,
):
    # ......
    # barrier at the end to ensure that once we return from this method, all
    # process groups including global variables are updated correctly on all
    # ranks.
    if backend == Backend.MPI:
        # MPI backend doesn't use store.
        barrier()
    else:
        # Use store based barrier here since barrier() used a bunch of
        # default devices and messes up NCCL internal state.
        _store_based_barrier(rank, store, timeout)
    # Set sequence numbers for gloo and nccl process groups.
    if get_backend(default_pg) in [Backend.GLOO, Backend.NCCL]:
        default_pg._set_sequence_number_for_group()
I think adding another dist.barrier() call after init_process_group is not needed.
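For intuition, here is an illustrative sketch of how a store-based barrier can work; this is not PyTorch's actual _store_based_barrier, and the key name and polling interval are made up. Each rank increments a shared counter in the rendezvous store and polls until every rank has arrived, so no collective communication (and hence no NCCL device state) is involved:

import time
from datetime import timedelta

import torch.distributed as dist

def store_barrier(store: dist.Store, world_size: int,
                  timeout: timedelta = timedelta(minutes=5)) -> None:
    key = "example_store_barrier"  # hypothetical key name
    store.add(key, 1)              # atomically register this rank's arrival
    deadline = time.monotonic() + timeout.total_seconds()
    while store.add(key, 0) < world_size:  # add(key, 0) reads the counter
        if time.monotonic() > deadline:
            raise RuntimeError("store-based barrier timed out")
        time.sleep(0.01)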