Runtime error using Distributed with gloo

CynthiaMY · April 18, 2018, 3:24am

Hello everyone,

I am trying to use the PyTorch distributed package with the gloo backend, but I get the following error.

Master:

Traceback (most recent call last):
  File "Distributed.py", line 169, in <module>
    init_processes(args.rank, size, run)
  File "Distributed.py", line 80, in init_processes
    dist.init_process_group(backend=backend, rank=rank, world_size=size)
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/__init__.py", line 49, in init_process_group
    group_name, rank)
RuntimeError: [enforce fail at /pytorch/torch/lib/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: <My specified Ip> at /pytorch/torch/lib/THD/process_group/General.cpp:17

And another worker gets:

RuntimeError: [enforce fail at /pytorch/torch/lib/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: 10.37.0.1 at /pytorch/torch/lib/THD/process_group/General.cpp:17

Below is how I initialize:

def init_processes(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """

    os.environ['MASTER_ADDR'] = <My specified Ip>
    os.environ['MASTER_PORT'] = '8888' if rank == 0 else '31566'

    print("Init Processes ->", 'backend:', backend, 'rank:', rank, 'MASTER_ADDR:', os.environ['MASTER_ADDR'],
          'MASTER_PORT:', os.environ['MASTER_PORT'])
    dist.init_process_group(backend=backend, rank=rank, world_size=size)
    fn(rank, size)

Any thoughts on what may be causing this or how I can fix it? Thanks.
P.S. I am using a Docker environment with Python 3.5 and PyTorch 0.3.1 with CUDA 9.0.

ifgovh · March 1, 2019, 6:58am

By default, both the NCCL and Gloo backends will try to find the network interface to use for communication. However, in our experience this is not always guaranteed to succeed. Therefore, if either backend is unable to find the correct network interface, you can try setting the following environment variables (each one applies to its respective backend):

NCCL_SOCKET_IFNAME, for example: export NCCL_SOCKET_IFNAME=eth0
GLOO_SOCKET_IFNAME, for example: export GLOO_SOCKET_IFNAME=eth0
https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization
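The same variables can also be set from Python, as long as that happens before dist.init_process_group is called. The following is only a minimal sketch: the helper name configure_rendezvous_env and the sample values in the comments are hypothetical, eth0 is just the example interface from above, and whether a release as old as 0.3.1 honors GLOO_SOCKET_IFNAME has not been confirmed in this thread. It also uses one MASTER_PORT for every rank, since environment-variable initialization expects all processes to connect to the same port on the rank 0 machine.

```python
import os


def configure_rendezvous_env(master_addr, master_port, ifname, env=None):
    """Set the rendezvous variables and the interface hints for both backends.

    `env` defaults to os.environ; passing a plain dict is handy for testing.
    """
    env = os.environ if env is None else env
    env["MASTER_ADDR"] = master_addr
    env["MASTER_PORT"] = str(master_port)  # must be identical on every rank
    env["GLOO_SOCKET_IFNAME"] = ifname     # read by the gloo backend
    env["NCCL_SOCKET_IFNAME"] = ifname     # read by the NCCL backend
    return env


def init_processes(rank, size, fn, backend="gloo"):
    """Initialize the distributed environment, then run fn(rank, size)."""
    # Imported here so the helper above can be used without torch installed.
    import torch.distributed as dist

    dist.init_process_group(backend=backend, rank=rank, world_size=size)
    fn(rank, size)


# Hypothetical usage (address, port, and interface name are placeholders):
#   configure_rendezvous_env("10.0.0.1", 29500, "eth0")
#   init_processes(rank, size, run)
```

Setting the variables in the shell with export, as shown above, has the same effect; doing it in Python just keeps the configuration next to the initialization call.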

BTW, use ifconfig to find the name of your first Ethernet interface (for example, eth0).
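If ifconfig is not installed (which can happen in slim Docker images), the interface names can probably be listed from Python instead. This is a small sketch using the standard library's socket.if_nameindex, which is available on Unix-like systems; the helper name list_interface_names is made up for illustration.

```python
import socket


def list_interface_names():
    """Return the names of the network interfaces visible to this process."""
    # socket.if_nameindex() yields (index, name) pairs.
    return [name for _index, name in socket.if_nameindex()]


if __name__ == "__main__":
    for name in list_interface_names():
        print(name)
```

Run it inside the container, since the interfaces visible there can differ from those on the host.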

platero · July 31, 2020, 6:44pm

Thanks! This saved my day.