Runtime error using Distributed with gloo

Hello everyone,

I am trying to use the PyTorch distributed package with the Gloo backend, but I get the following error:


Traceback (most recent call last):
  File "", line 169, in <module>
    init_processes(args.rank, size, run)
  File "", line 80, in init_processes
    dist.init_process_group(backend=backend, rank=rank, world_size=size)
  File "/usr/local/lib/python3.5/dist-packages/torch/distributed/", line 49, in init_process_group
    group_name, rank)
RuntimeError: [enforce fail at /pytorch/torch/lib/gloo/gloo/transport/tcp/] rp != nullptr. Unable to find address for: <My specified Ip> at /pytorch/torch/lib/THD/process_group/General.cpp:17

And another worker gets:

RuntimeError: [enforce fail at /pytorch/torch/lib/gloo/gloo/transport/tcp/] rp != nullptr. Unable to find address for: at /pytorch/torch/lib/THD/process_group/General.cpp:17

Below is how I initialize:

import os
import torch.distributed as dist


def init_processes(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """

    os.environ['MASTER_ADDR'] = <My specified Ip>
    os.environ['MASTER_PORT'] = '8888' if rank == 0 else '31566'

    print("Init Processes ->", 'backend:', backend, 'rank:', rank,
          'MASTER_ADDR:', os.environ['MASTER_ADDR'],
          'MASTER_PORT:', os.environ['MASTER_PORT'])
    dist.init_process_group(backend=backend, rank=rank, world_size=size)
    fn(rank, size)

Any thoughts on what may be causing this or how I can fix it? Thanks.
PS: I am using a Docker environment with Python 3.5, PyTorch 0.3.1, and CUDA 9.0.


By default, both the NCCL and Gloo backends try to automatically find the network interface to use for communication. However, in our experience this is not always successful. If either backend fails to find the correct network interface, you can try setting the following environment variables (each applicable to its respective backend):
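The list of variables appears to be missing from the post; presumably it refers to GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME, which pin each backend to a specific interface. A minimal sketch, assuming the relevant interface is named eth0 (it may differ on your machine):

```python
import os

# Set these before calling dist.init_process_group().
# 'eth0' is an assumed interface name; substitute the interface
# that carries the IP you pass as MASTER_ADDR.
os.environ['GLOO_SOCKET_IFNAME'] = 'eth0'  # used by the Gloo backend
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'  # used by the NCCL backend
```

The same effect can be had by exporting the variables in the shell before launching each worker.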

BTW, use ifconfig to find the name of your first Ethernet interface.


Thanks! This saved my day.