Hi @mrshenli,
This is different from the RPC problem. Back then I was using Google Cloud VMs, and both torch.distributed and RPC worked fine there.
However, we recently built new GPU servers in our lab and connected them through an electrical packet switch. They can ping each other over their internal IPs, which for me are 10.1.1.101 for rank 0 and 10.1.1.102 for rank 1. So I run the following:
import torch.distributed as dist

# on rank 0
dist.init_process_group(
    backend="gloo",
    init_method="tcp://10.1.1.101:29500",
    rank=0,
    world_size=2,
)
import torch.distributed as dist

# on rank 1
dist.init_process_group(
    backend="gloo",
    init_method="tcp://10.1.1.101:29500",
    rank=1,
    world_size=2,
)
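For context, once the group initializes, all I plan to run is a quick sanity check along these lines (just a sketch of the test I have in mind, not code from the failing run):

import torch
import torch.distributed as dist

# Each rank contributes its own rank value; after the SUM all-reduce
# both ranks should see 0 + 1 = 1.
t = torch.tensor([float(dist.get_rank())])
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()} got {t.item()}")  # expected 1.0 on both ranks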
However, init_process_group itself failed with:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-1-532df564c254> in <module>
6 init_method = 'tcp://10.1.1.101:29500',
7 rank = 1,
----> 8 world_size = 2
9 )
~/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in init_process_group(backend, init_method, timeout, world_size, rank, store, group_name)
401 store,
402 group_name=group_name,
--> 403 timeout=timeout)
404
405 _pg_group_ranks[_default_pg] = {i: i for i in range(_default_pg.size())}
~/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _new_process_group_helper(world_size, rank, group_ranks, backend, store, group_name, timeout)
469 rank,
470 world_size,
--> 471 timeout=timeout)
472 _pg_map[pg] = (Backend.GLOO, store)
473 _pg_names[pg] = group_name
RuntimeError: [/opt/conda/conda-bld/pytorch_1587428398394/work/third_party/gloo/gloo/transport/tcp/pair.cc:769] connect [127.0.0.1]:31662: Connection refused
I guess this is the same problem @Oleg_Ivanov is hitting. Regarding
export GLOO_SOCKET_IFNAME=eno2
should I simply run it in any terminal, or does it need to be set in the environment of the process that calls init_process_group? eno2 is my NIC.
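For reference, here is how I currently understand it. My assumption is that the variable has to be visible to the Python process itself, so I would either export it in the same shell before launching Python, or set it from Python right before init_process_group, roughly like this (rank 0 shown; rank 1 would be the same with rank=1, and I am assuming both machines name their NIC eno2):

import os
import torch.distributed as dist

# Assumption: eno2 is the NIC that owns the 10.1.1.x address on each machine.
# Setting GLOO_SOCKET_IFNAME before init_process_group should make Gloo bind
# to that interface instead of whatever the hostname resolves to (127.0.0.1).
os.environ["GLOO_SOCKET_IFNAME"] = "eno2"

dist.init_process_group(
    backend="gloo",
    init_method="tcp://10.1.1.101:29500",
    rank=0,  # use rank=1 on 10.1.1.102
    world_size=2,
)

Does that sound right, or does it need to be set system-wide?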
Please let me know if you have any thoughts. Thank you very much for your help!