RuntimeError: address family mismatch when using the 'gloo' backend

Hi!
I have four machines, and each machine has one GPU device. I want to train my model on all four GPU devices, but it fails.
Below is the information about my machines (node name with IP):

n100: 172.22.99.10
n101: 172.22.99.11
n102: 172.22.99.12
n104: 172.22.99.14

In my program, I use the Gloo backend. If I run the program with 3 nodes (n100, n101, n102), it works well. But when I use all four nodes, I get the following error:

fanxp@n100:~/vscode/pytorch_test$ ./parallel_deepAR.py --rank 0 --world-size 4
Traceback (most recent call last):
  File "./parallel_deepAR.py", line 472, in <module>
    init_process(args.rank, args.world_size, run, 'gloo', args.ip, args.port)
  File "./parallel_deepAR.py", line 313, in init_process
    init_method='tcp://{}:{}'.format(ip, port), rank=rank, world_size=size)
  File "/simm/home/fanxp/.local/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 410, in init_process_group
    timeout=timeout)
  File "/simm/home/fanxp/.local/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 478, in _new_process_group_helper
    timeout=timeout)
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:207] address family mismatch

I think the node n104 may have a different address family, which causes the error. But I don't know how to solve this.
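A quick way to check this guess (just a rough diagnostic sketch, assuming the hostnames n100/n101/n102/n104 resolve via DNS or /etc/hosts; as far as I understand, Gloo falls back to resolving the local hostname to pick its listening address when no interface is specified) is to compare the address families each hostname resolves to using Python's standard socket module:

import socket

# Rough diagnostic: print the address families each node's hostname resolves to.
# If n104 resolves only to IPv6 while the others resolve to IPv4 (or vice versa),
# that would explain the "address family mismatch".
for host in ['n100', 'n101', 'n102', 'n104']:
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
        print(host, sorted({info[0].name for info in infos}))
    except socket.gaierror as err:
        print(host, 'resolution failed:', err)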
Some additional information:

  • the ifconfig output of the network interface on each node:
fanxp@n100:~/vscode/pytorch_test$ ifconfig eth5
eth5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.22.99.10  netmask 255.255.255.0  broadcast 172.22.99.255
        inet6 fe80::1602:ecff:fe69:ef5d  prefixlen 64  scopeid 0x20<link>
        inet6 2400:dd02:100c:3199:1602:ecff:fe69:ef5d  prefixlen 64  scopeid 0x0<global>
        ether 14:02:ec:69:ef:5d  txqueuelen 1000  (Ethernet)
        RX packets 472256109  bytes 701421415319 (701.4 GB)
        RX errors 0  dropped 5470  overruns 0  frame 0
        TX packets 553043129  bytes 818712088574 (818.7 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
fanxp@n101:~/vscode/pytorch_test$ ifconfig eth5
eth5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.22.99.11  netmask 255.255.255.0  broadcast 172.22.99.255
        inet6 fe80::211:aff:fe6c:2345  prefixlen 64  scopeid 0x20<link>
        inet6 2400:dd02:100c:3199:211:aff:fe6c:2345  prefixlen 64  scopeid 0x0<global>
        ether 00:11:0a:6c:23:45  txqueuelen 1000  (Ethernet)
        RX packets 373027705  bytes 535914116118 (535.9 GB)
        RX errors 0  dropped 1720  overruns 0  frame 0
        TX packets 87419537  bytes 80820382770 (80.8 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
fanxp@n102:~/vscode/pytorch_test$ ifconfig eth5
eth5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.22.99.12  netmask 255.255.255.0  broadcast 172.22.99.255
        inet6 fe80::211:aff:fe6c:2325  prefixlen 64  scopeid 0x20<link>
        inet6 2400:dd02:100c:3199:211:aff:fe6c:2325  prefixlen 64  scopeid 0x0<global>
        ether 00:11:0a:6c:23:25  txqueuelen 1000  (Ethernet)
        RX packets 9676903  bytes 10243508657 (10.2 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8458287  bytes 7559359606 (7.5 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
fanxp@n104:~/vscode/pytorch_test$ ifconfig ens1f1
ens1f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.22.99.14  netmask 255.255.255.0  broadcast 172.22.99.255
        inet6 2400:dd02:100c:3199:1602:ecff:fe72:8ae8  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::1602:ecff:fe72:8ae8  prefixlen 64  scopeid 0x20<link>
        ether 14:02:ec:72:8a:e8  txqueuelen 1000  (Ethernet)
        RX packets 6220778  bytes 5698014724 (5.6 GB)
        RX errors 0  dropped 1166  overruns 0  frame 0
        TX packets 12621081  bytes 14816572590 (14.8 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
  • all the network interfaces use InfiniBand
  • the source code is too complex to post in full; I think the code that initializes the process group is the relevant part:
import torch.distributed as dist

def init_process(rank, size, fn, backend='gloo', ip=None, port=None):
    """ Initialize the distributed environment. """
    # os.environ['MASTER_ADDR'] = '172.22.99.10'
    # os.environ['MASTER_PORT'] = '29500'
    # dist.init_process_group(backend, rank=rank, world_size=size)
    # Rendezvous over TCP at the master's ip:port (172.22.99.10:20000 here).
    dist.init_process_group(
        backend=backend,
        init_method='tcp://{}:{}'.format(ip, port), rank=rank, world_size=size)
    fn(rank, size)
  • the master used by the Gloo backend
    address: 172.22.99.10, port: 20000
  • PyTorch version
    PyTorch version: 1.3.0a0+ee77ccb
    Is debug build: No
    CUDA used to build PyTorch: 10.1.243
    OS: Ubuntu 18.04.1 LTS
    GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
    CMake version: version 3.10.2
    Python version: 3.6
    Is CUDA available: Yes
    CUDA runtime version: 10.1.243
    GPU models and configuration: GPU 0: Tesla K40c
    Nvidia driver version: 440.33.01
    cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.2
    Versions of relevant libraries:
    [pip] numpy==1.14.5
    [conda] Could not collect

Can you try specifying GLOO_SOCKET_IFNAME to select the appropriate interface on each node, as described in https://pytorch.org/docs/stable/distributed.html#choosing-the-network-interface-to-use?
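For example (only a sketch; the interface names come from your ifconfig output above, and this assumes socket.gethostname() returns the short names n100..n104), you could set it at the top of init_process, before calling init_process_group:

import os
import socket

# Tell Gloo which interface to bind to on this node. The interface names are
# the ones from the ifconfig output above (eth5 on n100-n102, ens1f1 on n104).
iface_by_host = {'n100': 'eth5', 'n101': 'eth5', 'n102': 'eth5', 'n104': 'ens1f1'}
os.environ['GLOO_SOCKET_IFNAME'] = iface_by_host[socket.gethostname()]

Equivalently, you can export GLOO_SOCKET_IFNAME=eth5 (or ens1f1 on n104) in the shell before launching the script on each node.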