Hi!
I have four machines, and each machine has one GPU device. I want to train my model on all four GPU devices, but it fails.
Below is the information about my machines (node name with IP):
n100: 172.22.99.10
n101: 172.22.99.11
n102: 172.22.99.12
n104: 172.22.99.14
In my program, I use the Gloo backend. If I run the program with three nodes (n100, n101, n102), it works well. But when I use all four nodes, I get the following error:
fanxp@n100:~/vscode/pytorch_test$ ./parallel_deepAR.py --rank 0 --world-size 4
Traceback (most recent call last):
  File "./parallel_deepAR.py", line 472, in <module>
    init_process(args.rank, args.world_size, run, 'gloo', args.ip, args.port)
  File "./parallel_deepAR.py", line 313, in init_process
    init_method='tcp://{}:{}'.format(ip, port), rank=rank, world_size=size)
  File "/simm/home/fanxp/.local/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 410, in init_process_group
    timeout=timeout)
  File "/simm/home/fanxp/.local/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 478, in _new_process_group_helper
    timeout=timeout)
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:207] address family mismatch
I think the node n104 may have a different address family, which causes the error. But I don't know how to solve this.
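To illustrate what I mean by address family: as far as I understand, Gloo's TCP transport raises this error when one peer's socket is IPv4 and another peer's resolved address is IPv6. A small diagnostic snippet (not part of my training code) to check which families the master address resolves to on each node:

import socket

# Print every address family the master address resolves to on this node.
# 172.22.99.10:20000 is the master address/port used in my setup.
for family, _, _, _, sockaddr in socket.getaddrinfo(
        '172.22.99.10', 20000, proto=socket.IPPROTO_TCP):
    label = {socket.AF_INET: 'IPv4', socket.AF_INET6: 'IPv6'}.get(family, str(family))
    print(label, sockaddr)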
Some additional information:
- the ifconfig output of the network interface on each node:
fanxp@n100:~/vscode/pytorch_test$ ifconfig eth5
eth5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.22.99.10 netmask 255.255.255.0 broadcast 172.22.99.255
inet6 fe80::1602:ecff:fe69:ef5d prefixlen 64 scopeid 0x20<link>
inet6 2400:dd02:100c:3199:1602:ecff:fe69:ef5d prefixlen 64 scopeid 0x0<global>
ether 14:02:ec:69:ef:5d txqueuelen 1000 (Ethernet)
RX packets 472256109 bytes 701421415319 (701.4 GB)
RX errors 0 dropped 5470 overruns 0 frame 0
TX packets 553043129 bytes 818712088574 (818.7 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
fanxp@n101:~/vscode/pytorch_test$ ifconfig eth5
eth5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.22.99.11 netmask 255.255.255.0 broadcast 172.22.99.255
inet6 fe80::211:aff:fe6c:2345 prefixlen 64 scopeid 0x20<link>
inet6 2400:dd02:100c:3199:211:aff:fe6c:2345 prefixlen 64 scopeid 0x0<global>
ether 00:11:0a:6c:23:45 txqueuelen 1000 (Ethernet)
RX packets 373027705 bytes 535914116118 (535.9 GB)
RX errors 0 dropped 1720 overruns 0 frame 0
TX packets 87419537 bytes 80820382770 (80.8 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
fanxp@n102:~/vscode/pytorch_test$ ifconfig eth5
eth5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.22.99.12 netmask 255.255.255.0 broadcast 172.22.99.255
inet6 fe80::211:aff:fe6c:2325 prefixlen 64 scopeid 0x20<link>
inet6 2400:dd02:100c:3199:211:aff:fe6c:2325 prefixlen 64 scopeid 0x0<global>
ether 00:11:0a:6c:23:25 txqueuelen 1000 (Ethernet)
RX packets 9676903 bytes 10243508657 (10.2 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 8458287 bytes 7559359606 (7.5 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
fanxp@n104:~/vscode/pytorch_test$ ifconfig ens1f1
ens1f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.22.99.14 netmask 255.255.255.0 broadcast 172.22.99.255
inet6 2400:dd02:100c:3199:1602:ecff:fe72:8ae8 prefixlen 64 scopeid 0x0<global>
inet6 fe80::1602:ecff:fe72:8ae8 prefixlen 64 scopeid 0x20<link>
ether 14:02:ec:72:8a:e8 txqueuelen 1000 (Ethernet)
RX packets 6220778 bytes 5698014724 (5.6 GB)
RX errors 0 dropped 1166 overruns 0 frame 0
TX packets 12621081 bytes 14816572590 (14.8 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
- all the network interfaces use InfiniBand
- the source code is too complex to post in full, but I think the process-initialization code may be helpful:
import os
import torch.distributed as dist

def init_process(rank, size, fn, backend='gloo', ip=None, port=None):
    """Initialize the distributed environment, then run fn on this rank."""
    # Alternative rendezvous via environment variables:
    # os.environ['MASTER_ADDR'] = '172.22.99.10'
    # os.environ['MASTER_PORT'] = '29500'
    # dist.init_process_group(backend, rank=rank, world_size=size)
    dist.init_process_group(
        backend=backend,
        init_method='tcp://{}:{}'.format(ip, port),
        rank=rank, world_size=size)
    fn(rank, size)
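For context, it is invoked from __main__ roughly like this (a simplified sketch of my argument handling; run is my training function):

import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--rank', type=int, required=True)
    parser.add_argument('--world-size', type=int, required=True)
    parser.add_argument('--ip', default='172.22.99.10')  # master address
    parser.add_argument('--port', default='20000')       # master port
    args = parser.parse_args()
    # This matches the call shown in the traceback above.
    init_process(args.rank, args.world_size, run, 'gloo', args.ip, args.port)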
- the master used in the Gloo backend: address 172.22.99.10, port 20000
- PyTorch version:
PyTorch version: 1.3.0a0+ee77ccb
Is debug build: No
CUDA used to build PyTorch: 10.1.243
OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: version 3.10.2
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: Tesla K40c
Nvidia driver version: 440.33.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.2
Versions of relevant libraries:
[pip] numpy==1.14.5
[conda] Could not collect
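One thing I'm considering, based on the torch.distributed documentation: since n104's interface is named ens1f1 while the other nodes use eth5 (see the ifconfig output above), I could pin the interface Gloo binds to with the GLOO_SOCKET_IFNAME environment variable, so hostname resolution can't pick an unexpected (possibly IPv6) address. A sketch, assuming a hypothetical per-node interface map and that the variable is set before init_process_group:

import os
import socket

# Hypothetical map from node name to the interface shown in ifconfig above.
IFNAMES = {'n100': 'eth5', 'n101': 'eth5', 'n102': 'eth5', 'n104': 'ens1f1'}

# Must be set before dist.init_process_group() is called.
os.environ['GLOO_SOCKET_IFNAME'] = IFNAMES[socket.gethostname()]

Would this be the right direction, or is there a better way to force a single address family across all nodes?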