My train.py works fine on a single node when I run:

$ python -m torch.distributed.launch --nproc_per_node=2 train.py

However, when I try to run it across two nodes, it fails. The debug log is below:
$ NCCL_DEBUG=TRACE python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="192.168.144.160" --master_port=1234 train.py acgtyrant aa
WARNING:root:commit_id exists, so backup commit_id in model_directory
WARNING:root:override commit_id in model_directory
INFO:root:config.py exists, so backup config in model_directory
INFO:root:override config in model_directory
aa:10367:10367 [0] INFO NET : Using interface enp3s0:192.168.144.160<0>
aa:10367:10367 [0] INFO NET/IB : Using interface enp3s0 for sideband communication
aa:10367:10367 [0] INFO Using internal Network Socket
aa:10367:10367 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
aa:10367:10367 [0] INFO NET : Using interface enp3s0:192.168.144.160<0>
aa:10367:10367 [0] INFO NET : Using interface docker0:172.17.0.1<0>
aa:10367:10367 [0] INFO NET/Socket : 2 interfaces found
NCCL version 2.2.13+cuda9.0
aa:10367:10881 [0] INFO comm 0x7f8afc0b0cc0 rank 0 nranks 4
aa:10368:10368 [1] INFO NET : Using interface enp3s0:192.168.144.160<0>
aa:10368:10368 [1] INFO NET/IB : Using interface enp3s0 for sideband communication
aa:10368:10368 [1] INFO Using internal Network Socket
aa:10368:10368 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
aa:10368:10368 [1] INFO NET : Using interface enp3s0:192.168.144.160<0>
aa:10368:10368 [1] INFO NET : Using interface docker0:172.17.0.1<0>
aa:10368:10368 [1] INFO NET/Socket : 2 interfaces found
aa:10368:10886 [1] INFO comm 0x7f158c0b0cc0 rank 1 nranks 4
aa:10367:10881 [0] INFO Using 256 threads
aa:10367:10881 [0] INFO Min Comp Cap 5
aa:10367:10881 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
aa:10367:10881 [0] INFO Ring 00 : 0 1 2 3
aa:10367:10881 [0] INFO Ring 01 : 0 1 2 3
aa:10367:10881 [0] INFO 3 -> 0 via NET/Socket/0
aa:10367:10881 [0] INFO 0[10367] -> 1[10368] via direct shared memory
aa:10367:10881 [0] INFO 3 -> 0 via NET/Socket/1
aa:10367:10881 [0] INFO 0[10367] -> 1[10368] via direct shared memory
aa:10368:10886 [1] include/socket.h:360 WARN Call to connect timeout : Connection refused
aa:10368:10886 [1] INFO transport/net_socket.cu:118 -> 2
aa:10368:10886 [1] INFO include/net.h:32 -> 2 [Net]
aa:10368:10886 [1] INFO transport/net.cu:266 -> 2
aa:10368:10886 [1] INFO init.cu:475 -> 2
aa:10368:10886 [1] INFO init.cu:536 -> 2
aa:10368:10886 [1] INFO misc/group.cu:70 -> 2 [Async thread]
Traceback (most recent call last):
  File "train.py", line 285, in <module>
    main()
  File "train.py", line 243, in main
    model, device_ids=[local_rank], output_device=local_rank)
  File "/mapbar/acgtyrant/Projects/drn/.env/lib/python3.5/site-packages/torch/nn/parallel/deprecated/distributed.py", line 135, in __init__
    self.broadcast_bucket_size)
  File "/mapbar/acgtyrant/Projects/drn/.env/lib/python3.5/site-packages/torch/nn/parallel/deprecated/distributed.py", line 252, in _dist_broadcast_coalesced
    dist.broadcast(flat_tensors, 0)
  File "/mapbar/acgtyrant/Projects/drn/.env/lib/python3.5/site-packages/torch/distributed/deprecated/__init__.py", line 286, in broadcast
    return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: NCCL error in: /pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:301, unhandled system error
Any idea? Thank you!
Line 243 of train.py is:

paralled_model = torch.nn.parallel.deprecated.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
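For context, local_rank comes from the --local_rank argument that torch.distributed.launch appends to each worker's command line; a minimal sketch (simplified, not my full train.py) of how it is parsed:

```python
# Simplified sketch: torch.distributed.launch starts nproc_per_node worker
# processes and passes each one --local_rank=<i>, which train.py then feeds
# into device_ids/output_device when wrapping the model in DDP.
import argparse

def parse_local_rank(argv):
    """Parse the --local_rank flag the launcher appends to the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _ = parser.parse_known_args(argv)
    return args.local_rank

print(parse_local_rank(["--local_rank=1"]))  # prints 1
```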
The same error also occurs when I use torch.distributed instead of torch.distributed.deprecated:
$ CUDA_VISIBLE_DEVICE=1 NCCL_DEBUG=TRACE python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="192.168.144.160" --master_port=1234 train.py
WARNING:root:commit_id exists, so backup commit_id in model_directory
WARNING:root:override commit_id in model_directory
INFO:root:config.py exists, so backup config in model_directory
INFO:root:override config in model_directory
aa:15921:15921 [0] INFO NET : Using interface enp3s0:192.168.144.160<0>
aa:15921:15921 [0] INFO NET/IB : Using interface enp3s0 for sideband communication
aa:15921:15921 [0] INFO Using internal Network Socket
aa:15921:15921 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
aa:15921:15921 [0] INFO NET : Using interface enp3s0:192.168.144.160<0>
aa:15921:15921 [0] INFO NET : Using interface docker0:172.17.0.1<0>
aa:15921:15921 [0] INFO NET/Socket : 2 interfaces found
NCCL version 2.2.13+cuda9.0
aa:15921:15959 [0] INFO comm 0x7fe0600b0cc0 rank 0 nranks 2
aa:15921:15959 [0] INFO Using 256 threads
aa:15921:15959 [0] INFO Min Comp Cap 5
aa:15921:15959 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
aa:15921:15959 [0] INFO Ring 00 : 0 1
aa:15921:15959 [0] INFO Ring 01 : 0 1
aa:15921:15959 [0] INFO 1 -> 0 via NET/Socket/0
aa:15921:15959 [0] INFO 1 -> 0 via NET/Socket/1
aa:15921:15959 [0] include/socket.h:360 WARN Call to connect timeout : Connection refused
aa:15921:15959 [0] INFO transport/net_socket.cu:118 -> 2
aa:15921:15959 [0] INFO include/net.h:32 -> 2 [Net]
aa:15921:15959 [0] INFO transport/net.cu:266 -> 2
aa:15921:15959 [0] INFO init.cu:475 -> 2
aa:15921:15959 [0] INFO init.cu:536 -> 2
aa:15921:15959 [0] INFO misc/group.cu:70 -> 2 [Async thread]
Traceback (most recent call last):
  File "train.py", line 285, in <module>
    main()
  File "train.py", line 243, in main
    model, device_ids=[local_rank], output_device=local_rank)
  File "/mapbar/acgtyrant/Projects/drn/.env/lib/python3.5/site-packages/torch/nn/parallel/distributed.py", line 150, in __init__
    self.broadcast_bucket_size)
  File "/mapbar/acgtyrant/Projects/drn/.env/lib/python3.5/site-packages/torch/nn/parallel/distributed.py", line 264, in _dist_broadcast_coalesced
    dist._dist_broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:271, unhandled system error