My train.py works fine on a single node when I run:

$ python -m torch.distributed.launch --nproc_per_node=2 train.py

However, when I try to run it across two nodes, it fails. The debug log is below:
$ NCCL_DEBUG=TRACE python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="192.168.144.160" --master_port=1234 train.py acgtyrant aa
WARNING:root:commit_id exists, so backup commit_id in model_directory
WARNING:root:override commit_id in model_directory
INFO:root:config.py exists, so backup config in model_directory
INFO:root:override config in model_directory
aa:10367:10367 [0] INFO NET : Using interface enp3s0:192.168.144.160<0>
aa:10367:10367 [0] INFO NET/IB : Using interface enp3s0 for sideband communication
aa:10367:10367 [0] INFO Using internal Network Socket
aa:10367:10367 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
aa:10367:10367 [0] INFO NET : Using interface enp3s0:192.168.144.160<0>
aa:10367:10367 [0] INFO NET : Using interface docker0:172.17.0.1<0>
aa:10367:10367 [0] INFO NET/Socket : 2 interfaces found
NCCL version 2.2.13+cuda9.0
aa:10367:10881 [0] INFO comm 0x7f8afc0b0cc0 rank 0 nranks 4
aa:10368:10368 [1] INFO NET : Using interface enp3s0:192.168.144.160<0>
aa:10368:10368 [1] INFO NET/IB : Using interface enp3s0 for sideband communication
aa:10368:10368 [1] INFO Using internal Network Socket
aa:10368:10368 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
aa:10368:10368 [1] INFO NET : Using interface enp3s0:192.168.144.160<0>
aa:10368:10368 [1] INFO NET : Using interface docker0:172.17.0.1<0>
aa:10368:10368 [1] INFO NET/Socket : 2 interfaces found
aa:10368:10886 [1] INFO comm 0x7f158c0b0cc0 rank 1 nranks 4
aa:10367:10881 [0] INFO Using 256 threads
aa:10367:10881 [0] INFO Min Comp Cap 5
aa:10367:10881 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
aa:10367:10881 [0] INFO Ring 00 : 0 1 2 3
aa:10367:10881 [0] INFO Ring 01 : 0 1 2 3
aa:10367:10881 [0] INFO 3 -> 0 via NET/Socket/0
aa:10367:10881 [0] INFO 0[10367] -> 1[10368] via direct shared memory
aa:10367:10881 [0] INFO 3 -> 0 via NET/Socket/1
aa:10367:10881 [0] INFO 0[10367] -> 1[10368] via direct shared memory
aa:10368:10886 [1] include/socket.h:360 WARN Call to connect timeout : Connection refused
aa:10368:10886 [1] INFO transport/net_socket.cu:118 -> 2
aa:10368:10886 [1] INFO include/net.h:32 -> 2 [Net]
aa:10368:10886 [1] INFO transport/net.cu:266 -> 2
aa:10368:10886 [1] INFO init.cu:475 -> 2
aa:10368:10886 [1] INFO init.cu:536 -> 2
aa:10368:10886 [1] INFO misc/group.cu:70 -> 2 [Async thread]
Traceback (most recent call last):
  File "train.py", line 285, in <module>
    main()
  File "train.py", line 243, in main
    model, device_ids=[local_rank], output_device=local_rank)
  File "/mapbar/acgtyrant/Projects/drn/.env/lib/python3.5/site-packages/torch/nn/parallel/deprecated/distributed.py", line 135, in __init__
    self.broadcast_bucket_size)
  File "/mapbar/acgtyrant/Projects/drn/.env/lib/python3.5/site-packages/torch/nn/parallel/deprecated/distributed.py", line 252, in _dist_broadcast_coalesced
    dist.broadcast(flat_tensors, 0)
  File "/mapbar/acgtyrant/Projects/drn/.env/lib/python3.5/site-packages/torch/distributed/deprecated/__init__.py", line 286, in broadcast
    return torch._C._dist_broadcast(tensor, src, group)
RuntimeError: NCCL error in: /pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:301, unhandled system error
Any idea? Thank you!
Line 243 of train.py is:

paralled_model = torch.nn.parallel.deprecated.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
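For context, local_rank comes from the --local_rank argument that torch.distributed.launch appends to each worker's command line; a minimal sketch (simplified, not my full train.py) of how it is parsed:

```python
# Simplified sketch: torch.distributed.launch starts nproc_per_node worker
# processes and passes each one --local_rank=<i>, which train.py then feeds
# into device_ids/output_device when wrapping the model in DDP.
import argparse

def parse_local_rank(argv):
    """Parse the --local_rank flag the launcher appends to the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _ = parser.parse_known_args(argv)
    return args.local_rank

print(parse_local_rank(["--local_rank=1"]))  # prints 1
```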
The same error also occurs when I use torch.distributed instead of torch.distributed.deprecated:
$ CUDA_VISIBLE_DEVICE=1 NCCL_DEBUG=TRACE python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="192.168.144.160" --master_port=1234 train.py
WARNING:root:commit_id exists, so backup commit_id in model_directory
WARNING:root:override commit_id in model_directory
INFO:root:config.py exists, so backup config in model_directory
INFO:root:override config in model_directory
aa:15921:15921 [0] INFO NET : Using interface enp3s0:192.168.144.160<0>
aa:15921:15921 [0] INFO NET/IB : Using interface enp3s0 for sideband communication
aa:15921:15921 [0] INFO Using internal Network Socket
aa:15921:15921 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
aa:15921:15921 [0] INFO NET : Using interface enp3s0:192.168.144.160<0>
aa:15921:15921 [0] INFO NET : Using interface docker0:172.17.0.1<0>
aa:15921:15921 [0] INFO NET/Socket : 2 interfaces found
NCCL version 2.2.13+cuda9.0
aa:15921:15959 [0] INFO comm 0x7fe0600b0cc0 rank 0 nranks 2
aa:15921:15959 [0] INFO Using 256 threads
aa:15921:15959 [0] INFO Min Comp Cap 5
aa:15921:15959 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
aa:15921:15959 [0] INFO Ring 00 : 0 1
aa:15921:15959 [0] INFO Ring 01 : 0 1
aa:15921:15959 [0] INFO 1 -> 0 via NET/Socket/0
aa:15921:15959 [0] INFO 1 -> 0 via NET/Socket/1
aa:15921:15959 [0] include/socket.h:360 WARN Call to connect timeout : Connection refused
aa:15921:15959 [0] INFO transport/net_socket.cu:118 -> 2
aa:15921:15959 [0] INFO include/net.h:32 -> 2 [Net]
aa:15921:15959 [0] INFO transport/net.cu:266 -> 2
aa:15921:15959 [0] INFO init.cu:475 -> 2
aa:15921:15959 [0] INFO init.cu:536 -> 2
aa:15921:15959 [0] INFO misc/group.cu:70 -> 2 [Async thread]
Traceback (most recent call last):
  File "train.py", line 285, in <module>
    main()
  File "train.py", line 243, in main
    model, device_ids=[local_rank], output_device=local_rank)
  File "/mapbar/acgtyrant/Projects/drn/.env/lib/python3.5/site-packages/torch/nn/parallel/distributed.py", line 150, in __init__
    self.broadcast_bucket_size)
  File "/mapbar/acgtyrant/Projects/drn/.env/lib/python3.5/site-packages/torch/nn/parallel/distributed.py", line 264, in _dist_broadcast_coalesced
    dist._dist_broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:271, unhandled system error