'unhandled system error' when training with multiple nodes

I hit an error when using DDP for multi-node training (two nodes, two GPUs each) with the ‘nccl’ backend; it runs perfectly when I use ‘gloo’. The environment is Ubuntu 16.04 + Python 3.5 + PyTorch 1.5.0 + CUDA 10.1.
My code is based on the demo code from the official website for testing distributed training.

import os

import torch.distributed as dist
import torch.nn as nn

def setup(rank, world_size):
    os.environ['NCCL_DEBUG'] = 'INFO'          # verbose NCCL logging
    os.environ['NCCL_SOCKET_IFNAME'] = 'eno1'  # restrict NCCL to this NIC
    os.environ['NCCL_IB_DISABLE'] = '1'        # disable the InfiniBand transport
    dist.init_process_group(
        "nccl", rank=rank, init_method='tcp://162.105.146.176:22222', world_size=world_size)

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))
 ...
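
The elided part roughly follows the tutorial's demo_basic / run_demo pattern; a minimal sketch based on the official demo (the exact run_demo arguments in my script differ, as the traceback below shows):

import torch
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def demo_basic(rank, world_size):
    setup(rank, world_size)
    model = ToyModel()
    # wrapping the model in DDP is where the traceback below points
    ddp_model = DDP(model.to(rank), device_ids=[rank])
    loss_fn = nn.MSELoss()
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    loss_fn(outputs, torch.randn(20, 5).to(rank)).backward()
    dist.destroy_process_group()

def run_demo(demo_fn, world_size):
    # spawn one process per GPU on this node
    mp.spawn(demo_fn, args=(world_size,), nprocs=world_size, join=True)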

The error output when using ‘nccl’ is as follows:

ptwop-176:1755:1755 [0] NCCL INFO Bootstrap : Using [0]eno1:162.105.146.176<0>
ptwop-176:1755:1755 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

ptwop-176:1755:1755 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
ptwop-176:1755:1755 [0] NCCL INFO NET/Socket : Using [0]eno1:162.105.146.176<0>
NCCL version 2.4.8+cuda10.1
ptwop-176:1756:1756 [1] NCCL INFO Bootstrap : Using [0]eno1:162.105.146.176<0>
ptwop-176:1756:1756 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

ptwop-176:1756:1756 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
ptwop-176:1756:1756 [1] NCCL INFO NET/Socket : Using [0]eno1:162.105.146.176<0>
ptwop-176:1755:1870 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555
ptwop-176:1756:1871 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555

ptwop-176:1756:1871 [1] include/socket.h:390 NCCL WARN Connect to 162.105.146.178<35007> failed : No route to host
ptwop-176:1756:1871 [1] NCCL INFO bootstrap.cc:100 -> 2

ptwop-176:1756:1871 [1] NCCL INFO bootstrap.cc:337 -> 2
ptwop-176:1755:1869 [0] include/socket.h:390 NCCL WARN Connect to 162.105.146.178<54473> failed : No route to host
ptwop-176:1756:1871 [1] NCCL INFO init.cc:695 -> 2
ptwop-176:1755:1869 [0] NCCL INFO bootstrap.cc:100 -> 2
ptwop-176:1756:1871 [1] NCCL INFO init.cc:951 -> 2
ptwop-176:1755:1869 [0] NCCL INFO bootstrap.cc:226 -> 2
ptwop-176:1756:1871 [1] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
Traceback (most recent call last):
  File "test.py", line 73, in <module>
    run_demo(demo_basic, 2, 3)
  File "test.py", line 45, in run_demo
    join=True)
  File "/home/xukun/setsuna/lib/python3.5/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/xukun/setsuna/lib/python3.5/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/xukun/setsuna/lib/python3.5/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/xukun/setsuna/lib/python3.5/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/xukun/graph/multi_node/test.py", line 57, in demo_basic
    ddp_model = DDP(model.to(rank), device_ids=[rank])
  File "/home/xukun/setsuna/lib/python3.5/site-packages/torch/nn/parallel/distributed.py", line 285, in __init__
    self.broadcast_bucket_size)
  File "/home/xukun/setsuna/lib/python3.5/site-packages/torch/nn/parallel/distributed.py", line 483, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8

What can I do to avoid this error?

From this log message: "ptwop-176:1755:1869 [0] include/socket.h:390 NCCL WARN Connect to 162.105.146.178<54473> failed : No route to host", I’m assuming the other node’s IP is 162.105.146.178. Could you validate the following:

  1. See if the issue reproduces on a single-node multi-GPU setup.
  2. Can you ping 162.105.146.178 from 162.105.146.176? (Beyond plain ping, see the TCP connectivity sketch below.)
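
For item 2, a quick TCP check in addition to ping can confirm that the nodes can actually open connections to each other; the failing connections in your log target ports on 162.105.146.178. A minimal sketch (the ports NCCL opens are ephemeral, so this only tests general reachability; port 22 below is just a hypothetical example of a port you expect to be open on the other node):

import socket

def can_connect(host, port, timeout=3):
    # Returns True if a TCP connection to host:port succeeds.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError as exc:
        print("connect to {}:{} failed: {}".format(host, port, exc))
        return False

# run from 162.105.146.176:
print(can_connect('162.105.146.176', 22222))  # rendezvous port on this node
print(can_connect('162.105.146.178', 22))     # e.g. ssh on the other node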

Yes, the nodes’ IPs are 162.105.146.178 and 162.105.146.176.

  1. The issue does not reproduce on a single-node multi-GPU setup; everything runs well there.
  2. The two nodes can ping each other successfully, and training runs well when I switch the communication backend from “nccl” to “gloo” with almost no code changes (apart from keeping the tensors and the model on the CPU); the gloo variant of setup is sketched below.
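
A rough sketch of the working gloo setup (assumption: GLOO_SOCKET_IFNAME pins gloo to the same NIC I used for NCCL):

def setup(rank, world_size):
    os.environ['GLOO_SOCKET_IFNAME'] = 'eno1'   # same NIC as before
    dist.init_process_group(
        "gloo", rank=rank, init_method='tcp://162.105.146.176:22222', world_size=world_size)

# and in demo_basic the model and tensors stay on the CPU:
#     ddp_model = DDP(ToyModel())   # no .to(rank), no device_ids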

Thanks for validating. Which version of NCCL are you using? Could you also verify that NCCL is installed correctly on both nodes by running the tests from https://github.com/NVIDIA/nccl-tests?

Here’s one way to check whether (and which version of) NCCL is installed on a node:

locate nccl| grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'
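
Note that the library locate finds is not necessarily the one PyTorch uses; the PyTorch binaries typically ship their own NCCL build (your log shows 2.4.8). A quick sketch to read that bundled version from Python:

import torch

# NCCL version PyTorch was built with (should print 2408, an int, on PyTorch 1.5.0).
print(torch.cuda.nccl.version())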