I ran into an error when using DDP for multi-node training (two nodes, two GPUs each) with the ‘nccl’ backend (it runs perfectly when I use ‘gloo’). The environment is Ubuntu 16.04 + Python 3.5 + PyTorch 1.5.0 + CUDA 10.1.
My code is based on the demo code from the official website for testing distributed training.
import os
import torch.distributed as dist
import torch.nn as nn

def setup(rank, world_size):
    # Turn on NCCL debug logging (it produced the output below), pin NCCL
    # to the eno1 Ethernet interface, and disable the InfiniBand transport.
    os.environ['NCCL_DEBUG'] = 'INFO'
    os.environ['NCCL_SOCKET_IFNAME'] = 'eno1'
    os.environ['NCCL_IB_DISABLE'] = '1'
    dist.init_process_group(
        "nccl", rank=rank,
        init_method='tcp://162.105.146.176:22222',
        world_size=world_size)

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))
...
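The omitted part follows the official DDP tutorial; reconstructed from the traceback below, it looks roughly like this (the traceback shows my actual run_demo takes different arguments, and the per-node rank offsetting for the second node is elided, so treat the exact signatures as assumptions):

import torch
import torch.multiprocessing as mp
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

def demo_basic(rank, world_size):
    setup(rank, world_size)
    model = ToyModel()
    # This is the line that fails below: on construction, DDP broadcasts
    # the initial parameters from rank 0 to all ranks over NCCL.
    ddp_model = DDP(model.to(rank), device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    loss_fn(outputs, torch.randn(20, 5).to(rank)).backward()
    optimizer.step()

def run_demo(demo_fn, world_size):
    # Spawns one process per GPU on this node (two per node).
    mp.spawn(demo_fn, args=(world_size,), nprocs=2, join=True)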
The error output when using ‘nccl’ is as follows:
ptwop-176:1755:1755 [0] NCCL INFO Bootstrap : Using [0]eno1:162.105.146.176<0>
ptwop-176:1755:1755 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
ptwop-176:1755:1755 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
ptwop-176:1755:1755 [0] NCCL INFO NET/Socket : Using [0]eno1:162.105.146.176<0>
NCCL version 2.4.8+cuda10.1
ptwop-176:1756:1756 [1] NCCL INFO Bootstrap : Using [0]eno1:162.105.146.176<0>
ptwop-176:1756:1756 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
ptwop-176:1756:1756 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
ptwop-176:1756:1756 [1] NCCL INFO NET/Socket : Using [0]eno1:162.105.146.176<0>
ptwop-176:1755:1870 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555
ptwop-176:1756:1871 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555
ptwop-176:1756:1871 [1] include/socket.h:390 NCCL WARN Connect to 162.105.146.178<35007> failed : No route to host
ptwop-176:1756:1871 [1] NCCL INFO bootstrap.cc:100 -> 2
ptwop-176:1756:1871 [1] NCCL INFO bootstrap.cc:337 -> 2
ptwop-176:1755:1869 [0] include/socket.h:390 NCCL WARN Connect to 162.105.146.178<54473> failed : No route to host
ptwop-176:1756:1871 [1] NCCL INFO init.cc:695 -> 2
ptwop-176:1755:1869 [0] NCCL INFO bootstrap.cc:100 -> 2
ptwop-176:1756:1871 [1] NCCL INFO init.cc:951 -> 2
ptwop-176:1755:1869 [0] NCCL INFO bootstrap.cc:226 -> 2
ptwop-176:1756:1871 [1] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
Traceback (most recent call last):
File "test.py", line 73, in <module>
run_demo(demo_basic, 2, 3)
File "test.py", line 45, in run_demo
join=True)
File "/home/xukun/setsuna/lib/python3.5/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/xukun/setsuna/lib/python3.5/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/xukun/setsuna/lib/python3.5/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/xukun/setsuna/lib/python3.5/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/xukun/graph/multi_node/test.py", line 57, in demo_basic
ddp_model = DDP(model.to(rank), device_ids=[rank])
File "/home/xukun/setsuna/lib/python3.5/site-packages/torch/nn/parallel/distributed.py", line 285, in __init__
self.broadcast_bucket_size)
File "/home/xukun/setsuna/lib/python3.5/site-packages/torch/nn/parallel/distributed.py", line 483, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8
What can I do to avoid this error?
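My guess from the log is that the rendezvous on port 22222 succeeds (init_process_group returns and NCCL starts up), but the NCCL bootstrap connection to the other node on an ephemeral port (35007 / 54473 above) is blocked, perhaps by a firewall, hence ‘No route to host’. One quick way to test that theory is a raw TCP connection between the nodes (a minimal sketch; 35007 just mirrors the port from the log, NCCL actually picks it at random):

import socket

# First, on 162.105.146.178, listen on an arbitrary high port:
#   python -c "import socket; s = socket.socket(); s.bind(('', 35007)); s.listen(1); s.accept()"

# Then, on this node (162.105.146.176), try to connect:
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(5)
try:
    s.connect(('162.105.146.178', 35007))
    print('TCP reachable')
except OSError as e:
    print('blocked:', e)  # 'No route to host' here would match the NCCL warning
finally:
    s.close()

If that also fails, I suppose the next step would be opening the relevant port range between the nodes, but I'd like to confirm this is the right diagnosis.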