Hi,
I am running a simple application across two machines with two GPUs each, and it is throwing an error. The same application works fine on a single machine with two GPUs.
The NCCL debug output (NCCL_DEBUG=INFO) is below:
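For context, the setup follows the usual multi-node mp.spawn pattern (a sketch of my assumed configuration, not the actual conv_dist.py, which is not shown here): each node spawns one process per local GPU, and the global rank is derived from the node index and the local GPU index.

```python
# Sketch of the usual multi-node rank arithmetic (assumed, not the actual
# conv_dist.py): global rank = node_index * gpus_per_node + local_gpu.
def global_rank(node_index, local_gpu, gpus_per_node):
    """Global rank of the process driving `local_gpu` on node `node_index`."""
    return node_index * gpus_per_node + local_gpu

# Two machines with 2 GPUs each -> world size 4, ranks 0..3.
world_size = 2 * 2
ranks = [global_rank(n, g, 2) for n in range(2) for g in range(2)]
# ranks == [0, 1, 2, 3]
```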
dml4:26072:26072 [1] NCCL INFO Bootstrap : Using [0]XXXXXX<0> [1]enp0s20f0u1u6:169.254.95.120<0> [2]virbr0:192.168.122.1<0>
dml4:26071:26071 [0] NCCL INFO Bootstrap : Using [0]XXXXX<0> [1]enp0s20f0u1u6:169.254.95.120<0> [2]virbr0:XXXXX<0>
dml4:26072:26072 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
dml4:26071:26071 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
dml4:26072:26072 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB enp88s0:9.1.44.100<0>
dml4:26071:26071 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB enp88s0:9.1.44.100<0>
dml4:26072:26240 [1] NCCL INFO Setting affinity for GPU 1 to ffff,f00000ff,fff00000
dml4:26071:26242 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
dml4:26072:26240 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance : SYS
dml4:26071:26242 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance : NODE
dml4:26071:26242 [0] NCCL INFO Ring 00 : 1 -> 2 [receive] via NET/IB/0
dml4:26071:26242 [0] NCCL INFO Ring 00 : 2[0] -> 3[1] via direct shared memory
dml4:26072:26240 [1] NCCL INFO Ring 00 : 3 -> 0 [send] via NET/IB/0
dml4:26072:26240 [1] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
dml4:26072:26240 [1] NCCL INFO transport/net_ib.cc:601 -> 2
dml4:26072:26240 [1] NCCL INFO include/net.h:24 -> 2
dml4:26072:26240 [1] NCCL INFO transport/net.cc:360 -> 2
dml4:26072:26240 [1] NCCL INFO init.cc:669 -> 2
dml4:26072:26240 [1] NCCL INFO init.cc:815 -> 2
dml4:26072:26240 [1] NCCL INFO init.cc:951 -> 2
dml4:26072:26240 [1] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
dml4:26071:26242 [0] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
dml4:26071:26242 [0] NCCL INFO transport/net_ib.cc:601 -> 2
dml4:26071:26242 [0] NCCL INFO include/net.h:24 -> 2
dml4:26071:26242 [0] NCCL INFO transport/net.cc:388 -> 2
dml4:26071:26242 [0] NCCL INFO init.cc:679 -> 2
dml4:26071:26242 [0] NCCL INFO init.cc:815 -> 2
dml4:26071:26242 [0] NCCL INFO init.cc:951 -> 2
dml4:26071:26242 [0] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
Traceback (most recent call last):
  File "conv_dist.py", line 118, in <module>
    main()
  File "conv_dist.py", line 51, in main
    mp.spawn(train, nprocs=args.gpus, args=(args,), join=True)
  File "/work/tools/envs/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/work/tools/envs/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/work/tools/envs/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/work/tools/envs/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/us4j4248/pt_dist/conv_dist.py", line 75, in train
    model = DDP(model, device_ids=[gpu])
  File "/work/tools/envs/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 285, in __init__
    self.broadcast_bucket_size)
  File "/work/tools/envs/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 496, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1591914838379/work/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8
P.S. I have redacted the IP addresses in the log above.
Thanks