Hi,
I am using DataParallel in an SGE environment with cudatoolkit 10.1 and NCCL version 2.7, and I get the following error:
/miniconda3/envs/cs21/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 56, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 3: internal error
The NCCL debug info is as follows:
node20:22347:22347 [0] NCCL INFO Bootstrap : Using [0]ib0:10.10.9.220<0>
node20:22347:22347 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
node20:22347:22347 [0] NCCL INFO NET/IB : Using [0]qib0:1/IB ; OOB ib0:10.10.9.220<0>
node20:22347:22347 [0] NCCL INFO Using network IB
NCCL version 2.7.8+cuda10.1
node20:22347:22436 [1] graph/xml.h:77 NCCL WARN Attribute class of node nic not found
node20:22347:22436 [1] NCCL INFO graph/topo.cc:312 -> 3
node20:22347:22436 [1] NCCL INFO graph/topo.cc:348 -> 3
node20:22347:22436 [1] NCCL INFO graph/topo.cc:395 -> 3
node20:22347:22436 [1] NCCL INFO graph/topo.cc:467 -> 3
node20:22347:22436 [1] NCCL INFO graph/topo.cc:570 -> 3
node20:22347:22435 [0] graph/xml.h:77 NCCL WARN Attribute class of node nic not found
node20:22347:22436 [1] NCCL INFO init.cc:581 -> 3
node20:22347:22435 [0] NCCL INFO graph/topo.cc:312 -> 3
node20:22347:22436 [1] NCCL INFO init.cc:840 -> 3
node20:22347:22435 [0] NCCL INFO graph/topo.cc:348 -> 3
node20:22347:22435 [0] NCCL INFO graph/topo.cc:395 -> 3
node20:22347:22435 [0] NCCL INFO graph/topo.cc:467 -> 3
node20:22347:22435 [0] NCCL INFO graph/topo.cc:570 -> 3
node20:22347:22435 [0] NCCL INFO init.cc:581 -> 3
node20:22347:22435 [0] NCCL INFO init.cc:840 -> 3
node20:22347:22436 [1] NCCL INFO group.cc:73 -> 3 [Async thread]
node20:22347:22435 [0] NCCL INFO group.cc:73 -> 3 [Async thread]
node20:22347:22347 [0] NCCL INFO init.cc:906 -> 3
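For anyone trying to reproduce this: log lines like the ones above are what NCCL prints when the NCCL_DEBUG environment variable is set to INFO. Below is a minimal sketch of one way to set it from Python before launching a job. The helper name nccl_debug_env is hypothetical and not part of PyTorch or NCCL. NCCL_DEBUG_SUBSYS is optional and only narrows which subsystems are logged.

```python
import os


def nccl_debug_env(base=None, subsys=None):
    """Return a copy of an environment with NCCL debug logging enabled.

    base:   mapping to copy (defaults to the current process environment).
    subsys: optional comma-separated NCCL subsystem filter, e.g. "INIT,GRAPH".
            Left unset by default so NCCL uses its own default filter.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"  # print NCCL INFO and WARN lines
    if subsys is not None:
        env["NCCL_DEBUG_SUBSYS"] = subsys
    return env


if __name__ == "__main__":
    # Show only the NCCL-related variables that would be passed on.
    for key, value in sorted(nccl_debug_env(subsys="INIT,GRAPH").items()):
        if key.startswith("NCCL_"):
            print(f"{key}={value}")
```

The returned mapping can be passed as the env argument of subprocess.run when starting the training script, or the same variables can be exported in the SGE job script before calling python.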
Does anyone have any idea what might be causing this?
Cheers in advance