RuntimeError: NCCL Error 3: internal error

Hi,
I am using DataParallel in an SGE environment with cudatoolkit 10.1 and NCCL version 2.7. I get the following error:

/miniconda3/envs/cs21/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 56, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)

RuntimeError: NCCL Error 3: internal error

The NCCL debug info is as follows:

node20:22347:22347 [0] NCCL INFO Bootstrap : Using [0]ib0:10.10.9.220<0>
node20:22347:22347 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
node20:22347:22347 [0] NCCL INFO NET/IB : Using [0]qib0:1/IB ; OOB ib0:10.10.9.220<0>
node20:22347:22347 [0] NCCL INFO Using network IB
NCCL version 2.7.8+cuda10.1

node20:22347:22436 [1] graph/xml.h:77 NCCL WARN Attribute class of node nic not found
node20:22347:22436 [1] NCCL INFO graph/topo.cc:312 -> 3
node20:22347:22436 [1] NCCL INFO graph/topo.cc:348 -> 3
node20:22347:22436 [1] NCCL INFO graph/topo.cc:395 -> 3
node20:22347:22436 [1] NCCL INFO graph/topo.cc:467 -> 3
node20:22347:22436 [1] NCCL INFO graph/topo.cc:570 -> 3

node20:22347:22435 [0] graph/xml.h:77 NCCL WARN Attribute class of node nic not found
node20:22347:22436 [1] NCCL INFO init.cc:581 -> 3
node20:22347:22435 [0] NCCL INFO graph/topo.cc:312 -> 3
node20:22347:22436 [1] NCCL INFO init.cc:840 -> 3
node20:22347:22435 [0] NCCL INFO graph/topo.cc:348 -> 3
node20:22347:22435 [0] NCCL INFO graph/topo.cc:395 -> 3
node20:22347:22435 [0] NCCL INFO graph/topo.cc:467 -> 3
node20:22347:22435 [0] NCCL INFO graph/topo.cc:570 -> 3
node20:22347:22435 [0] NCCL INFO init.cc:581 -> 3
node20:22347:22435 [0] NCCL INFO init.cc:840 -> 3
node20:22347:22436 [1] NCCL INFO group.cc:73 -> 3 [Async thread]
node20:22347:22435 [0] NCCL INFO group.cc:73 -> 3 [Async thread]
node20:22347:22347 [0] NCCL INFO init.cc:906 -> 3

Does anyone have any idea what this issue might be caused by?

Cheers in advance

NCCL error 3 seems to be either a bug in NCCL or some memory corruption (see the "Types" page of the NCCL 2.8.3 documentation). Maybe you can create an issue on the NVIDIA/nccl GitHub repository to see if the NCCL team has some guidelines on how to debug this.
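
For reading the debug output in the question, here is a rough sketch that maps the trailing number in lines such as "graph/topo.cc:312 -> 3" to a result name. The table is my assumption of the ncclResult_t values in nccl.h around NCCL 2.7; the helper name explain_trace_line and the regular expression are purely illustrative and not part of NCCL or PyTorch.

```python
import re

# Assumed ncclResult_t values (nccl.h, NCCL 2.7 era); newer releases add more codes.
NCCL_RESULTS = {
    0: "ncclSuccess",
    1: "ncclUnhandledCudaError",
    2: "ncclSystemError",
    3: "ncclInternalError",
    4: "ncclInvalidArgument",
    5: "ncclInvalidUsage",
}

# Matches the "NCCL INFO <file>:<line> -> <code>" trace lines in the log above.
_TRACE = re.compile(r"NCCL INFO (\S+):(\d+) -> (\d+)")


def explain_trace_line(line):
    """Return (source_file, source_line, result_name), or None for other log lines."""
    match = _TRACE.search(line)
    if match is None:
        return None
    source_file = match.group(1)
    source_line = int(match.group(2))
    code = int(match.group(3))
    return source_file, source_line, NCCL_RESULTS.get(code, "unknown")


if __name__ == "__main__":
    print(explain_trace_line(
        "node20:22347:22436 [1] NCCL INFO graph/topo.cc:312 -> 3"
    ))
```

Read this way, the "-> 3" lines look like the same internal error being passed up the call chain, so the NCCL WARN line just before them (the missing "class" attribute on the nic node) is probably the place to start looking.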


To follow up, I think I actually had two issues. First, I had to set:

export NCCL_SOCKET_IFNAME=<VALUE>
export NCCL_IB_DISABLE=1

Replace <VALUE> with your relevant network interface (use ifconfig to find it). I think my second issue was using a dataloader with multiple workers when I hadn't allocated enough processes to the job in my job submission.
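
In case it helps someone else, here is a minimal sketch of both fixes done from inside the Python script. It assumes that SGE exports the granted slot count as NSLOTS and that the NCCL variables are set before anything initializes NCCL. The helper names nccl_socket_env and allowed_workers are made up for illustration, and "eth0" in the usage note is only a placeholder for your own interface.

```python
import os


def nccl_socket_env(ifname):
    """Return the two settings from the post; ifname is whatever ifconfig shows."""
    return {"NCCL_SOCKET_IFNAME": ifname, "NCCL_IB_DISABLE": "1"}


def allowed_workers(requested, env=None):
    """Cap the DataLoader worker count at the CPU slots the job was given."""
    env = os.environ if env is None else env
    slots = env.get("NSLOTS")  # assumption: set by SGE for the running job
    if slots is not None and slots.isdigit():
        available = int(slots)
    elif hasattr(os, "sched_getaffinity"):
        # Linux only: CPUs this process is actually allowed to run on.
        available = len(os.sched_getaffinity(0))
    else:
        available = os.cpu_count() or 1
    # Keep one slot for the main training process; 0 means load in-process.
    return max(0, min(requested, available - 1))


if __name__ == "__main__":
    print(nccl_socket_env("eth0"))
    print(allowed_workers(8, {"NSLOTS": "4"}))
```

Usage would be something like os.environ.update(nccl_socket_env("eth0")) at the very top of the script, before the model is wrapped in DataParallel, and then passing num_workers=allowed_workers(8) to the DataLoader. Reserving one slot for the main process is my own choice, not something the scheduler requires.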
