Intel E810 RoCE NCCL unhandled system error

I was trying to start distributed training across 2 machines using RoCE.
Environment:

  • Ubuntu 20.04
  • pytorch torch==1.10.1+cu113

The job is launched via Slurm:

#!/bin/bash
#SBATCH --gres=gpu:2
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4

srun python train.py
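
For context, train.py sets up the process group from the SLURM environment roughly along these lines; this is a simplified sketch rather than the real script, and it assumes the default env:// rendezvous with MASTER_ADDR/MASTER_PORT exported in the sbatch script:

# Simplified sketch of the process-group setup in train.py (assumes the
# default env:// rendezvous, with MASTER_ADDR/MASTER_PORT exported in the
# sbatch script; details may differ from the real code).
import os
import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)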

And the error I hit is:

RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

Line 957 of torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp at v1.10.1 (pytorch/pytorch on GitHub) is:

C10D_NCCL_CHECK(ncclGroupEnd(), c10::nullopt);

Training works if I set NCCL_IB_DISABLE=1, which suggests this is a RoCE-related issue.
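
For reference, the workaround is just this, set before any NCCL communication happens (or the equivalent export in the sbatch script):

# Workaround: disable the InfiniBand/RoCE transport so NCCL falls back to
# plain TCP sockets. NCCL reads this when it initializes, so it has to be
# set before the first collective / DDP construction.
import os
os.environ["NCCL_IB_DISABLE"] = "1"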

I was wondering if I'm looking in the correct place. Could anyone help me with this issue?

Can you re-run the script with the environment variable NCCL_DEBUG=INFO to see if the NCCL debug log reports any errors that could be helpful here?
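
Something like this at the top of train.py (or the equivalent export in the sbatch script) should be enough to turn the logging on:

# NCCL reads these when it initializes, so set them before the first
# collective call. NCCL_DEBUG_SUBSYS is optional and makes the output
# considerably more verbose.
import os
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # optional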

NCCL_DEBUG=INFO doesn't give any useful information. NCCL_DEBUG_SUBSYS=ALL gives some memory-allocation information, which doesn't help much either:

  File "train.py", line 226, in train
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:401 Mem Alloc Size 4 pointer 0x7fa25c0a1210
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:402 Mem Alloc Size 4 pointer 0x7fa25c0a11f0
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:452 Mem Alloc Size 4 pointer 0x7fa25c0a1170
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:401 Mem Alloc Size 4 pointer 0x7fa25c0a11f0
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:402 Mem Alloc Size 4 pointer 0x7fa25c0a1210
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:452 Mem Alloc Size 4 pointer 0x7fa25c0a1170
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:401 Mem Alloc Size 4 pointer 0x7fa25c0a1210
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:402 Mem Alloc Size 4 pointer 0x7fa25c0a11f0
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:452 Mem Alloc Size 4 pointer 0x7fa25c0a1170
    ddp_model = DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
  File "/mnt/anaconda/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
slurm-2:2498277:2498586 [0] NCCL INFO graph/search.cc:402 Mem Alloc Size 4 pointer 0x7f47040a11d0
slurm-2:2498277:2498586 [0] NCCL INFO graph/search.cc:452 Mem Alloc Size 4 pointer 0x7f47040a1170
slurm-2:2498277:2498586 [0] NCCL INFO graph/search.cc:401 Mem Alloc Size 4 pointer 0x7f47040a11d0
slurm-2:2498277:2498586 [0] NCCL INFO graph/search.cc:402 Mem Alloc Size 4 pointer 0x7f47040a1210
    dist._verify_model_across_ranks(self.process_group, parameters)
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:401 Mem Alloc Size 4 pointer 0x7fa25c0a11f0
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:402 Mem Alloc Size 4 pointer 0x7fa25c0a1210
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:452 Mem Alloc Size 4 pointer 0x7fa25c0a1170
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

I did get a warning:

libibverbs: Warning: couldn't load driver 'libi40iw-rdmav25.so': libi40iw-rdmav25.so: cannot open shared object file: No such file or directory

But as far as I know, libi40iw is for the Intel Ethernet Connection X722 RDMA, and my NIC is an E810, which isn't compatible with that library.

I reran the code today and found this in the log:

misc/ibvwrap.cc:284 NCCL WARN Call to ibv_modify_qp failed with error Invalid argument

@BH4CYI It might be useful to contact the NCCL team directly to debug this further. I'd suggest creating a GitHub issue at GitHub - NVIDIA/nccl: Optimized primitives for collective multi-GPU communication, and sharing all the NCCL debug logs with them to determine a resolution.

The GitHub issue is here: Intel E810 ncclSystemError: System call (socket, malloc, munmap, etc) failed. · Issue #622 · NVIDIA/nccl · GitHub

I applied the patch mentioned in the GitHub issue, and nccl-tests passed across both nodes. I think what I can do now is recompile PyTorch with the patched NCCL.
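
As a quick sanity check after the rebuild, to confirm the bundled NCCL is actually the patched one (the version call returns an int such as 21003 on older PyTorch releases and a tuple on newer ones):

# Print the NCCL version the rebuilt PyTorch was linked against, plus the
# CUDA and PyTorch versions for completeness.
import torch
print(torch.cuda.nccl.version())
print(torch.version.cuda, torch.__version__)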
