BH4CYI
January 10, 2022, 9:59pm
1
I was trying to start a distributed training across 2 machines using RoCE.
Environment:
Ubuntu 20.04
pytorch torch==1.10.1+cu113
The task is established by slurm:
```
#!/bin/bash
#SBATCH --gres=gpu:2
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
srun python train.py
```
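For context, `train.py` isn't shown in the thread; a minimal sketch of how a DDP script launched this way typically derives its distributed parameters from the variables `srun` exports (the `SLURM_*` names are the standard ones; the `torch.distributed` calls they would feed are indicated in comments):

```python
import os

def slurm_dist_params():
    """Map the environment srun exports for each task to the values a DDP
    script needs (a sketch; the thread's actual train.py is not shown)."""
    rank = int(os.environ["SLURM_PROCID"])         # global rank across both nodes
    local_rank = int(os.environ["SLURM_LOCALID"])  # rank within the node -> GPU index
    world_size = int(os.environ["SLURM_NTASKS"])   # nodes * ntasks-per-node = 4 here
    return rank, local_rank, world_size

# In train.py these would typically feed e.g.:
#   torch.distributed.init_process_group("nccl", rank=rank, world_size=world_size)
#   torch.cuda.set_device(local_rank)
```

With `--nodes=2 --ntasks-per-node=2`, the world size is 4 and `local_rank` selects one of the two GPUs on each node.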
And the error I met is:
```
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
```
Line 957 in pytorch/ProcessGroupNCCL.cpp at v1.10.1 · pytorch/pytorch · GitHub is:
`C10D_NCCL_CHECK(ncclGroupEnd(), c10::nullopt);`
Training works if I set NCCL_IB_DISABLE=1, which suggests this is a RoCE-related issue.
I was wondering if I'm looking in the correct place. Could anyone help me with this issue?
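For reference, the workaround described above as it would appear in the job script (NCCL_IB_DISABLE is a documented NCCL setting; the echo just confirms it is set):

```shell
# Workaround from the post: disable NCCL's IB/RoCE transport so it
# falls back to plain TCP sockets.
export NCCL_IB_DISABLE=1
echo "NCCL_IB_DISABLE=$NCCL_IB_DISABLE"
# then: srun python train.py
```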
Can you re-run the script with the environment variable NCCL_DEBUG=INFO to see if the NCCL debug log reports any errors that could be helpful here?
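Concretely, this means exporting the variable in the sbatch script before the `srun` line; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are documented NCCL settings, and the subsystem filter shown here is an assumption about which output is most relevant:

```shell
# Turn on verbose NCCL logging for the next run; NCCL reads these at init time.
export NCCL_DEBUG=INFO              # per-rank init/transport messages
export NCCL_DEBUG_SUBSYS=INIT,NET   # narrow the output to init and network paths
echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_DEBUG_SUBSYS=$NCCL_DEBUG_SUBSYS"
# then: srun python train.py
```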
BH4CYI
January 11, 2022, 12:50am
3
NCCL_DEBUG=INFO doesn't give any useful information. NCCL_DEBUG_SUBSYS=ALL gives some memory-allocation information, which doesn't help much either:
```
File "train.py", line 226, in train
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:401 Mem Alloc Size 4 pointer 0x7fa25c0a1210
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:402 Mem Alloc Size 4 pointer 0x7fa25c0a11f0
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:452 Mem Alloc Size 4 pointer 0x7fa25c0a1170
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:401 Mem Alloc Size 4 pointer 0x7fa25c0a11f0
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:402 Mem Alloc Size 4 pointer 0x7fa25c0a1210
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:452 Mem Alloc Size 4 pointer 0x7fa25c0a1170
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:401 Mem Alloc Size 4 pointer 0x7fa25c0a1210
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:402 Mem Alloc Size 4 pointer 0x7fa25c0a11f0
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:452 Mem Alloc Size 4 pointer 0x7fa25c0a1170
ddp_model = DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
File "/mnt/anaconda/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
slurm-2:2498277:2498586 [0] NCCL INFO graph/search.cc:402 Mem Alloc Size 4 pointer 0x7f47040a11d0
slurm-2:2498277:2498586 [0] NCCL INFO graph/search.cc:452 Mem Alloc Size 4 pointer 0x7f47040a1170
slurm-2:2498277:2498586 [0] NCCL INFO graph/search.cc:401 Mem Alloc Size 4 pointer 0x7f47040a11d0
slurm-2:2498277:2498586 [0] NCCL INFO graph/search.cc:402 Mem Alloc Size 4 pointer 0x7f47040a1210
dist._verify_model_across_ranks(self.process_group, parameters)
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:401 Mem Alloc Size 4 pointer 0x7fa25c0a11f0
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:402 Mem Alloc Size 4 pointer 0x7fa25c0a1210
slurm-1:2061438:2061652 [0] NCCL INFO graph/search.cc:452 Mem Alloc Size 4 pointer 0x7fa25c0a1170
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
```
BH4CYI
January 11, 2022, 12:54am
4
I did get a warning:

```
libibverbs: Warning: couldn't load driver 'libi40iw-rdmav25.so': libi40iw-rdmav25.so: cannot open shared object file: No such file or directory
```

But as far as I know, libi40iw is for the Intel Ethernet Connection X722 RDMA, and my NIC is an E810, which isn't compatible with that library.
BH4CYI
January 11, 2022, 5:29pm
5
I reran the code today and found this line:

```
misc/ibvwrap.cc:284 NCCL WARN Call to ibv_modify_qp failed with error Invalid argument
```
@BH4CYI It might be useful to contact the NCCL team directly to debug this further. I'd suggest creating a GitHub issue at GitHub - NVIDIA/nccl: Optimized primitives for collective multi-GPU communication and sharing all the NCCL debug logs with them to determine a resolution.
BH4CYI
January 13, 2022, 3:21pm
7
BH4CYI
January 26, 2022, 8:22pm
8
I applied the patch mentioned in the GitHub issue, and nccl-tests now passes across multiple nodes. I think what I can do now is recompile PyTorch with the patched NCCL.
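For that rebuild step, a sketch of the relevant build configuration, assuming a PyTorch source checkout: USE_SYSTEM_NCCL and NCCL_ROOT_DIR are standard PyTorch build variables, and /opt/nccl-patched is a placeholder for wherever the patched NCCL was installed.

```shell
# Point PyTorch's source build at an external (patched) NCCL instead of
# the bundled submodule. The install prefix below is a placeholder.
export USE_SYSTEM_NCCL=1
export NCCL_ROOT_DIR=/opt/nccl-patched   # hypothetical prefix containing include/ and lib/
echo "USE_SYSTEM_NCCL=$USE_SYSTEM_NCCL NCCL_ROOT_DIR=$NCCL_ROOT_DIR"
# then, inside the pytorch checkout:  python setup.py install
```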