RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: MASTER_ADDR environment variable is not defined. Set as localhost
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
Traceback (most recent call last):
  File "/ghome/luoxin/projects/liif-lightning-hydra/run.py", line 34, in main
    return train(config)
  File "/ghome/luoxin/projects/liif-lightning-hydra/src/train.py", line 78, in train
    trainer.fit(model=model, datamodule=datamodule)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 108, in start_training
    mp.spawn(self.new_process, **self.mp_spawn_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 157, in new_process
    self.configure_ddp()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 195, in configure_ddp
    self._model = DistributedDataParallel(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

I am using the official PyTorch image pytorch/pytorch:1.8.0-cuda11.1-cudnn8-runtime and installed pytorch-lightning on top of it for multi-GPU training. This looks like a PyTorch problem; how can I tackle it?

Full environment:

PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: GeForce RTX 3090
GPU 1: GeForce RTX 3090
GPU 2: GeForce RTX 3090
GPU 3: GeForce RTX 3090

Nvidia driver version: 460.67
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] pytorch-lightning==1.2.5
[pip3] torch==1.8.0
[pip3] torchelastic==0.2.2
[pip3] torchmetrics==0.2.0
[pip3] torchtext==0.9.0
[pip3] torchvision==0.9.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.1.74              h6bb024c_0    nvidia
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py38he904b0f_0  
[conda] mkl_fft                   1.3.0            py38h54f3939_0  
[conda] mkl_random                1.1.1            py38h0573a6f_0  
[conda] numpy                     1.19.2           py38h54aff64_0  
[conda] numpy-base                1.19.2           py38hfa32c7d_0  
[conda] pytorch                   1.8.0           py3.8_cuda11.1_cudnn8.0.5_0    pytorch
[conda] pytorch-lightning         1.2.5                    pypi_0    pypi
[conda] torchelastic              0.2.2                    pypi_0    pypi
[conda] torchmetrics              0.2.0                    pypi_0    pypi
[conda] torchtext                 0.9.0                      py38    pytorch
[conda] torchvision               0.9.0                py38_cu111    pytorch

You could run the script with NCCL_DEBUG=INFO python script.py args to get more debug information from NCCL, which should also contain the root cause of this issue.
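If it is easier to set the variable from inside the script than on the command line, something like the sketch below should also work. The helper name `enable_nccl_debug` is made up for illustration; the only real requirement is that `NCCL_DEBUG` is set before the process group is created.

```python
import os


def enable_nccl_debug(level: str = "INFO") -> None:
    """Set NCCL_DEBUG for this process and for any children it starts later.

    NCCL reads the variable when it initializes, so call this before
    init_process_group / Trainer.fit. Worker processes started afterwards
    (e.g. by torch.multiprocessing.spawn) inherit the parent's environment.
    """
    os.environ["NCCL_DEBUG"] = level


# Hypothetical helper, called once at the very top of the training script.
enable_nccl_debug()
print("NCCL_DEBUG =", os.environ["NCCL_DEBUG"])
```

This is equivalent to prefixing the command with NCCL_DEBUG=INFO; it just keeps the setting next to the code.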

Yes, I did that and solved the issue simply by running my Docker container with --ipc=host.
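The post does not say why --ipc=host helped. One plausible reason, assumed here, is that the container's private /dev/shm was too small for NCCL's shared-memory buffers (Docker's default is 64 MB). A quick way to check from inside the container is sketched below; `shm_report` is a made-up helper name.

```python
import os
import shutil


def shm_report(path: str = "/dev/shm"):
    """Return total/free size of the shared-memory mount in MiB, or None if absent."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return {
        "total_mib": usage.total / 2**20,
        "free_mib": usage.free / 2**20,
    }


# Run this inside the container; a total of about 64 MiB suggests the Docker default.
print(shm_report())
```

If the reported size is tiny, starting the container with --ipc=host (as above) or with a larger --shm-size are the usual options.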

Hi,
I am also getting the same issue. The detailed error with the environment variable NCCL_DEBUG=INFO set is the following:

u124281:2415987:2415987 [0] NCCL INFO Bootstrap : Using eno1:128.208.233.110<0>
u124281:2415987:2415987 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

u124281:2415987:2415987 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
u124281:2415987:2415987 [0] NCCL INFO NET/Socket : Using [0]eno1:128.208.233.110<0> [1]veth7c224ba:fe80::a0ae:57ff:fe20:75f0%veth7c224ba<0> [2]vethc9ae3a1:fe80::60d3:79ff:fe6a:5b88%vethc9ae3a1<0>
u124281:2415987:2415987 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.6
u124281:2416010:2416010 [0] NCCL INFO Bootstrap : Using eno1:128.208.233.110<0>
u124281:2416010:2416010 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

u124281:2416010:2416010 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
u124281:2416010:2416010 [0] NCCL INFO NET/Socket : Using [0]eno1:128.208.233.110<0> [1]veth7c224ba:fe80::a0ae:57ff:fe20:75f0%veth7c224ba<0> [2]vethc9ae3a1:fe80::60d3:79ff:fe6a:5b88%vethc9ae3a1<0>
u124281:2416010:2416010 [0] NCCL INFO Using network Socket


u124281:2416010:2416046 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 18000
u124281:2415987:2416045 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 18000
u124281:2416010:2416046 [0] NCCL INFO init.cc:904 -> 5
u124281:2415987:2416045 [0] NCCL INFO init.cc:904 -> 5
u124281:2416010:2416046 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
u124281:2415987:2416045 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
Error executing job with overrides: ['stage=1', 'net.arch=ConvNet', 'dataset.name=cifar10', 'dataset.channel=3']
Traceback (most recent call last):
  File "main.py", line 53, in main
    mp.spawn(ddp_wrapper, args = (train_loader, val_loader, test_loader, config_dict, ws), nprocs=ws, join=True)
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/megh98/ddist/utils/utils.py", line 138, in ddp_wrapper
    trainer = Trainer(train_loader, val_loader, test_loader, config_dict, gpu_id = rank)
  File "/home/megh98/ddist/train.py", line 33, in __init__
    self.net = DDP(self.net, device_ids=[self.gpu_id])
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484683044/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

Thanks and please let me know what the potential issue is.

Your error is raised in:

Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 18000

which points to an error in the script: it tries to use the same GPU for different ranks.
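A common way to avoid this is to bind each spawned process to its own device before any NCCL call. The sketch below assumes one process per GPU on a single node; `device_for_rank`, `worker`, the placeholder model, and the port number are illustrative and not taken from your code.

```python
def device_for_rank(local_rank: int, num_devices: int) -> int:
    """Give each local rank its own CUDA device index (one process per GPU)."""
    if not 0 <= local_rank < num_devices:
        raise ValueError(f"rank {local_rank} has no GPU among {num_devices} devices")
    return local_rank


def worker(rank: int, world_size: int) -> None:
    """Entry point for torch.multiprocessing.spawn, which passes the rank first."""
    # Imported here so the helper above can be used without torch installed.
    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Assumed single-node rendezvous settings.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")

    index = device_for_rank(rank, torch.cuda.device_count())
    torch.cuda.set_device(index)  # bind this process to its GPU first
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    device = torch.device("cuda", index)
    model = torch.nn.Linear(8, 2).to(device)  # placeholder model
    ddp_model = DDP(model, device_ids=[index])

    dist.destroy_process_group()
```

It would be launched with mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True), where mp is torch.multiprocessing.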

Thanks, yeah, the issue was that I had to call torch.cuda.set_device() inside the function passed to mp.spawn. But I am now facing another error:

RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Double'

I am using double precision across my entire codebase. I think the above error is more of a feature request than an error.

Thank you for your time and please let me know if I am missing something.

Could you share a minimal code snippet which raises the new error, please?