RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8

LuoXin-s · March 27, 2021, 2:37am

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: MASTER_ADDR environment variable is not defined. Set as localhost
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
Traceback (most recent call last):
  File "/ghome/luoxin/projects/liif-lightning-hydra/run.py", line 34, in main
    return train(config)
  File "/ghome/luoxin/projects/liif-lightning-hydra/src/train.py", line 78, in train
    trainer.fit(model=model, datamodule=datamodule)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 108, in start_training
    mp.spawn(self.new_process, **self.mp_spawn_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 157, in new_process
    self.configure_ddp()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 195, in configure_ddp
    self._model = DistributedDataParallel(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

I use the official PyTorch image pytorch/pytorch:1.8.0-cuda11.1-cudnn8-runtime and installed pytorch-lightning on top of it for multi-GPU training. This looks like a PyTorch problem; how can I tackle it?

Full environment:

PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: GeForce RTX 3090
GPU 1: GeForce RTX 3090
GPU 2: GeForce RTX 3090
GPU 3: GeForce RTX 3090

Nvidia driver version: 460.67
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] pytorch-lightning==1.2.5
[pip3] torch==1.8.0
[pip3] torchelastic==0.2.2
[pip3] torchmetrics==0.2.0
[pip3] torchtext==0.9.0
[pip3] torchvision==0.9.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.1.74              h6bb024c_0    nvidia
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py38he904b0f_0  
[conda] mkl_fft                   1.3.0            py38h54f3939_0  
[conda] mkl_random                1.1.1            py38h0573a6f_0  
[conda] numpy                     1.19.2           py38h54aff64_0  
[conda] numpy-base                1.19.2           py38hfa32c7d_0  
[conda] pytorch                   1.8.0           py3.8_cuda11.1_cudnn8.0.5_0    pytorch
[conda] pytorch-lightning         1.2.5                    pypi_0    pypi
[conda] torchelastic              0.2.2                    pypi_0    pypi
[conda] torchmetrics              0.2.0                    pypi_0    pypi
[conda] torchtext                 0.9.0                      py38    pytorch
[conda] torchvision               0.9.0                py38_cu111    pytorch

ptrblck · March 28, 2021, 7:05am

You could run the script with NCCL_DEBUG=INFO python script.py args to get more debug information from NCCL, which should also contain the root cause of this issue.

LuoXin-s · March 28, 2021, 7:17am

Yes, I did that, and solved the issue simply by using --ipc=host in my docker run command.
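
A likely reason that --ipc=host helps is that it gives the container the host's shared memory, which NCCL can use for communication between GPUs on one machine, while Docker limits /dev/shm to 64 MB by default. The following is a minimal sketch for checking how much shared memory a container has. It assumes the shared-memory mount is at /dev/shm, and `shm_report` is a made-up helper name, not part of PyTorch or NCCL.

```python
import shutil


def shm_report(path: str = "/dev/shm") -> dict:
    """Return total and free space (in MiB) of the filesystem at `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    mib = 1024 * 1024
    return {
        "path": path,
        "total_mib": usage.total // mib,
        "free_mib": usage.free // mib,
    }
```

Calling `print(shm_report())` inside the container with and without --ipc=host should show whether the limit changed.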

Megh_Bhalerao · October 14, 2022, 12:19am

Hi,
I am also getting the same issue. The detailed output with the environment variable NCCL_DEBUG=INFO set is the following:

u124281:2415987:2415987 [0] NCCL INFO Bootstrap : Using eno1:128.208.233.110<0>
u124281:2415987:2415987 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

u124281:2415987:2415987 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
u124281:2415987:2415987 [0] NCCL INFO NET/Socket : Using [0]eno1:128.208.233.110<0> [1]veth7c224ba:fe80::a0ae:57ff:fe20:75f0%veth7c224ba<0> [2]vethc9ae3a1:fe80::60d3:79ff:fe6a:5b88%vethc9ae3a1<0>
u124281:2415987:2415987 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.6
u124281:2416010:2416010 [0] NCCL INFO Bootstrap : Using eno1:128.208.233.110<0>
u124281:2416010:2416010 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

u124281:2416010:2416010 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
u124281:2416010:2416010 [0] NCCL INFO NET/Socket : Using [0]eno1:128.208.233.110<0> [1]veth7c224ba:fe80::a0ae:57ff:fe20:75f0%veth7c224ba<0> [2]vethc9ae3a1:fe80::60d3:79ff:fe6a:5b88%vethc9ae3a1<0>
u124281:2416010:2416010 [0] NCCL INFO Using network Socket


u124281:2416010:2416046 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 18000
u124281:2415987:2416045 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 18000
u124281:2416010:2416046 [0] NCCL INFO init.cc:904 -> 5
u124281:2415987:2416045 [0] NCCL INFO init.cc:904 -> 5
u124281:2416010:2416046 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
u124281:2415987:2416045 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
Error executing job with overrides: ['stage=1', 'net.arch=ConvNet', 'dataset.name=cifar10', 'dataset.channel=3']
Traceback (most recent call last):
  File "main.py", line 53, in main
    mp.spawn(ddp_wrapper, args = (train_loader, val_loader, test_loader, config_dict, ws), nprocs=ws, join=True)
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/megh98/ddist/utils/utils.py", line 138, in ddp_wrapper
    trainer = Trainer(train_loader, val_loader, test_loader, config_dict, gpu_id = rank)
  File "/home/megh98/ddist/train.py", line 33, in __init__
    self.net = DDP(self.net, device_ids=[self.gpu_id])
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484683044/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

Thanks and please let me know what the potential issue is.

ptrblck · October 14, 2022, 5:17am

Your error is raised in:

Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 18000

which points to an error in the script: it tries to use the same GPU for different ranks.
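
A minimal single-node sketch of giving every spawned rank its own GPU follows. The names `device_index_for_rank`, `ddp_worker` and `launch`, the toy `Linear` model, and the port 29500 are placeholders and are not taken from the original script. It assumes one process per GPU on one machine, so the rank doubles as the local GPU index.

```python
import os


def device_index_for_rank(rank: int, num_gpus: int) -> int:
    """Return a distinct local GPU index for `rank`, or fail loudly."""
    if not 0 <= rank < num_gpus:
        # Two ranks sharing one GPU leads to NCCL's "Duplicate GPU detected".
        raise ValueError(f"rank {rank} has no GPU of its own ({num_gpus} visible)")
    return rank


def ddp_worker(rank: int, world_size: int) -> None:
    # torch is imported here so the helper above can be used without it.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")  # placeholder port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    gpu = device_index_for_rank(rank, torch.cuda.device_count())
    torch.cuda.set_device(gpu)  # pin this process before wrapping the model
    model = torch.nn.Linear(8, 2).cuda(gpu)  # toy model for illustration
    model = DDP(model, device_ids=[gpu])

    dist.destroy_process_group()


def launch() -> None:
    import torch
    import torch.multiprocessing as mp

    world_size = torch.cuda.device_count()
    # mp.spawn passes the process index as the first argument (the rank).
    mp.spawn(ddp_worker, args=(world_size,), nprocs=world_size, join=True)
```

`launch()` should be called from under an `if __name__ == "__main__":` guard, since mp.spawn starts fresh interpreter processes that re-import the script.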

Megh_Bhalerao · October 14, 2022, 5:43pm

Thanks, yeah, the issue was that I had to call torch.cuda.set_device() inside the function passed to mp.spawn. But I am now facing another error, which is the following:

RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Double'

I am using double precision across my entire codebase. I think the above error is more of a feature request than an error.

Thank you for your time and please let me know if I am missing something.

ptrblck · October 16, 2022, 7:07am

Could you share a minimal code snippet which raises the new error, please?

NavinKumarMNK · March 14, 2023, 3:18pm

Actually, for me NCCL_DEBUG=INFO gives the same error output; no extra information is shown.
But I remember that one time I saw the "Last error" lines at the end of the output below.
Nothing I have tried has solved my problem, and I have tried a lot.

RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=508760, ip=172.16.96.59, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7fa16a4327d0>)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
    return fn(*args, **kwargs)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/ray_launcher.py", line 301, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run
    self.__setup_profiler()
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 215, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2084, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 0 failed (Connect)

I get this same error both when running with Ray and when running with torch.distributed.run.

ptrblck · March 14, 2023, 4:18pm

This would mean that either NCCL isn’t used at all or that you are not exporting the env variable properly. I would recommend exporting it in your terminal, not trying to export it in your Python script.
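
One way to check that the variable actually reaches the training processes is to print it from inside each worker. The following is a minimal sketch using only the standard library. `nccl_env` is a made-up helper name, and calling it at the start of each worker function is an assumption about how the script is organised.

```python
import os


def nccl_env(environ=None) -> dict:
    """Return every NCCL_* variable visible to the current process."""
    env = os.environ if environ is None else environ
    return {key: value for key, value in env.items() if key.startswith("NCCL_")}


def print_nccl_env(rank: int) -> None:
    # An empty dict here means the export did not reach this process.
    print(f"[rank {rank}] pid={os.getpid()} NCCL env: {nccl_env()}")
```

If the dictionary is empty in the workers although the variable was exported in the terminal, the launcher is probably starting the workers with a different environment.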

NavinKumarMNK · March 14, 2023, 4:26pm

No sir, I even tried NCCL_DEBUG=INFO python ….py
But I forgot to mention that I also exported the variable:

(RayExecutor pid=426700, ip=172.16.0.2) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
(RayExecutor pid=575995) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=575995) distributed_backend=nccl
(RayExecutor pid=575995) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=575995) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=575995) 
(RayExecutor pid=575995) GPU available: True (cuda), used: True (Please ignore the previous info [GPU used: False]).
(RayExecutor pid=575995) hostssh:575995:575995 [0] NCCL INFO Bootstrap : Using enp3s0:172.16.96.59<0>
(RayExecutor pid=575995) hostssh:575995:575995 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=575995) hostssh:575995:575995 [0] NCCL INFO cudaDriverVersion 11070
(RayExecutor pid=575995) NCCL version 2.14.3+cuda11.7

shoang22 · September 14, 2023, 3:24pm

Hi @ptrblck,

Running into a similar issue. Setting NCCL_DEBUG=INFO gives:

initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Global seed set to 73
Building tokeniser...
Finished tokeniser.
Reading dataset...
Using USPTO 50K dataset without type tokens.
Finished dataset.
Building data module...
Using a batch size of 4.
Building data module for backward prediction task...
Using 24 workers for data module.
Finished datamodule.
Train steps: 125100
Loading model...
Finished model.
Building trainer...
Num gpus: 2
Accelerator: ddp
Finished trainer.
Fitting data module to model
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
INFO:lightning:You have not specified an optimizer or scheduler within the DeepSpeedconfig.Using `configure_optimizers` to define optimizer and scheduler.
Using cyclical LR schedule.
Using cyclical LR schedule.
lhtybc8nw6rpnj6h:716:716 [0] NCCL INFO Bootstrap : Using [0]eth0:100.100.189.224<0>
lhtybc8nw6rpnj6h:716:716 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
lhtybc8nw6rpnj6h:716:716 [0] NCCL INFO NET/IB : No device found.
lhtybc8nw6rpnj6h:716:716 [0] NCCL INFO NET/Socket : Using [0]eth0:100.100.189.224<0>
lhtybc8nw6rpnj6h:716:716 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
lhtybc8nw6rpnj6h:791:791 [1] NCCL INFO Bootstrap : Using [0]eth0:100.100.189.224<0>
lhtybc8nw6rpnj6h:791:791 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
lhtybc8nw6rpnj6h:791:791 [1] NCCL INFO NET/IB : No device found.
lhtybc8nw6rpnj6h:791:791 [1] NCCL INFO NET/Socket : Using [0]eth0:100.100.189.224<0>
lhtybc8nw6rpnj6h:791:791 [1] NCCL INFO Using network Socket

lhtybc8nw6rpnj6h:791:875 [1] misc/nvmlwrap.cc:115 NCCL WARN nvmlInit() failed: Driver/library version mismatch
lhtybc8nw6rpnj6h:791:875 [1] NCCL INFO graph/xml.cc:662 -> 2
lhtybc8nw6rpnj6h:791:875 [1] NCCL INFO graph/topo.cc:523 -> 2
lhtybc8nw6rpnj6h:791:875 [1] NCCL INFO init.cc:581 -> 2
lhtybc8nw6rpnj6h:791:875 [1] NCCL INFO init.cc:840 -> 2
lhtybc8nw6rpnj6h:791:875 [1] NCCL INFO group.cc:73 -> 2 [Async thread]
Traceback (most recent call last):
  File "/home/cdsw/Chemformer-main/molbart/fine_tune.py", line 214, in <module>
    main(args)
  File "/home/cdsw/Chemformer-main/molbart/fine_tune.py", line 168, in main
    trainer.fit(model, datamodule=dm)
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 511, in fit
    self.pre_dispatch()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 540, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 84, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 174, in pre_dispatch
    self.init_deepspeed()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 193, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 224, in _initialize_deepspeed_train
    config_params=self.config,
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/__init__.py", line 121, in initialize
    config_params=config_params)
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 160, in __init__
    self._configure_distributed_model(model)
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 582, in _configure_distributed_model
    self._broadcast_model()
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 558, in _broadcast_model
    group=self.data_parallel_group)
  File "/home/cdsw/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1043, in broadcast

lhtybc8nw6rpnj6h:716:874 [0] misc/nvmlwrap.cc:115 NCCL WARN nvmlInit() failed: Driver/library version mismatch
lhtybc8nw6rpnj6h:716:874 [0] NCCL INFO graph/xml.cc:662 -> 2
lhtybc8nw6rpnj6h:716:874 [0] NCCL INFO graph/topo.cc:523 -> 2
lhtybc8nw6rpnj6h:716:874 [0] NCCL INFO init.cc:581 -> 2
lhtybc8nw6rpnj6h:716:874 [0] NCCL INFO init.cc:840 -> 2
    work = group.broadcast([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
lhtybc8nw6rpnj6h:716:874 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cdsw/Chemformer-main/molbart/fine_tune.py", line 214, in <module>
    main(args)
  File "/home/cdsw/Chemformer-main/molbart/fine_tune.py", line 168, in main
    trainer.fit(model, datamodule=dm)
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 511, in fit
    self.pre_dispatch()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 540, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 84, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 174, in pre_dispatch
    self.init_deepspeed()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 193, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 224, in _initialize_deepspeed_train
    config_params=self.config,
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/__init__.py", line 121, in initialize
    config_params=config_params)
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 160, in __init__
    self._configure_distributed_model(model)
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 582, in _configure_distributed_model
    self._broadcast_model()
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 558, in _broadcast_model
    group=self.data_parallel_group)
  File "/home/cdsw/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1043, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

I’m using torch 1.8.0+cu111 and pytorch-lightning 1.2.3. Any help would be appreciated.

ptrblck · September 14, 2023, 4:21pm

If you’ve recently updated your driver, reboot the system. Otherwise, check which drivers are used and whether they are compatible with the old PyTorch binary you are using.
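
The "nvmlInit() failed: Driver/library version mismatch" warning in the log means that the NVML user-space library and the loaded kernel driver come from different driver versions. nvidia-smi also goes through NVML and prints the same message in that situation, so running it is a quick check. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the helper names are made up for illustration.

```python
import subprocess

MISMATCH = "Driver/library version mismatch"


def classify_nvidia_smi(returncode: int, output: str) -> str:
    """Classify an nvidia-smi run as 'mismatch', 'ok' or 'error'."""
    if MISMATCH in output:
        return "mismatch"
    return "ok" if returncode == 0 else "error"


def nvml_status(cmd=("nvidia-smi",)) -> str:
    """Run nvidia-smi (or `cmd`) and report what happened."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, text=True)
    except FileNotFoundError:
        return "missing"  # the binary is not installed or not on PATH
    return classify_nvidia_smi(proc.returncode, proc.stdout + proc.stderr)
```

A result of "mismatch" would point to the reboot (or driver reinstall) case described above rather than to a problem in the training script.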

TripleLuck · September 20, 2023, 3:10pm

In my case, I used pytorch/pytorch:1.8.1-cuda10.2-cudnn7-devel and ran into the same error:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554786529/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

Running docker with --ipc=host didn’t work.

I switched to pytorch/pytorch:1.11.0-cuda11.3-cudnn8-devel, which did not need --ipc=host, and the problem was solved.

JLXin · July 13, 2024, 2:19pm

Hello, I encountered this error while running my training script and cannot find a solution. Can you help me? Thanks.

DeepLearningServerNode1:3716966:3716966 [1] NCCL INFO NET/IB : No device found.
DeepLearningServerNode1:3716966:3716966 [1] NCCL INFO NET/Socket : Using [0]enp5s0:192.168.110.46<0> [1]virbr0:192.168.122.1<0> [2]br-508c3b386928:172.23.0.1<0> [3]br-f8b185c81642:172.19.0.1<0> [4]vmnet1:172.16.53.1<0> [5]vmnet8:192.168.175.1<0> [6]veth079bded:fe80::bc7a:5dff:fed4:e15e%veth079bded<0>
DeepLearningServerNode1:3716965:3719816 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff
DeepLearningServerNode1:3716966:3719817 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff
DeepLearningServerNode1:3716965:3719816 [0] NCCL INFO Channel 00 : 0 1
DeepLearningServerNode1:3716965:3719816 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
DeepLearningServerNode1:3716966:3719817 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via direct shared memory
DeepLearningServerNode1:3716965:3719816 [0] NCCL INFO Using 256 threads, Min Comp Cap 8, Trees disabled
DeepLearningServerNode1:3716965:3719816 [0] NCCL INFO comm 0x7fa2d0002590 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
DeepLearningServerNode1:3716965:3716965 [0] NCCL INFO Launch mode Parallel
DeepLearningServerNode1:3716966:3719817 [1] NCCL INFO comm 0x7fa3f8002590 rank 1 nranks 2 cudaDev 1 nvmlDev 1 - Init COMPLETE

DeepLearningServerNode1:3716965:3716965 [0] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
DeepLearningServerNode1:3716965:3716965 [0] NCCL INFO misc/group.cc:148 -> 1

DeepLearningServerNode1:3716966:3716966 [1] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
DeepLearningServerNode1:3716966:3716966 [1] NCCL INFO misc/group.cc:148 -> 1
Traceback (most recent call last):
File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/runpy.py", line 193, in _run_module_as_main
Traceback (most recent call last):
File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/u232080231/JLX/UPR-Net-master/tools/train.py", line 388, in <module>
"__main__", mod_spec)
File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/u232080231/JLX/UPR-Net-master/tools/train.py", line 388, in <module>
LOCAL_RANK, training=True, resume=RESUME)
File "/home/u232080231/JLX/UPR-Net-master/core/pipeline.py", line 51, in __init__
output_device=local_rank, find_unused_parameters=False)
File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__
LOCAL_RANK, training=True, resume=RESUME)
File "/home/u232080231/JLX/UPR-Net-master/core/pipeline.py", line 51, in __init__
output_device=local_rank, find_unused_parameters=False)
self.broadcast_bucket_size) File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__

File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
self.broadcast_bucket_size)
File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629427478/work/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8
dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629427478/work/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8
Traceback (most recent call last):
File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
main()
File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/u232080231/.conda/envs/uprnet/bin/python3', '-u', '-m', 'tools.train', '--local_rank=1', '--world_size=2', '--data_root', '/home/public/Datasets/vimeo_triplet', '--train_log_root', '…/upr-train-log', '--exp_name', 'upr-base', '--batch_size', '4', '--nr_data_worker', '2']' returned non-zero exit status 1.

ptrblck · July 15, 2024, 2:04pm

Your NCCL version is quite old by now, as NCCL==2.4.8 was released in April 2019.
I would recommend updating NCCL (as well as PyTorch in case you are also using an older version).
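
To see which versions an environment actually uses before and after updating, the values can be printed from PyTorch itself. The following is a minimal sketch; `format_nccl_version` and `report_versions` are made-up helper names. It assumes that `torch.cuda.nccl.version()` returns a tuple in recent releases and a single integer such as 2408 in older ones.

```python
def format_nccl_version(raw) -> str:
    """Turn the value of torch.cuda.nccl.version() into 'major.minor.patch'."""
    if isinstance(raw, tuple):
        return ".".join(str(part) for part in raw)
    # Assumption: older releases encode 2.4.8 as the integer 2408.
    major, rest = divmod(int(raw), 1000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


def report_versions() -> str:
    # torch is imported here so format_nccl_version can be used without it.
    import torch

    nccl = format_nccl_version(torch.cuda.nccl.version())
    return f"torch {torch.__version__}, CUDA {torch.version.cuda}, NCCL {nccl}"
```

Printing `report_versions()` in the training environment shows the NCCL version bundled with the installed PyTorch binary, which is the one that matters for the error above.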