RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: MASTER_ADDR environment variable is not defined. Set as localhost
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
Traceback (most recent call last):
  File "/ghome/luoxin/projects/liif-lightning-hydra/run.py", line 34, in main
    return train(config)
  File "/ghome/luoxin/projects/liif-lightning-hydra/src/train.py", line 78, in train
    trainer.fit(model=model, datamodule=datamodule)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 108, in start_training
    mp.spawn(self.new_process, **self.mp_spawn_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 157, in new_process
    self.configure_ddp()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 195, in configure_ddp
    self._model = DistributedDataParallel(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

I'm using the official PyTorch image pytorch/pytorch:1.8.0-cuda11.1-cudnn8-runtime and installed pytorch-lightning on top of it to use multiple GPUs. It seems to be a PyTorch problem; how can I tackle this?

Full environment:

PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: GeForce RTX 3090
GPU 1: GeForce RTX 3090
GPU 2: GeForce RTX 3090
GPU 3: GeForce RTX 3090

Nvidia driver version: 460.67
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] pytorch-lightning==1.2.5
[pip3] torch==1.8.0
[pip3] torchelastic==0.2.2
[pip3] torchmetrics==0.2.0
[pip3] torchtext==0.9.0
[pip3] torchvision==0.9.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.1.74              h6bb024c_0    nvidia
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py38he904b0f_0  
[conda] mkl_fft                   1.3.0            py38h54f3939_0  
[conda] mkl_random                1.1.1            py38h0573a6f_0  
[conda] numpy                     1.19.2           py38h54aff64_0  
[conda] numpy-base                1.19.2           py38hfa32c7d_0  
[conda] pytorch                   1.8.0           py3.8_cuda11.1_cudnn8.0.5_0    pytorch
[conda] pytorch-lightning         1.2.5                    pypi_0    pypi
[conda] torchelastic              0.2.2                    pypi_0    pypi
[conda] torchmetrics              0.2.0                    pypi_0    pypi
[conda] torchtext                 0.9.0                      py38    pytorch
[conda] torchvision               0.9.0                py38_cu111    pytorch

You could run the script with NCCL_DEBUG=INFO python script.py args to get more debug information from NCCL, which should also contain the root cause of this issue.

Yes, I did that, and the issue was solved simply by adding --ipc=host to my docker run command.

Hi,
I am also getting the same issue. The detailed error output with NCCL_DEBUG=INFO set is the following:

u124281:2415987:2415987 [0] NCCL INFO Bootstrap : Using eno1:128.208.233.110<0>
u124281:2415987:2415987 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

u124281:2415987:2415987 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
u124281:2415987:2415987 [0] NCCL INFO NET/Socket : Using [0]eno1:128.208.233.110<0> [1]veth7c224ba:fe80::a0ae:57ff:fe20:75f0%veth7c224ba<0> [2]vethc9ae3a1:fe80::60d3:79ff:fe6a:5b88%vethc9ae3a1<0>
u124281:2415987:2415987 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.6
u124281:2416010:2416010 [0] NCCL INFO Bootstrap : Using eno1:128.208.233.110<0>
u124281:2416010:2416010 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

u124281:2416010:2416010 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
u124281:2416010:2416010 [0] NCCL INFO NET/Socket : Using [0]eno1:128.208.233.110<0> [1]veth7c224ba:fe80::a0ae:57ff:fe20:75f0%veth7c224ba<0> [2]vethc9ae3a1:fe80::60d3:79ff:fe6a:5b88%vethc9ae3a1<0>
u124281:2416010:2416010 [0] NCCL INFO Using network Socket


u124281:2416010:2416046 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 18000
u124281:2415987:2416045 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 18000
u124281:2416010:2416046 [0] NCCL INFO init.cc:904 -> 5
u124281:2415987:2416045 [0] NCCL INFO init.cc:904 -> 5
u124281:2416010:2416046 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
u124281:2415987:2416045 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
Error executing job with overrides: ['stage=1', 'net.arch=ConvNet', 'dataset.name=cifar10', 'dataset.channel=3']
Traceback (most recent call last):
  File "main.py", line 53, in main
    mp.spawn(ddp_wrapper, args = (train_loader, val_loader, test_loader, config_dict, ws), nprocs=ws, join=True)
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/megh98/ddist/utils/utils.py", line 138, in ddp_wrapper
    trainer = Trainer(train_loader, val_loader, test_loader, config_dict, gpu_id = rank)
  File "/home/megh98/ddist/train.py", line 33, in __init__
    self.net = DDP(self.net, device_ids=[self.gpu_id])
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/megh98/.conda/envs/ddist/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484683044/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

Thanks and please let me know what the potential issue is.

Your error is raised in:

Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 18000

which points to an error in your script that reuses the same GPU for different ranks.
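
For reference, a minimal sketch of the usual mp.spawn setup (hypothetical names), where each spawned rank pins its own CUDA device before building DDP, so no two ranks end up on the same GPU:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # each spawned process pins its own GPU before any NCCL work
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    model = torch.nn.Linear(10, 10).to(rank)
    model = DDP(model, device_ids=[rank])  # one distinct device per rank
    # ... training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)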

Thanks, yeah, the issue was that I had to call torch.cuda.set_device() inside the function passed to mp.spawn. But I am now facing another error, which is the following:

RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Double'

I am using double precision across my entire codebase. I think the above error is more of a feature request than an error.
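
For context, a rough sketch of the pattern that can trigger this (an assumption on my part: if the class-index targets were also cast to double, nll_loss rejects them, since it expects integer class indices; keeping the targets as long while the rest of the code stays float64 avoids it):

import torch
import torch.nn.functional as F

log_probs = torch.randn(8, 5, device="cuda").double().log_softmax(dim=1)
target = torch.randint(0, 5, (8,), device="cuda")  # class indices, dtype long

# passing double targets (e.g. target.double()) trips the dtype check on CUDA;
# keeping the targets as long while the model stays float64 avoids it
loss = F.nll_loss(log_probs, target)
print(loss)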

Thank you for your time and please let me know if I am missing something.

Could you share a minimal code snippet which raises the new error, please?

Actually, for me NCCL_DEBUG=INFO gives the same error output; no extra information is shown.
But I remember that at one point I saw the following in the error output.
My problem has not been solved by any other means; I have tried a lot.

RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=508760, ip=172.16.96.59, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7fa16a4327d0>)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
    return fn(*args, **kwargs)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/ray_launcher.py", line 301, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run
    self.__setup_profiler()
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 215, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2084, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 0 failed (Connect)

I am getting this same error both when running with Ray and when running with torch.distributed.run.

This would mean that either NCCL isn’t used at all or that you are not exporting the env variable properly. I would recommend exporting it in your terminal, not trying to export it in your Python script.
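
One quick sanity check (just a suggestion, not from the thread): print the variable from inside the training script to confirm it actually reaches every worker process:

import os
# expect "INFO" in every rank if the export actually propagated
print("NCCL_DEBUG =", os.environ.get("NCCL_DEBUG"))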

No sir, I even tried NCCL_DEBUG=INFO python ….py.
But I forgot to mention that I also exported the variable.

(RayExecutor pid=426700, ip=172.16.0.2) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
(RayExecutor pid=575995) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=575995) distributed_backend=nccl
(RayExecutor pid=575995) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=575995) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=575995) 
(RayExecutor pid=575995) GPU available: True (cuda), used: True (Please ignore the previous info [GPU used: False]).
(RayExecutor pid=575995) hostssh:575995:575995 [0] NCCL INFO Bootstrap : Using enp3s0:172.16.96.59<0>
(RayExecutor pid=575995) hostssh:575995:575995 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=575995) hostssh:575995:575995 [0] NCCL INFO cudaDriverVersion 11070
(RayExecutor pid=575995) NCCL version 2.14.3+cuda11.7

Hi @ptrblck,

Running into a similar issue. Setting NCCL_DEBUG=INFO gives:

initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Global seed set to 73
Building tokeniser...
Finished tokeniser.
Reading dataset...
Using USPTO 50K dataset without type tokens.
Finished dataset.
Building data module...
Using a batch size of 4.
Building data module for backward prediction task...
Using 24 workers for data module.
Finished datamodule.
Train steps: 125100
Loading model...
Finished model.
Building trainer...
Num gpus: 2
Accelerator: ddp
Finished trainer.
Fitting data module to model
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
INFO:lightning:You have not specified an optimizer or scheduler within the DeepSpeedconfig.Using `configure_optimizers` to define optimizer and scheduler.
Using cyclical LR schedule.
Using cyclical LR schedule.
lhtybc8nw6rpnj6h:716:716 [0] NCCL INFO Bootstrap : Using [0]eth0:100.100.189.224<0>
lhtybc8nw6rpnj6h:716:716 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
lhtybc8nw6rpnj6h:716:716 [0] NCCL INFO NET/IB : No device found.
lhtybc8nw6rpnj6h:716:716 [0] NCCL INFO NET/Socket : Using [0]eth0:100.100.189.224<0>
lhtybc8nw6rpnj6h:716:716 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
lhtybc8nw6rpnj6h:791:791 [1] NCCL INFO Bootstrap : Using [0]eth0:100.100.189.224<0>
lhtybc8nw6rpnj6h:791:791 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
lhtybc8nw6rpnj6h:791:791 [1] NCCL INFO NET/IB : No device found.
lhtybc8nw6rpnj6h:791:791 [1] NCCL INFO NET/Socket : Using [0]eth0:100.100.189.224<0>
lhtybc8nw6rpnj6h:791:791 [1] NCCL INFO Using network Socket

lhtybc8nw6rpnj6h:791:875 [1] misc/nvmlwrap.cc:115 NCCL WARN nvmlInit() failed: Driver/library version mismatch
lhtybc8nw6rpnj6h:791:875 [1] NCCL INFO graph/xml.cc:662 -> 2
lhtybc8nw6rpnj6h:791:875 [1] NCCL INFO graph/topo.cc:523 -> 2
lhtybc8nw6rpnj6h:791:875 [1] NCCL INFO init.cc:581 -> 2
lhtybc8nw6rpnj6h:791:875 [1] NCCL INFO init.cc:840 -> 2
lhtybc8nw6rpnj6h:791:875 [1] NCCL INFO group.cc:73 -> 2 [Async thread]
Traceback (most recent call last):
  File "/home/cdsw/Chemformer-main/molbart/fine_tune.py", line 214, in <module>
    main(args)
  File "/home/cdsw/Chemformer-main/molbart/fine_tune.py", line 168, in main
    trainer.fit(model, datamodule=dm)
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 511, in fit
    self.pre_dispatch()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 540, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 84, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 174, in pre_dispatch
    self.init_deepspeed()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 193, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 224, in _initialize_deepspeed_train
    config_params=self.config,
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/__init__.py", line 121, in initialize
    config_params=config_params)
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 160, in __init__
    self._configure_distributed_model(model)
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 582, in _configure_distributed_model
    self._broadcast_model()
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 558, in _broadcast_model
    group=self.data_parallel_group)
  File "/home/cdsw/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1043, in broadcast

lhtybc8nw6rpnj6h:716:874 [0] misc/nvmlwrap.cc:115 NCCL WARN nvmlInit() failed: Driver/library version mismatch
lhtybc8nw6rpnj6h:716:874 [0] NCCL INFO graph/xml.cc:662 -> 2
lhtybc8nw6rpnj6h:716:874 [0] NCCL INFO graph/topo.cc:523 -> 2
lhtybc8nw6rpnj6h:716:874 [0] NCCL INFO init.cc:581 -> 2
lhtybc8nw6rpnj6h:716:874 [0] NCCL INFO init.cc:840 -> 2
    work = group.broadcast([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
lhtybc8nw6rpnj6h:716:874 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cdsw/Chemformer-main/molbart/fine_tune.py", line 214, in <module>
    main(args)
  File "/home/cdsw/Chemformer-main/molbart/fine_tune.py", line 168, in main
    trainer.fit(model, datamodule=dm)
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 511, in fit
    self.pre_dispatch()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 540, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 84, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 174, in pre_dispatch
    self.init_deepspeed()
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 193, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/home/cdsw/.local/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 224, in _initialize_deepspeed_train
    config_params=self.config,
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/__init__.py", line 121, in initialize
    config_params=config_params)
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 160, in __init__
    self._configure_distributed_model(model)
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 582, in _configure_distributed_model
    self._broadcast_model()
  File "/home/cdsw/.local/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 558, in _broadcast_model
    group=self.data_parallel_group)
  File "/home/cdsw/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1043, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

I’m using torch 1.8.0+cu111 and pytorch-lightning 1.2.3. Any help would be appreciated.

If you’ve recently updated your driver, reboot the system. Otherwise, check which drivers are used and whether they are compatible with the old PyTorch binary you are using.
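
If it helps, a quick way to compare the installed driver with the CUDA version PyTorch was built against (a sketch, assuming nvidia-smi is on the PATH; when the kernel module and userspace library are out of sync, nvidia-smi itself usually reports the mismatch):

import subprocess
import torch

print("PyTorch:", torch.__version__, "| built with CUDA", torch.version.cuda)
# query the driver; this itself tends to fail loudly on a driver/library mismatch
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
    capture_output=True, text=True)
print(out.stdout or out.stderr)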

As for my case, I used pytorch/pytorch:1.8.1-cuda10.2-cudnn7-devel and ran into the same error:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554786529/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

And running docker with --ipc=host didn't help.

I switched to pytorch/pytorch:1.11.0-cuda11.3-cudnn8-devel, no longer needed --ipc=host, and the problem was solved.

Hello, I encountered this error while running my training and cannot find a solution. Can you help me? Thanks.

DeepLearningServerNode1:3716966:3716966 [1] NCCL INFO NET/IB : No device found.
DeepLearningServerNode1:3716966:3716966 [1] NCCL INFO NET/Socket : Using [0]enp5s0:192.168.110.46<0> [1]virbr0:192.168.122.1<0> [2]br-508c3b386928:172.23.0.1<0> [3]br-f8b185c81642:172.19.0.1<0> [4]vmnet1:172.16.53.1<0> [5]vmnet8:192.168.175.1<0> [6]veth079bded:fe80::bc7a:5dff:fed4:e15e%veth079bded<0>
DeepLearningServerNode1:3716965:3719816 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff
DeepLearningServerNode1:3716966:3719817 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff
DeepLearningServerNode1:3716965:3719816 [0] NCCL INFO Channel 00 : 0 1
DeepLearningServerNode1:3716965:3719816 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
DeepLearningServerNode1:3716966:3719817 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via direct shared memory
DeepLearningServerNode1:3716965:3719816 [0] NCCL INFO Using 256 threads, Min Comp Cap 8, Trees disabled
DeepLearningServerNode1:3716965:3719816 [0] NCCL INFO comm 0x7fa2d0002590 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
DeepLearningServerNode1:3716965:3716965 [0] NCCL INFO Launch mode Parallel
DeepLearningServerNode1:3716966:3719817 [1] NCCL INFO comm 0x7fa3f8002590 rank 1 nranks 2 cudaDev 1 nvmlDev 1 - Init COMPLETE

DeepLearningServerNode1:3716965:3716965 [0] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
DeepLearningServerNode1:3716965:3716965 [0] NCCL INFO misc/group.cc:148 -> 1

DeepLearningServerNode1:3716966:3716966 [1] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
DeepLearningServerNode1:3716966:3716966 [1] NCCL INFO misc/group.cc:148 -> 1
Traceback (most recent call last):
  File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/runpy.py", line 193, in _run_module_as_main
Traceback (most recent call last):
  File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/u232080231/JLX/UPR-Net-master/tools/train.py", line 388, in <module>
    "__main__", mod_spec)
  File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/u232080231/JLX/UPR-Net-master/tools/train.py", line 388, in <module>
    LOCAL_RANK, training=True, resume=RESUME)
  File "/home/u232080231/JLX/UPR-Net-master/core/pipeline.py", line 51, in __init__
    output_device=local_rank, find_unused_parameters=False)
  File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__
    LOCAL_RANK, training=True, resume=RESUME)
  File "/home/u232080231/JLX/UPR-Net-master/core/pipeline.py", line 51, in __init__
    output_device=local_rank, find_unused_parameters=False)
    self.broadcast_bucket_size)  File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__

  File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
    self.broadcast_bucket_size)
  File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629427478/work/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629427478/work/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8
Traceback (most recent call last):
  File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/home/u232080231/.conda/envs/uprnet/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/u232080231/.conda/envs/uprnet/bin/python3', '-u', '-m', 'tools.train', '--local_rank=1', '--world_size=2', '--data_root', '/home/public/Datasets/vimeo_triplet', '--train_log_root', '…/upr-train-log', '--exp_name', 'upr-base', '--batch_size', '4', '--nr_data_worker', '2']' returned non-zero exit status 1.

Your NCCL version is quite old by now, as NCCL==2.4.8 was released in April 2019.
I would recommend updating NCCL (as well as PyTorch in case you are also using an older version).
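
For checking what is currently installed, a small sketch (torch.cuda.nccl.version() reports the NCCL version the PyTorch binary ships with; the return format is an int on older releases and a tuple on newer ones):

import torch

print("torch :", torch.__version__)
print("CUDA  :", torch.version.cuda)
print("cuDNN :", torch.backends.cudnn.version())
print("NCCL  :", torch.cuda.nccl.version())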