NCCL error: unhandled system error, NCCL version 2.4.8

I ran into an NCCL error while trying to run my Python training script:

Init multi-processing training...
d13186ffee3a:57:57 [0] NCCL INFO Bootstrap : Using [0]eth0:172.17.0.2<0>
d13186ffee3a:57:57 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

d13186ffee3a:57:57 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
d13186ffee3a:57:57 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
NCCL version 2.4.8+cuda10.1

d13186ffee3a:57:75 [0] misc/topo.cc:22 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:02/../../0000:02:00.0
d13186ffee3a:57:75 [0] NCCL INFO init.cc:876 -> 2
d13186ffee3a:57:75 [0] NCCL INFO init.cc:909 -> 2
d13186ffee3a:57:75 [0] NCCL INFO init.cc:947 -> 2
d13186ffee3a:57:75 [0] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
Traceback (most recent call last):
  File "train.py", line 174, in <module>
    run_train()
  File "train.py", line 171, in run_train
    multi_train(args, config, Network)
  File "train.py", line 155, in multi_train
    torch.multiprocessing.spawn(train_worker, nprocs=num_gpus, args=(train_config, network, config))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/crowddet/tools/train.py", line 100, in train_worker
    net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[rank], broadcast_buffers=False)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 285, in __init__
    self.broadcast_bucket_size)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 483, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1587428398394/work/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8
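
For reference, the failing call in train_worker is the standard spawn + DistributedDataParallel pattern. Here is a minimal sketch of what the setup boils down to (simplified from the traceback; the rendezvous address/port and the tiny stand-in model are placeholders, not my actual code):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def train_worker(rank, num_gpus):
    # Placeholder rendezvous settings; the real script configures its own.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=rank, world_size=num_gpus)

    torch.cuda.set_device(rank)
    net = torch.nn.Linear(10, 10).cuda(rank)  # stand-in for the real Network

    # This is the call that raises the "unhandled system error" above.
    net = torch.nn.parallel.DistributedDataParallel(
        net, device_ids=[rank], broadcast_buffers=False
    )
    dist.destroy_process_group()

if __name__ == "__main__":
    num_gpus = torch.cuda.device_count()
    mp.spawn(train_worker, nprocs=num_gpus, args=(num_gpus,))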

Environment:

CUDNN_VERSION=7.6.5.32
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
LESSCLOSE=/usr/bin/lesspipe %s %s
NVIDIA_VISIBLE_DEVICES=all
NCCL_VERSION=2.4.8
PWD=/crowddet
HOME=/root
CMAKE_PREFIX_PATH=$(dirname $(which conda))/../
LIBRARY_PATH=/usr/local/cuda/lib64/stubs
TERM=xterm
TORCH_CUDA_ARCH_LIST=6.0 6.1 7.0+PTX
CUDA_PKG_VERSION=10-1=10.1.243-1
CUDA_VERSION=10.1.243
NVIDIA_DRIVER_CAPABILITIES=compute,utility
SHLVL=1
NVIDIA_REQUIRE_CUDA=cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411
TORCH_NVCC_FLAGS=-Xfatbin -compress-all
PATH=/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LESSOPEN=| /usr/bin/lesspipe %s
_=/usr/bin/env

I've tried the fixes suggested for many other NCCL errors, but I'm still unable to resolve this one.

NCCL 2.4.8 was released in 2019 (as were the rest of the libraries you are using), so update PyTorch to the latest stable or nightly release, which ships with newer, supported CUDA and NCCL versions.
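
As a quick sanity check after upgrading, you can print the CUDA and NCCL versions your PyTorch build ships with (these are standard PyTorch APIs; on recent releases torch.cuda.nccl.version() returns a tuple):

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version the build was compiled against
print(torch.cuda.nccl.version())  # NCCL version bundled with this build
print(torch.cuda.is_available())  # confirms the GPU is visible to this build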

Thank you!
I switched to a newer version of PyTorch by building from this PyTorch image:

ARG PYTORCH="2.1.2"
ARG CUDA="11.8"
ARG CUDNN="8"
FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel

When I checked Docker Hub, the compressed size of this newer image is about six times larger than the one I was using before. My NCCL error is now gone, but I'm wondering whether the change may have anything to do with my new error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 62.00 MiB. GPU 0 has a total capacty of 2.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 967.91 MiB is allocated by PyTorch, and 76.09 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

17179869184.00 GiB seems absurd, and I’m not sure if this is normal…
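
One thing I noticed: 17179869184 GiB works out to exactly 2^64 bytes, so it may just be a reporting artifact rather than real usage, and the hard limit is presumably my 2 GiB card. Here is a small sketch I can use to check the actual numbers on the device (torch.cuda.mem_get_info wraps cudaMemGetInfo and is a standard API):

import torch

# 17179869184 GiB == 2**34 GiB == 2**64 bytes, an implausible value that
# suggests the "non-PyTorch memory" figure overflowed or wrapped around.
print(17179869184 * 1024**3 == 2**64)    # True

# Actual free/total memory on GPU 0, reported in MiB.
free, total = torch.cuda.mem_get_info(0)
print(free / 2**20, total / 2**20)

# PyTorch's own bookkeeping for GPU 0, in bytes.
print(torch.cuda.memory_allocated(0), torch.cuda.memory_reserved(0))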