NCCL error when training on GCP

Please provide the following information when requesting support.

• Hardware: GCP (Google Cloud Platform), 8× NVIDIA A100 40GB, Ubuntu 20.04
• torch==1.9.0+cu111
• torchvision==0.10.0+cu111
• torchaudio==0.9.0
• python==3.8
• mmcv-full==1.6.0

When I run distributed training as shown below, the error occurs.
I believe the root cause is this warning: graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/..../fffffff:fff.
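For context, during initialization NCCL walks sysfs to build the PCI topology, and this warning means resolving a device path failed. A rough pure-Python sketch of that check (the helper name is mine, not NCCL's actual code):

```python
import os

def find_real_path(sysfs_path):
    """Mimic NCCL's topology step: resolve a /sys/class/pci_bus entry.

    Returns the resolved path, or None when it does not exist --
    the situation that produces "Could not find real path of ...".
    """
    resolved = os.path.realpath(sysfs_path)
    if not os.path.exists(resolved):
        return None
    return resolved

# A path that does not exist resolves to None, like the failing busId here.
print(find_real_path("/sys/class/pci_bus/fffffff"))
```

In a GCP guest this path is built from the busId the hypervisor reports, so a mismatch between the virtualized PCI layout and sysfs can trigger exactly this failure.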

a100-40g-8ea-aisa-northeast1-a:233417:233417 [0] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233417:233417 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

a100-40g-8ea-aisa-northeast1-a:233417:233417 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233417:233417 [0] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233417:233417 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
a100-40g-8ea-aisa-northeast1-a:233421:233421 [4] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233421:233421 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

a100-40g-8ea-aisa-northeast1-a:233421:233421 [4] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233421:233421 [4] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233421:233421 [4] NCCL INFO Using network Socket
a100-40g-8ea-aisa-northeast1-a:233422:233422 [5] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233422:233422 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

a100-40g-8ea-aisa-northeast1-a:233422:233422 [5] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233422:233422 [5] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233422:233422 [5] NCCL INFO Using network Socket
a100-40g-8ea-aisa-northeast1-a:233420:233420 [3] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233420:233420 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

a100-40g-8ea-aisa-northeast1-a:233420:233420 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233420:233420 [3] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233420:233420 [3] NCCL INFO Using network Socket
a100-40g-8ea-aisa-northeast1-a:233423:233423 [6] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233423:233423 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

a100-40g-8ea-aisa-northeast1-a:233423:233423 [6] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233423:233423 [6] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233423:233423 [6] NCCL INFO Using network Socket
a100-40g-8ea-aisa-northeast1-a:233419:233419 [2] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233419:233419 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

a100-40g-8ea-aisa-northeast1-a:233419:233419 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233419:233419 [2] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233419:233419 [2] NCCL INFO Using network Socket
a100-40g-8ea-aisa-northeast1-a:233418:233418 [1] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233418:233418 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

a100-40g-8ea-aisa-northeast1-a:233418:233418 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233418:233418 [1] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233418:233418 [1] NCCL INFO Using network Socket
a100-40g-8ea-aisa-northeast1-a:233424:233424 [7] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233424:233424 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

a100-40g-8ea-aisa-northeast1-a:233424:233424 [7] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233424:233424 [7] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233424:233424 [7] NCCL INFO Using network Socket

a100-40g-8ea-aisa-northeast1-a:233422:233973 [5] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233422:233973 [5] NCCL INFO graph/xml.cc:648 -> 2
a100-40g-8ea-aisa-northeast1-a:233422:233973 [5] NCCL INFO graph/xml.cc:665 -> 2
a100-40g-8ea-aisa-northeast1-a:233422:233973 [5] NCCL INFO graph/topo.cc:523 -> 2
a100-40g-8ea-aisa-northeast1-a:233422:233973 [5] NCCL INFO init.cc:581 -> 2
a100-40g-8ea-aisa-northeast1-a:233422:233973 [5] NCCL INFO init.cc:840 -> 2
a100-40g-8ea-aisa-northeast1-a:233422:233973 [5] NCCL INFO group.cc:73 -> 2 [Async thread]
Traceback (most recent call last):
  File "tools/train.py", line 301, in <module>
    main()
  File "tools/train.py", line 289, in main
    train_model(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 344, in train_model
    train_detector(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 226, in train_detector
    model = MMDistributedDataParallel(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

a100-40g-8ea-aisa-northeast1-a:233423:233974 [6] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233423:233974 [6] NCCL INFO graph/xml.cc:648 -> 2
a100-40g-8ea-aisa-northeast1-a:233423:233974 [6] NCCL INFO graph/xml.cc:665 -> 2
a100-40g-8ea-aisa-northeast1-a:233423:233974 [6] NCCL INFO graph/topo.cc:523 -> 2
a100-40g-8ea-aisa-northeast1-a:233423:233974 [6] NCCL INFO init.cc:581 -> 2
a100-40g-8ea-aisa-northeast1-a:233423:233974 [6] NCCL INFO init.cc:840 -> 2
a100-40g-8ea-aisa-northeast1-a:233423:233974 [6] NCCL INFO group.cc:73 -> 2 [Async thread]
(identical Python traceback as the one above, repeated for this rank)

a100-40g-8ea-aisa-northeast1-a:233420:233975 [3] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233420:233975 [3] NCCL INFO graph/xml.cc:648 -> 2
a100-40g-8ea-aisa-northeast1-a:233420:233975 [3] NCCL INFO graph/xml.cc:665 -> 2
a100-40g-8ea-aisa-northeast1-a:233420:233975 [3] NCCL INFO graph/topo.cc:523 -> 2
a100-40g-8ea-aisa-northeast1-a:233420:233975 [3] NCCL INFO init.cc:581 -> 2
a100-40g-8ea-aisa-northeast1-a:233420:233975 [3] NCCL INFO init.cc:840 -> 2
a100-40g-8ea-aisa-northeast1-a:233420:233975 [3] NCCL INFO group.cc:73 -> 2 [Async thread]
(identical Python traceback as the one above, repeated for this rank)

a100-40g-8ea-aisa-northeast1-a:233418:233977 [1] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233418:233977 [1] NCCL INFO graph/xml.cc:648 -> 2
a100-40g-8ea-aisa-northeast1-a:233418:233977 [1] NCCL INFO graph/xml.cc:665 -> 2
a100-40g-8ea-aisa-northeast1-a:233418:233977 [1] NCCL INFO graph/topo.cc:523 -> 2
a100-40g-8ea-aisa-northeast1-a:233418:233977 [1] NCCL INFO init.cc:581 -> 2
a100-40g-8ea-aisa-northeast1-a:233418:233977 [1] NCCL INFO init.cc:840 -> 2
a100-40g-8ea-aisa-northeast1-a:233418:233977 [1] NCCL INFO group.cc:73 -> 2 [Async thread]

a100-40g-8ea-aisa-northeast1-a:233419:233976 [2] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233419:233976 [2] NCCL INFO graph/xml.cc:648 -> 2
a100-40g-8ea-aisa-northeast1-a:233419:233976 [2] NCCL INFO graph/xml.cc:665 -> 2
a100-40g-8ea-aisa-northeast1-a:233419:233976 [2] NCCL INFO graph/topo.cc:523 -> 2
a100-40g-8ea-aisa-northeast1-a:233419:233976 [2] NCCL INFO init.cc:581 -> 2
a100-40g-8ea-aisa-northeast1-a:233419:233976 [2] NCCL INFO init.cc:840 -> 2
a100-40g-8ea-aisa-northeast1-a:233419:233976 [2] NCCL INFO group.cc:73 -> 2 [Async thread]
(the identical Python traceback is repeated for ranks 1 and 2, interleaved with the NCCL output above)

a100-40g-8ea-aisa-northeast1-a:233417:233972 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233417:233972 [0] NCCL INFO graph/xml.cc:648 -> 2
a100-40g-8ea-aisa-northeast1-a:233417:233972 [0] NCCL INFO graph/xml.cc:665 -> 2
a100-40g-8ea-aisa-northeast1-a:233417:233972 [0] NCCL INFO graph/topo.cc:523 -> 2
a100-40g-8ea-aisa-northeast1-a:233417:233972 [0] NCCL INFO init.cc:581 -> 2
a100-40g-8ea-aisa-northeast1-a:233417:233972 [0] NCCL INFO init.cc:840 -> 2
a100-40g-8ea-aisa-northeast1-a:233417:233972 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
(identical Python traceback as the one above, repeated for this rank)

a100-40g-8ea-aisa-northeast1-a:233424:233980 [7] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233424:233980 [7] NCCL INFO graph/xml.cc:648 -> 2
a100-40g-8ea-aisa-northeast1-a:233424:233980 [7] NCCL INFO graph/xml.cc:665 -> 2
a100-40g-8ea-aisa-northeast1-a:233424:233980 [7] NCCL INFO graph/topo.cc:523 -> 2
a100-40g-8ea-aisa-northeast1-a:233424:233980 [7] NCCL INFO init.cc:581 -> 2
a100-40g-8ea-aisa-northeast1-a:233424:233980 [7] NCCL INFO init.cc:840 -> 2
a100-40g-8ea-aisa-northeast1-a:233424:233980 [7] NCCL INFO group.cc:73 -> 2 [Async thread]

a100-40g-8ea-aisa-northeast1-a:233421:233978 [4] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233421:233978 [4] NCCL INFO graph/xml.cc:648 -> 2
a100-40g-8ea-aisa-northeast1-a:233421:233978 [4] NCCL INFO graph/xml.cc:665 -> 2
a100-40g-8ea-aisa-northeast1-a:233421:233978 [4] NCCL INFO graph/topo.cc:523 -> 2
a100-40g-8ea-aisa-northeast1-a:233421:233978 [4] NCCL INFO init.cc:581 -> 2
a100-40g-8ea-aisa-northeast1-a:233421:233978 [4] NCCL INFO init.cc:840 -> 2
a100-40g-8ea-aisa-northeast1-a:233421:233978 [4] NCCL INFO group.cc:73 -> 2 [Async thread]
(the identical Python traceback is repeated for ranks 7 and 4, interleaved with the NCCL output above)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 233417) of binary: /home/morai_developer/anaconda3/envs/cmt/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/7/error.json
/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet/utils/setup_env.py:48: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
projects.mmdet3d_plugin
(the same warning and projects.mmdet3d_plugin line are repeated for each of the 8 workers)

Below is the output of nvidia-smi.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM…    Off  | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0    53W / 400W |    112MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM…    Off  | 00000000:00:05.0 Off |                    0 |
| N/A   29C    P0    54W / 400W |      7MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM…    Off  | 00000000:00:06.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      7MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM…    Off  | 00000000:00:07.0 Off |                    0 |
| N/A   30C    P0    58W / 400W |      7MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM…    Off  | 00000000:80:00.0 Off |                    0 |
| N/A   29C    P0    57W / 400W |      7MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM…    Off  | 00000000:80:01.0 Off |                    0 |
| N/A   31C    P0    57W / 400W |      7MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM…    Off  | 00000000:80:02.0 Off |                    0 |
| N/A   28C    P0    56W / 400W |      7MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM…    Off  | 00000000:80:03.0 Off |                    0 |
| N/A   30C    P0    52W / 400W |      7MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                 95MiB |
|    0   N/A  N/A      2585      G   /usr/bin/gnome-shell               12MiB |
|    1   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                  4MiB |
|    4   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                  4MiB |
|    5   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                  4MiB |
|    6   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                  4MiB |
|    7   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

Below is the output of nvcc -V.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
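As a cross-check of which NCCL runtime is actually present on the machine, the C API ncclGetVersion can be queried directly via ctypes. This is a hedged sketch (the helper name is mine; ncclGetVersion is NCCL's real entry point) that simply returns None when libnccl is not installed:

```python
import ctypes

def nccl_runtime_version():
    """Return the loaded libnccl version as (major, minor, patch), or None."""
    try:
        lib = ctypes.CDLL("libnccl.so.2")
    except OSError:
        return None  # libnccl not installed or not on the loader path
    ver = ctypes.c_int()
    if lib.ncclGetVersion(ctypes.byref(ver)) != 0:
        return None
    v = ver.value
    # NCCL < 2.9 encodes MAJOR*1000 + MINOR*100 + PATCH (e.g. 2708 = 2.7.8);
    # NCCL >= 2.9 encodes MAJOR*10000 + MINOR*100 + PATCH.
    if v < 10000:
        return (v // 1000, (v // 100) % 10, v % 100)
    return (v // 10000, (v // 100) % 100, v % 100)

print(nccl_runtime_version())
```

Note that the "NCCL version 2.7.8" in the error comes from the copy statically bundled with the torch 1.9 wheel, which can differ from any system-wide libnccl this reports.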

I see you are using CUDA 11.8 but NCCL 2.7.8, which has not been tested with that CUDA version. Can you try upgrading your NCCL version?

You can try the NCCL version that ships with the PyTorch nightlies, or simply upgrade to NCCL 2.17.
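A minimal sketch of the upgrade path, assuming a pip-based environment (the exact torch version and index URL are illustrative; match them to your driver):

```shell
# Newer PyTorch wheels bundle a newer NCCL; the cu118 index matches the
# CUDA 11.8 toolkit shown above.
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu118

# Afterwards, confirm which NCCL version PyTorch now ships with.
python -c "import torch; print(torch.cuda.nccl.version())"
```

Since torch 1.9 links NCCL statically, installing a system libnccl2 alone will not change the version the wheel uses; upgrading the PyTorch build itself is the reliable route.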