Please provide the following information when requesting support.
• Hardware: GCP (Google Cloud Platform), 8 × A100 40 GB, Ubuntu 20.04
• torch=1.9.0+cu111
• torchvision=0.10.0+cu111
• torchaudio=0.9.0
• python=3.8
• mmcv-full==1.6.0
When I run distributed training, the error shown below occurs.
I suspect the cause is this warning: graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/..../fffffff:fff.
a100-40g-8ea-aisa-northeast1-a:233417:233417 [0] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233417:233417 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
a100-40g-8ea-aisa-northeast1-a:233417:233417 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233417:233417 [0] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233417:233417 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
a100-40g-8ea-aisa-northeast1-a:233421:233421 [4] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233421:233421 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
a100-40g-8ea-aisa-northeast1-a:233421:233421 [4] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233421:233421 [4] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233421:233421 [4] NCCL INFO Using network Socket
a100-40g-8ea-aisa-northeast1-a:233422:233422 [5] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233422:233422 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
a100-40g-8ea-aisa-northeast1-a:233422:233422 [5] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233422:233422 [5] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233422:233422 [5] NCCL INFO Using network Socket
a100-40g-8ea-aisa-northeast1-a:233420:233420 [3] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233420:233420 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
a100-40g-8ea-aisa-northeast1-a:233420:233420 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233420:233420 [3] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233420:233420 [3] NCCL INFO Using network Socket
a100-40g-8ea-aisa-northeast1-a:233423:233423 [6] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233423:233423 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
a100-40g-8ea-aisa-northeast1-a:233423:233423 [6] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233423:233423 [6] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233423:233423 [6] NCCL INFO Using network Socket
a100-40g-8ea-aisa-northeast1-a:233419:233419 [2] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233419:233419 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
a100-40g-8ea-aisa-northeast1-a:233419:233419 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233419:233419 [2] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233419:233419 [2] NCCL INFO Using network Socket
a100-40g-8ea-aisa-northeast1-a:233418:233418 [1] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233418:233418 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
a100-40g-8ea-aisa-northeast1-a:233418:233418 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233418:233418 [1] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233418:233418 [1] NCCL INFO Using network Socket
a100-40g-8ea-aisa-northeast1-a:233424:233424 [7] NCCL INFO Bootstrap : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233424:233424 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
a100-40g-8ea-aisa-northeast1-a:233424:233424 [7] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
a100-40g-8ea-aisa-northeast1-a:233424:233424 [7] NCCL INFO NET/Socket : Using [0]ens8:10.146.0.3<0>
a100-40g-8ea-aisa-northeast1-a:233424:233424 [7] NCCL INFO Using network Socket
a100-40g-8ea-aisa-northeast1-a:233422:233973 [5] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233422:233973 [5] NCCL INFO graph/xml.cc:648 → 2
a100-40g-8ea-aisa-northeast1-a:233422:233973 [5] NCCL INFO graph/xml.cc:665 → 2
a100-40g-8ea-aisa-northeast1-a:233422:233973 [5] NCCL INFO graph/topo.cc:523 → 2
a100-40g-8ea-aisa-northeast1-a:233422:233973 [5] NCCL INFO init.cc:581 → 2
a100-40g-8ea-aisa-northeast1-a:233422:233973 [5] NCCL INFO init.cc:840 → 2
a100-40g-8ea-aisa-northeast1-a:233422:233973 [5] NCCL INFO group.cc:73 → 2 [Async thread]
Traceback (most recent call last):
  File "tools/train.py", line 301, in <module>
    main()
  File "tools/train.py", line 289, in main
    train_model(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 344, in train_model
    train_detector(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 226, in train_detector
    model = MMDistributedDataParallel(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
a100-40g-8ea-aisa-northeast1-a:233423:233974 [6] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233423:233974 [6] NCCL INFO graph/xml.cc:648 → 2
a100-40g-8ea-aisa-northeast1-a:233423:233974 [6] NCCL INFO graph/xml.cc:665 → 2
a100-40g-8ea-aisa-northeast1-a:233423:233974 [6] NCCL INFO graph/topo.cc:523 → 2
a100-40g-8ea-aisa-northeast1-a:233423:233974 [6] NCCL INFO init.cc:581 → 2
a100-40g-8ea-aisa-northeast1-a:233423:233974 [6] NCCL INFO init.cc:840 → 2
a100-40g-8ea-aisa-northeast1-a:233423:233974 [6] NCCL INFO group.cc:73 → 2 [Async thread]
Traceback (most recent call last):
  File "tools/train.py", line 301, in <module>
    main()
  File "tools/train.py", line 289, in main
    train_model(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 344, in train_model
    train_detector(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 226, in train_detector
    model = MMDistributedDataParallel(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
a100-40g-8ea-aisa-northeast1-a:233420:233975 [3] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233420:233975 [3] NCCL INFO graph/xml.cc:648 → 2
a100-40g-8ea-aisa-northeast1-a:233420:233975 [3] NCCL INFO graph/xml.cc:665 → 2
a100-40g-8ea-aisa-northeast1-a:233420:233975 [3] NCCL INFO graph/topo.cc:523 → 2
a100-40g-8ea-aisa-northeast1-a:233420:233975 [3] NCCL INFO init.cc:581 → 2
a100-40g-8ea-aisa-northeast1-a:233420:233975 [3] NCCL INFO init.cc:840 → 2
a100-40g-8ea-aisa-northeast1-a:233420:233975 [3] NCCL INFO group.cc:73 → 2 [Async thread]
Traceback (most recent call last):
  File "tools/train.py", line 301, in <module>
    main()
  File "tools/train.py", line 289, in main
    train_model(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 344, in train_model
    train_detector(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 226, in train_detector
    model = MMDistributedDataParallel(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
a100-40g-8ea-aisa-northeast1-a:233418:233977 [1] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233418:233977 [1] NCCL INFO graph/xml.cc:648 → 2
a100-40g-8ea-aisa-northeast1-a:233418:233977 [1] NCCL INFO graph/xml.cc:665 → 2
a100-40g-8ea-aisa-northeast1-a:233418:233977 [1] NCCL INFO graph/topo.cc:523 → 2
a100-40g-8ea-aisa-northeast1-a:233418:233977 [1] NCCL INFO init.cc:581 → 2
a100-40g-8ea-aisa-northeast1-a:233418:233977 [1] NCCL INFO init.cc:840 → 2
a100-40g-8ea-aisa-northeast1-a:233418:233977 [1] NCCL INFO group.cc:73 → 2 [Async thread]
Traceback (most recent call last):
  File "tools/train.py", line 301, in <module>
a100-40g-8ea-aisa-northeast1-a:233419:233976 [2] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233419:233976 [2] NCCL INFO graph/xml.cc:648 → 2
a100-40g-8ea-aisa-northeast1-a:233419:233976 [2] NCCL INFO graph/xml.cc:665 → 2
a100-40g-8ea-aisa-northeast1-a:233419:233976 [2] NCCL INFO graph/topo.cc:523 → 2
a100-40g-8ea-aisa-northeast1-a:233419:233976 [2] NCCL INFO init.cc:581 → 2
a100-40g-8ea-aisa-northeast1-a:233419:233976 [2] NCCL INFO init.cc:840 → 2
a100-40g-8ea-aisa-northeast1-a:233419:233976 [2] NCCL INFO group.cc:73 → 2 [Async thread]
    main()
  File "tools/train.py", line 289, in main
    train_model(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 344, in train_model
    train_detector(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 226, in train_detector
    model = MMDistributedDataParallel(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Traceback (most recent call last):
  File "tools/train.py", line 301, in <module>
    main()
  File "tools/train.py", line 289, in main
    train_model(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 344, in train_model
    train_detector(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 226, in train_detector
    model = MMDistributedDataParallel(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
a100-40g-8ea-aisa-northeast1-a:233417:233972 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233417:233972 [0] NCCL INFO graph/xml.cc:648 → 2
a100-40g-8ea-aisa-northeast1-a:233417:233972 [0] NCCL INFO graph/xml.cc:665 → 2
a100-40g-8ea-aisa-northeast1-a:233417:233972 [0] NCCL INFO graph/topo.cc:523 → 2
a100-40g-8ea-aisa-northeast1-a:233417:233972 [0] NCCL INFO init.cc:581 → 2
a100-40g-8ea-aisa-northeast1-a:233417:233972 [0] NCCL INFO init.cc:840 → 2
a100-40g-8ea-aisa-northeast1-a:233417:233972 [0] NCCL INFO group.cc:73 → 2 [Async thread]
Traceback (most recent call last):
  File "tools/train.py", line 301, in <module>
    main()
  File "tools/train.py", line 289, in main
    train_model(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 344, in train_model
    train_detector(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 226, in train_detector
    model = MMDistributedDataParallel(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
a100-40g-8ea-aisa-northeast1-a:233424:233980 [7] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
a100-40g-8ea-aisa-northeast1-a:233424:233980 [7] NCCL INFO graph/xml.cc:648 → 2
a100-40g-8ea-aisa-northeast1-a:233424:233980 [7] NCCL INFO graph/xml.cc:665 → 2
a100-40g-8ea-aisa-northeast1-a:233424:233980 [7] NCCL INFO graph/topo.cc:523 → 2
a100-40g-8ea-aisa-northeast1-a:233424:233980 [7] NCCL INFO init.cc:581 → 2
a100-40g-8ea-aisa-northeast1-a:233424:233980 [7] NCCL INFO init.cc:840 → 2
a100-40g-8ea-aisa-northeast1-a:233424:233980 [7] NCCL INFO group.cc:73 → 2 [Async thread]
Traceback (most recent call last):
  File "tools/train.py", line 301, in <module>
a100-40g-8ea-aisa-northeast1-a:233421:233978 [4] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/fffffff/…/…/fffffff:ff:f
    main()
a100-40g-8ea-aisa-northeast1-a:233421:233978 [4] NCCL INFO graph/xml.cc:648 → 2
  File "tools/train.py", line 289, in main
a100-40g-8ea-aisa-northeast1-a:233421:233978 [4] NCCL INFO graph/xml.cc:665 → 2
a100-40g-8ea-aisa-northeast1-a:233421:233978 [4] NCCL INFO graph/topo.cc:523 → 2
a100-40g-8ea-aisa-northeast1-a:233421:233978 [4] NCCL INFO init.cc:581 → 2
a100-40g-8ea-aisa-northeast1-a:233421:233978 [4] NCCL INFO init.cc:840 → 2
    train_model(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 344, in train_model
a100-40g-8ea-aisa-northeast1-a:233421:233978 [4] NCCL INFO group.cc:73 → 2 [Async thread]
    train_detector(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 226, in train_detector
    model = MMDistributedDataParallel(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Traceback (most recent call last):
  File "tools/train.py", line 301, in <module>
    main()
  File "tools/train.py", line 289, in main
    train_model(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 344, in train_model
    train_detector(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet3d/apis/train.py", line 226, in train_detector
    model = MMDistributedDataParallel(
  File "/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 233417) of binary: /home/morai_developer/anaconda3/envs/cmt/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_fnf6mo4m/none__7wlltgj/attempt_1/7/error.json
/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet/utils/setup_env.py:48: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
projects.mmdet3d_plugin
/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet/utils/setup_env.py:48: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
projects.mmdet3d_plugin
/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet/utils/setup_env.py:48: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
projects.mmdet3d_plugin
/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet/utils/setup_env.py:48: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
projects.mmdet3d_plugin
/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet/utils/setup_env.py:48: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
projects.mmdet3d_plugin
/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet/utils/setup_env.py:48: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
projects.mmdet3d_plugin
/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet/utils/setup_env.py:48: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
projects.mmdet3d_plugin
/home/morai_developer/anaconda3/envs/cmt/lib/python3.8/site-packages/mmdet/utils/setup_env.py:48: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
projects.mmdet3d_plugin
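Because the output of the eight worker processes is interleaved, it can help to pull out only the NCCL WARN lines and group them by the device index shown in square brackets. The following is a minimal sketch, assuming each warning has the layout host:pid:tid [device] source NCCL WARN message seen in the log above; the function name warnings_by_device is hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the log above:
#   host:pid:tid [device] file.cc:line NCCL WARN message
WARN_RE = re.compile(
    r"^(?P<host>\S+):(?P<pid>\d+):(?P<tid>\d+) "
    r"\[(?P<device>\d+)\] (?P<source>\S+) NCCL WARN (?P<message>.*)$"
)


def warnings_by_device(log_text):
    """Return {device index: [(source location, message), ...]} for NCCL WARN lines."""
    grouped = defaultdict(list)
    for line in log_text.splitlines():
        match = WARN_RE.match(line.strip())
        if match:
            device = int(match.group("device"))
            grouped[device].append((match.group("source"), match.group("message")))
    return dict(grouped)


if __name__ == "__main__":
    sample = (
        "a100-40g-8ea-aisa-northeast1-a:233417:233417 [0] misc/ibvwrap.cc:63 "
        "NCCL WARN Failed to open libibverbs.so[.1]\n"
        "a100-40g-8ea-aisa-northeast1-a:233417:233417 [0] NCCL INFO Using network Socket\n"
    )
    for device, items in sorted(warnings_by_device(sample).items()):
        for source, message in items:
            print(f"device {device}: {source}: {message}")
```

Lines that were fused together in the paste will only match if each warning starts on its own line.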
Below is the output of nvidia-smi.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0    53W / 400W |    112MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   29C    P0    54W / 400W |      7MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      7MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   30C    P0    58W / 400W |      7MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  Off  | 00000000:80:00.0 Off |                    0 |
| N/A   29C    P0    57W / 400W |      7MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  Off  | 00000000:80:01.0 Off |                    0 |
| N/A   31C    P0    57W / 400W |      7MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  Off  | 00000000:80:02.0 Off |                    0 |
| N/A   28C    P0    56W / 400W |      7MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  Off  | 00000000:80:03.0 Off |                    0 |
| N/A   30C    P0    52W / 400W |      7MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                 95MiB |
|    0   N/A  N/A      2585      G   /usr/bin/gnome-shell               12MiB |
|    1   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                  4MiB |
|    4   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                  4MiB |
|    5   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                  4MiB |
|    6   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                  4MiB |
|    7   N/A  N/A      2477      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
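The warning complains about a path under /sys/class/pci_bus that cannot be resolved. A minimal sketch for checking whether a sysfs entry exists for each Bus-Id reported by nvidia-smi above is given here. The path layout (/sys/class/pci_bus/<domain:bus>/../../<domain:bus:device.function>, with a 4-digit domain) is an assumption inferred from the warning text, and the helper names sysfs_path_for and report are hypothetical.

```python
import os


def sysfs_path_for(bus_id):
    """Build the assumed sysfs path for a PCI bus ID such as 00000000:00:04.0."""
    domain, bus, device_function = bus_id.lower().split(":")
    # nvidia-smi prints an 8-digit domain; sysfs entries use 4 digits.
    domain = domain[-4:]
    short = f"{domain}:{bus}"
    return f"/sys/class/pci_bus/{short}/../../{short}:{device_function}"


def report(bus_ids):
    """Print whether each assumed path exists and what it resolves to."""
    for bus_id in bus_ids:
        path = sysfs_path_for(bus_id)
        if os.path.exists(path):
            print(f"{bus_id}: {os.path.realpath(path)}")
        else:
            print(f"{bus_id}: MISSING ({path})")


if __name__ == "__main__":
    # Bus IDs copied from the nvidia-smi output above.
    report([
        "00000000:00:04.0", "00000000:00:05.0",
        "00000000:00:06.0", "00000000:00:07.0",
        "00000000:80:00.0", "00000000:80:01.0",
        "00000000:80:02.0", "00000000:80:03.0",
    ])
```

If some of the GPUs are reported as MISSING, that would be consistent with the topology detection failing, but it does not by itself confirm the cause.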
Below is the output of nvcc -V.
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
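Three different CUDA versions appear in this post: the cu111 PyTorch build, CUDA 12.0 reported by the driver, and 11.8 reported by nvcc. Prebuilt PyTorch wheels generally ship their own CUDA runtime and NCCL, so the versions that matter for the failing call are the ones PyTorch itself reports. A minimal sketch for printing them from inside the training environment follows; the function name collect_versions is hypothetical, and the torch import is guarded so the script also runs where PyTorch is not installed.

```python
import platform


def collect_versions():
    """Return the Python, PyTorch, CUDA, and NCCL versions visible to this interpreter."""
    info = {"python": platform.python_version(), "torch": None}
    try:
        import torch
    except ImportError:
        return info

    info["torch"] = torch.__version__
    info["torch_cuda"] = torch.version.cuda  # CUDA version PyTorch was built with
    try:
        # May fail on CPU-only builds, so any error is reported rather than raised.
        info["nccl"] = torch.cuda.nccl.version()
    except Exception as exc:
        info["nccl"] = f"unavailable ({exc})"
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

For the environment described above, the NCCL value would be expected to correspond to the 2.7.8 shown in the log.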