RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

dmlpt · July 30, 2021, 8:26pm

Hi All,
I am trying to run DINO on multiple nodes using the facebookincubator/submitit repo. We have a Slurm cluster, and I can train DINO on it using a single node (8 GPUs) without submitit, but when I try to run on multiple nodes I get the error below:

submitit ERROR (2021-07-30 01:10:30,581) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/home/user/skanaconda3/envs/url/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/skanaconda3/envs/url/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/skanaconda3/envs/url/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/home/user/skanaconda3/envs/url/lib/python3.8/site-packages/submitit/core/submission.py", line 71, in submitit_main
    process_job(args.folder)
  File "/home/user/skanaconda3/envs/url/lib/python3.8/site-packages/submitit/core/submission.py", line 64, in process_job
    raise error
  File "/home/user/skanaconda3/envs/url/lib/python3.8/site-packages/submitit/core/submission.py", line 53, in process_job
    result = delayed.result()
  File "/home/user/skanaconda3/envs/url/lib/python3.8/site-packages/submitit/core/utils.py", line 128, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "run_with_submitit.py", line 67, in __call__
    main_dino_initialize_all.train_dino(self.args)
  File "/home/user/code/dino/main_dino_initialize_all.py", line 143, in train_dino
    utils.init_distributed_mode(args)
  File "/home/user/code/dino/utils.py", line 468, in init_distributed_mode
    dist.init_process_group(
  File "/home/user/skanaconda3/envs/url/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 439, in init_process_group
    _default_pg = _new_process_group_helper(
  File "/home/user/skanaconda3/envs/url/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 528, in _new_process_group_helper
    pg = ProcessGroupNCCL(
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

From the logs, I see that the job is initially assigned to two nodes (with 8 GPUs in each node) and then stops with the above error. I think the code crashes at this line. Why does at::cuda::getNumGPUs() return 0 when there are GPUs available?

Thanks in advance!

H-Huang · August 2, 2021, 5:02pm

I am not familiar with submitit, so I am unsure how to validate the number of GPUs it is using. Before init_process_group, can you also try printing the value of torch.cuda.device_count()? (torch.cuda.device_count — PyTorch master documentation). This may help narrow down why no GPUs are detected and whether this is an issue in the distributed package.
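
Something along these lines could be dropped in just before the dist.init_process_group() call in utils.init_distributed_mode. It is only a rough sketch: gpu_visibility_report is a made-up helper name, and the choice of environment variables (CUDA_VISIBLE_DEVICES plus a few standard Slurm ones) is my assumption about what is useful to look at, not something taken from DINO or submitit.

```python
import os
import socket


def gpu_visibility_report():
    """Collect what this process can see; hypothetical helper for debugging only."""
    report = {
        "host": socket.gethostname(),
        # If this is set to an empty string, CUDA sees no devices at all.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # Standard Slurm variables; they are None when not running under Slurm.
        "SLURM_NODEID": os.environ.get("SLURM_NODEID"),
        "SLURM_PROCID": os.environ.get("SLURM_PROCID"),
        "SLURM_LOCALID": os.environ.get("SLURM_LOCALID"),
    }
    try:
        import torch

        report["cuda_available"] = torch.cuda.is_available()
        report["device_count"] = torch.cuda.device_count()
    except ImportError:
        # Lets the sketch run on a machine without PyTorch installed.
        report["cuda_available"] = None
        report["device_count"] = None
    return report


# Print once per process so each task's view appears in its own log.
print(gpu_visibility_report(), flush=True)
```

If device_count comes back as 0 on the tasks that crash while nvidia-smi on the same node lists 8 GPUs, that would point to how the job is launched (for example, which GPUs are exposed to each task) rather than to the distributed package itself.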