Hi everyone,
I have a strange issue. I am working on a single node that I know has 4 GPUs and CUDA 11.7 drivers correctly installed. I am trying to request a single GPU, but I get the following error:
File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 374, in get_device_properties
raise AssertionError("Invalid device id")
If I print the following in the Python script:
print('CUDA available? '+str(torch.cuda.is_available()))
print('N. of devices:' + str(torch.cuda.device_count()))
print(os.environ['CUDA_VISIBLE_DEVICES'])
I get:
CUDA available? True
N. of devices:0
GPU-alphanumeric_id
The strange thing is that CUDA is reported as available, yet the device count is 0, as if I had no GPU at all.
I’ve also tried overwriting the variable with os.environ['CUDA_VISIBLE_DEVICES'] = '0', but I get the same error.
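One thing worth ruling out with this override: CUDA reads CUDA_VISIBLE_DEVICES when it initializes, so a value changed after the first CUDA call in the process is typically ignored. A minimal sketch, assuming the override sits at the very top of the entry script (the value '0' is just the one tried above, and the try/except is only there so the snippet also runs where PyTorch is not installed):

```python
import os

# Set the mask before anything can initialize CUDA; the safest place is
# before `import torch` at the top of the entry script.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

try:
    import torch
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None

if torch is not None:
    print("CUDA available?", torch.cuda.is_available())
    print("N. of devices:", torch.cuda.device_count())
print("Mask:", os.environ["CUDA_VISIBLE_DEVICES"])
```

If the count is still 0 with the mask set this early, the ordering is probably not the cause and the mask value itself is the next thing to check.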
Hi Patrick, here is the full error. I’m using PyTorch Lightning.
Traceback (most recent call last):
File "<path_to_project_folder>/main.py", line 456, in <module>
main()
File "<path_to_project_folder>/main.py", line 352, in main
trainer.fit(
File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1048, in _run
self.strategy.setup_environment()
File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 131, in setup_environment
self.accelerator.setup_device(self.root_device)
File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/accelerators/cuda.py", line 43, in setup_device
_check_cuda_matmul_precision(device)
File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/lightning_fabric/accelerators/cuda.py", line 346, in _check_cuda_matmul_precision
major, _ = torch.cuda.get_device_capability(device)
File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
prop = get_device_properties(device)
File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 374, in get_device_properties
raise AssertionError("Invalid device id")
AssertionError: Invalid device id
Reading answers to similar questions, I saw that torch.cuda.device_count() may not return the correct number of devices, and that torch._C._cuda_getDeviceCount() should be used instead. Indeed, print(str(torch._C._cuda_getDeviceCount())) reports that I have 1 device.
However, I still cannot understand why the device index is reported as invalid.
CUDA available? True
N. of devices:0
GPU-<alphanumeric string>
1
current device: 0
And the error:
Traceback (most recent call last):
File "<path_to_project_dir>/main.py", line 460, in <module>
main()
File "<path_to_project_dir>/main.py", line 104, in main
major, _ = torch.cuda.get_device_capability()
File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
prop = get_device_properties(device)
File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 374, in get_device_properties
raise AssertionError("Invalid device id")
AssertionError: Invalid device id
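To compare the two counts mentioned earlier side by side, something like the following could be dropped into the script. This is only a diagnostic sketch: report_device_counts is a hypothetical helper name, and torch._C._cuda_getDeviceCount is a private function that may change between PyTorch versions, so it is looked up defensively.

```python
import os


def report_device_counts():
    """Return both device counts, or None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    # Private API: absent on CPU-only builds, hence the getattr guard.
    raw = getattr(torch._C, "_cuda_getDeviceCount", None)
    return {
        "device_count": torch.cuda.device_count(),
        "raw_count": raw() if raw is not None else None,
    }


print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("counts =", report_device_counts())
```

If the two numbers disagree, as they do above (0 versus 1), the value of CUDA_VISIBLE_DEVICES printed next to them is the first thing to look at.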
Which value did you set this env variable to and where did you get the ID from? My guess is the GPU ID might be wrong and you are thus masking all devices. An easy way to use specific devices would be to use their integer IDs, i.e. 0, 1, 2 etc. instead of the full ID.
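A quick way to see which form the mask currently has is to classify each entry. The sketch below is pure Python and does not touch CUDA; classify_visible_devices is a hypothetical helper, and the "GPU-" / "MIG-" prefixes are assumed to match the UUID style that nvidia-smi -L prints.

```python
import os
import re


def classify_visible_devices(value):
    """Label each comma-separated entry as 'index', 'uuid', or 'unknown'."""
    labels = []
    for entry in value.split(","):
        entry = entry.strip()
        if re.fullmatch(r"\d+", entry):
            labels.append((entry, "index"))
        elif entry.startswith(("GPU-", "MIG-")):
            labels.append((entry, "uuid"))
        else:
            labels.append((entry, "unknown"))
    return labels


mask = os.environ.get("CUDA_VISIBLE_DEVICES")
if mask is None:
    print("CUDA_VISIBLE_DEVICES is not set, so no mask is applied")
else:
    for entry, kind in classify_visible_devices(mask):
        print(f"{entry!r}: {kind}")
```

An entry reported as 'uuid' is the case discussed here; replacing it with the integer index of the same GPU is the simpler form to test with.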