N. of devices 0 and available gpus

Hi everyone,
I have a strange issue. I am working on a single node that I know has 4 GPUs and CUDA drivers 11.7 correctly installed. I am trying to request a single GPU, but I get the following error:

File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 374, in get_device_properties
raise AssertionError("Invalid device id")

If I print the following in the Python script:

import os
import torch

print('CUDA available? '+str(torch.cuda.is_available()))
print('N. of devices:' + str(torch.cuda.device_count()))
print(os.environ['CUDA_VISIBLE_DEVICES'])

I get:

CUDA available? True
N. of devices:0
GPU-alphanumeric_id

The strange thing is that CUDA is reported as available, yet the device count is 0, as if I had no GPU at all.

I’ve also tried overwriting the variable with os.environ['CUDA_VISIBLE_DEVICES'] = '0', but I get the same error.

Can anyone help with this issue?

Which line of code is raising the issue?

Hi Patrick, here is the full traceback. I’m using PyTorch Lightning.

Traceback (most recent call last):
  File "<path_to_project_folder>/main.py", line 456, in <module>
    main()
  File "<path_to_project_folder>/main.py", line 352, in main
    trainer.fit(
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1048, in _run
    self.strategy.setup_environment()
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 131, in setup_environment
    self.accelerator.setup_device(self.root_device)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/accelerators/cuda.py", line 43, in setup_device
    _check_cuda_matmul_precision(device)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/lightning_fabric/accelerators/cuda.py", line 346, in _check_cuda_matmul_precision
    major, _ = torch.cuda.get_device_capability(device)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
    prop = get_device_properties(device)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 374, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id

Reading answers to similar questions, I saw that torch.cuda.device_count() may not return the correct number of devices, and that torch._C._cuda_getDeviceCount() should be used instead. Indeed, print(str(torch._C._cuda_getDeviceCount())) reports that I have 1 device.
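For reference, the checks discussed so far can be gathered into one small report. This is only a sketch: cuda_report is a hypothetical helper name, and torch._C._cuda_getDeviceCount is the private function mentioned in the post above, so it may change or be missing in other PyTorch builds (hence the getattr guard). It reads the environment variable with .get() so that it does not fail when the variable is unset.

```python
import os


def cuda_report():
    """Collect the CUDA-related values discussed in this thread (hypothetical helper)."""
    report = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; only the environment variable can be reported.
        report["torch"] = None
        return report

    report["torch"] = torch.__version__
    report["is_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()

    # Private API taken from the post above; guarded because it may not exist
    # in every build (assumption: it takes no arguments, as used in the thread).
    raw_count = getattr(torch._C, "_cuda_getDeviceCount", None)
    if raw_count is not None:
        report["raw_device_count"] = raw_count()
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If device_count and raw_device_count disagree, as they do in this thread, that mismatch is worth including when asking for help.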

However, I cannot understand why I am getting this error about an invalid device index.

Thanks for your help,
Sara

It seems the function which is causing the error is torch.cuda.get_device_capability(). Indeed, if I do the following:

print('CUDA available? '+str(torch.cuda.is_available()))
print('N. of devices:' + str(torch.cuda.device_count()))
print(os.environ['CUDA_VISIBLE_DEVICES'])
print(str(torch._C._cuda_getDeviceCount()))
print('current device: ' + str(torch.cuda.current_device()))
major, _ = torch.cuda.get_device_capability()

I get the following output:

CUDA available? True
N. of devices:0
GPU-<alphanumeric string>
1
current device: 0

And the following error:

Traceback (most recent call last):
  File "<path_to_project_dir>/main.py", line 460, in <module>
    main()
  File "<path_to_project_dir>/main.py", line 104, in main
    major, _ = torch.cuda.get_device_capability()
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
    prop = get_device_properties(device)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 374, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id

Does this behavior change if you unset CUDA_VISIBLE_DEVICES?
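One way to test this without editing the script is to launch it from a parent process with the variable removed from the child's environment. The following is a minimal sketch using only the standard library; run_without_cuda_mask is a hypothetical helper name, and the small probe command stands in for the real script invocation.

```python
import os
import subprocess
import sys


def run_without_cuda_mask(args):
    """Run a command with CUDA_VISIBLE_DEVICES removed from its environment."""
    env = {k: v for k, v in os.environ.items() if k != "CUDA_VISIBLE_DEVICES"}
    return subprocess.run(args, env=env, capture_output=True, text=True)


# Stand-in for e.g. [sys.executable, "main.py"]: print what the child sees.
probe = "import os; print(os.environ.get('CUDA_VISIBLE_DEVICES'))"
result = run_without_cuda_mask([sys.executable, "-c", probe])
print(result.stdout.strip())  # prints "None": the child does not see the variable
```

Because the child process starts without the variable, this avoids any question of whether the variable was changed before or after CUDA was initialized.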

After unsetting the environment variable, the script runs!

# pop() with a default does nothing if the variable is not set
os.environ.pop('CUDA_VISIBLE_DEVICES', None)

What is your explanation of the cause?

Thanks for your help,
Sara

Which value did you set this env variable to, and where did you get the ID from? My guess is that the GPU ID might be wrong, so you are masking all devices. An easy way to select specific devices is to use their integer IDs, i.e. 0, 1, 2, etc., instead of the full ID.
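To make the distinction between integer IDs and full IDs concrete, here is a small sketch that labels each entry of a CUDA_VISIBLE_DEVICES value. It assumes the usual conventions: entries are comma-separated, and full IDs start with "GPU-" (or "MIG-" for MIG instances). The function name classify_visible_devices is hypothetical, and the check only looks at the format; it cannot tell whether a UUID matches a GPU that actually exists on the node (nvidia-smi -L lists the real ones).

```python
def classify_visible_devices(value):
    """Label each comma-separated entry as 'index', 'uuid', or 'unknown'."""
    labels = []
    for entry in value.split(","):
        entry = entry.strip()
        if entry.isascii() and entry.isdigit():
            kind = "index"    # e.g. "0", "1", "2"
        elif entry.startswith(("GPU-", "MIG-")):
            kind = "uuid"     # full ID; format check only
        else:
            kind = "unknown"  # empty or unrecognized entry
        labels.append((entry, kind))
    return labels


print(classify_visible_devices("0,1"))
# [('0', 'index'), ('1', 'index')]
print(classify_visible_devices("GPU-<alphanumeric string>"))
# [('GPU-<alphanumeric string>', 'uuid')]
```

To fall back to integer IDs, setting os.environ['CUDA_VISIBLE_DEVICES'] = '0' at the very top of the script, before torch is imported, is the safest ordering, since the variable is read when CUDA is initialized.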