Invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59

I have studied similar topics and tried the suggested solutions, but nothing changed. I am using the code from the thread "I have 3 gpu, why torch.cuda.device_count() only return '1'".
Running the script below:

import torch
import sys
from subprocess import call

print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION')
# call(["nvcc", "--version"]) does not work
#! nvcc --version
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__Devices')
# Ask nvidia-smi directly which GPUs the driver can see
call(["nvidia-smi", "--format=csv", "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
print('Active CUDA Device: GPU', torch.cuda.current_device())

print('Available devices ', torch.cuda.device_count())
print('Current cuda device ', torch.cuda.current_device())

Output:

__Python VERSION: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) 
[GCC 7.3.0]
__pyTorch VERSION: 1.4.0
__CUDA VERSION
__CUDNN VERSION: 7603
__Number CUDA Devices: 1
__Devices
index, name, driver_version, memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 11 MiB, 32469 MiB
1, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 0 MiB, 32480 MiB
2, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 0 MiB, 32480 MiB
3, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 0 MiB, 32480 MiB
4, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 0 MiB, 32480 MiB
5, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 0 MiB, 32480 MiB
6, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 0 MiB, 32480 MiB
7, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 0 MiB, 32480 MiB
Active CUDA Device: GPU 0
Available devices  1
Current cuda device  0

Run:

import pycuda
from pycuda import compiler
import pycuda.driver as drv

# Initialize the CUDA driver API and enumerate the devices it can see
drv.init()
print("%d device(s) found." % drv.Device.count())

for ordinal in range(drv.Device.count()):
    dev = drv.Device(ordinal)
    print(ordinal, dev.name())

Output:

1 device(s) found.
0 Tesla V100-PCIE-32GB

When I use --gpu=1 or any other GPU index except 0, it breaks with this message:

  File "/opt/conda/envs/pytorch-ci/lib/python3.6/site-packages/torch/cuda/__init__.py", line 292, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59

Btw, all the scripts are run inside a Docker container.

Can somebody help?

Could you check, if you’ve set the environment variable CUDA_VISIBLE_DEVICES to a particular device and if so, remove it?
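
For example, a quick check from inside the container could look like this (just a sketch):

import os
import torch

# If the launcher or Dockerfile set CUDA_VISIBLE_DEVICES, only the listed GPUs are visible
print('CUDA_VISIBLE_DEVICES:', os.environ.get('CUDA_VISIBLE_DEVICES'))
print('Visible to PyTorch:', torch.cuda.device_count())

If it is set to a single device, unset it in the shell (unset CUDA_VISIBLE_DEVICES) before launching the script.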

@ptrblck Thanks for your reply. I solved it by unsetting CUDA_VISIBLE_DEVICES.

Hi @ptrblck! I was wondering if it’s possible to train in tandem by setting CUDA_VISIBLE_DEVICES twice for two separate training scripts? I have a single machine with two GPUs and I’d like to use each of them for a separate training task.
For instance, what I have now is

CUDA_VISIBLE_DEVICES=0,1 python train0.py

and

CUDA_VISIBLE_DEVICES=2,3 python train1.py

It is generally possible to execute a script on a specific GPU.
However, since you have two GPUs in your machine, you would have to use:

CUDA_VISIBLE_DEVICES=0 python train0.py
# and
CUDA_VISIBLE_DEVICES=1 python train1.py

In your current code snippet you would try to use 4 GPUs (2 for each script).

Hey @ptrblck, excuse me, I meant 4 GPUs. I’m currently using PyTorch Lightning with DDP and I keep getting the aforementioned error.

If you are using CUDA_VISIBLE_DEVICES, the passed GPU indices will be mapped to indices starting at 0 in your script.
E.g. if you are using:

CUDA_VISIBLE_DEVICES=2,3 python train1.py

you would have to use 'cuda:0' and 'cuda:1' inside train1.py instead of 'cuda:2' and 'cuda:3'.
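As a rough sketch, that would look like this inside train1.py:

import torch

# Launched as: CUDA_VISIBLE_DEVICES=2,3 python train1.py
# The two visible GPUs are renumbered starting at 0
x = torch.randn(2, 2, device='cuda:0')  # physical GPU 2
y = torch.randn(2, 2, device='cuda:1')  # physical GPU 3
print(x.device, y.device)

# torch.cuda.set_device(2) would fail with "invalid device ordinal" here,
# because only two devices are visible to this process.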
Could this be the issue in your script?

Hi @ptrblck,

Unfortunately no, I don’t think so, because in train1.py the devices are determined programmatically in the PyTorch Lightning pl.LightningModule by calling self.device.

Any other ideas why this error might be occurring?
Thanks!

I’m not familiar with Lightning, so I'm pinging @williamFalcon in case he’s seen something similar.

Are you seeing the same issue using “plain PyTorch”, and if so, could you post a code snippet we could debug?
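
Something minimal like this (a rough sketch, assuming the same CUDA_VISIBLE_DEVICES setup as your launch command) would already help:

import torch

# Run e.g. as: CUDA_VISIBLE_DEVICES=2,3 python repro.py
print('visible devices:', torch.cuda.device_count())

# Touch every visible device once; this should reproduce the error if the indices are wrong
for i in range(torch.cuda.device_count()):
    x = torch.randn(4, 4, device='cuda:%d' % i)
    print(i, torch.cuda.get_device_name(i), x.sum().item())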