I have tried the methods suggested in similar topics, but nothing changed. I used the code from the thread "I have 3 gpu, why torch.cuda.device_count() only return '1'".
I ran the script below:
import torch
import sys
print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION')
from subprocess import call
# call(["nvcc", "--version"]) does not work
#! nvcc --version
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
call(["nvidia-smi", "--format=csv", "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
print('Active CUDA Device: GPU', torch.cuda.current_device())
print ('Available devices ', torch.cuda.device_count())
print ('Current cuda device ', torch.cuda.current_device())
The output is:
__Python VERSION: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31)
[GCC 7.3.0]
__pyTorch VERSION: 1.4.0
__Number CUDA Devices: 1
index, name, driver_version, memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 11 MiB, 32469 MiB
1, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 0 MiB, 32480 MiB
2, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 0 MiB, 32480 MiB
3, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 0 MiB, 32480 MiB
4, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 0 MiB, 32480 MiB
5, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 0 MiB, 32480 MiB
6, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 0 MiB, 32480 MiB
7, Tesla V100-PCIE-32GB, 418.56, 32480 MiB, 0 MiB, 32480 MiB
Active CUDA Device: GPU 0
Available devices 1
Current cuda device 0
I also checked with PyCUDA:
import pycuda
from pycuda import compiler
import pycuda.driver as drv
drv.init()  # initialize the CUDA driver API before querying devices
print("%d device(s) found." % drv.Device.count())
for ordinal in range(drv.Device.count()):
    dev = drv.Device(ordinal)
    print(ordinal, dev.name())
1 device(s) found.
0 Tesla V100-PCIE-32GB
When I use --gpu=1, or any GPU index other than 0, it fails with this message:
File "/opt/conda/envs/pytorch-ci/lib/python3.6/site-packages/torch/cuda/__init__.py", line 292, in set_device
RuntimeError: cuda runtime error (101) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59
By the way, all of the scripts are run inside a Docker container.
Can anybody help?
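One thing that may be worth checking in a setup like this: nvidia-smi does not honor CUDA_VISIBLE_DEVICES, while the CUDA runtime used by PyTorch and PyCUDA does, so a restricted value of that variable inside the container could explain why nvidia-smi lists eight GPUs and CUDA reports only one. The sketch below is only a diagnostic, not a confirmed cause. The helper names `gpu_visibility_settings` and `count_listed_devices` are hypothetical, and whether either variable is actually set in this container is an assumption to verify.

```python
import os

# Environment variables that commonly limit which GPUs a process can see.
# Whether either one is set in this particular container is an assumption.
GPU_ENV_VARS = ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES")


def gpu_visibility_settings(env=None):
    """Return the GPU-visibility variables found in `env` (default: os.environ).

    Variables that are not set map to None.
    """
    env = os.environ if env is None else env
    return {name: env.get(name) for name in GPU_ENV_VARS}


def count_listed_devices(value):
    """Count entries in a comma-separated device list; None if the variable is unset.

    Only meaningful for lists of indices or UUIDs; a special value such as
    "all" is simply counted as one entry.
    """
    if value is None:
        return None
    return len([item for item in value.split(",") if item.strip()])


if __name__ == "__main__":
    for name, value in gpu_visibility_settings().items():
        print(name, "=", repr(value), "->", count_listed_devices(value), "entries")
```

If CUDA_VISIBLE_DEVICES turns out to contain a single index, that would be consistent with device_count() returning 1 and with the "invalid device ordinal" error for any index other than 0. In that case the variable would need to be unset or widened (for example CUDA_VISIBLE_DEVICES=0,1,2) before Python starts.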