[UPDATE]
The solution was to restart the machine. All 4 GPUs are visible again and torch is working. Docker is still not working, but that is no longer relevant.
Hi,
I am working on a machine with 4 GPUs, and one of them started burning out. As a short-term workaround, until I can physically remove the broken GPU, I disabled it with nvidia-smi drain mode as described in this solution. Now when I call nvidia-smi I see 3 GPUs instead of 4.
I then set up TensorFlow experiments on GPUs 0 and 1, and they run and use the GPUs.
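For reference, a minimal sketch of how visibility can be checked on the TensorFlow side (assuming TF 2.x; the API differs in TF 1.x):

```python
import tensorflow as tf

# After draining one GPU, TensorFlow should enumerate only the healthy devices.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    print(gpu)
print(f"visible GPUs: {len(gpus)}")  # expected: 3 on this machine
```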
But when I start a torch training run and call `model.to('cuda:0')`, I get:
```
    model.to('cuda:0')
  File "/ssd_sdb3/anaconda3/envs/py37gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 612, in to
    return self._apply(convert)
  File "/ssd_sdb3/anaconda3/envs/py37gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
    module._apply(fn)
  File "/ssd_sdb3/anaconda3/envs/py37gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
    module._apply(fn)
  File "/ssd_sdb3/anaconda3/envs/py37gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in _apply
    param_applied = fn(param)
  File "/ssd_sdb3/anaconda3/envs/py37gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 610, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/ssd_sdb3/anaconda3/envs/py37gpu/lib/python3.7/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal
```
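Judging by the traceback, the failure comes from torch's lazy CUDA initialization (`torch._C._cuda_init()`) rather than from `model.to()` itself, so any first CUDA call should trigger it. A minimal repro sketch (assuming the same conda env):

```python
import torch

# Force the lazy CUDA initialization that model.to('cuda:0') would otherwise trigger.
try:
    torch.cuda.init()
except RuntimeError as e:
    # Expected on this machine:
    # "Unexpected error from cudaGetDeviceCount() ... Error 101: invalid device ordinal"
    print(e)
```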
I reinstalled pytorch and cuda with `conda uninstall` and then `conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch` from the official torch website. Nevertheless, I see the same behavior as before the reinstall.
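For completeness, a quick sketch to check which build the environment actually imports (the expected values are assumptions based on that conda command, not verified output):

```python
import torch

print(torch.__version__)   # the conda-installed build
print(torch.version.cuda)  # expected "10.2" for cudatoolkit=10.2
```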
python -c "import torch; print(torch.cuda.is_available())"
/ssd_sdb3/anaconda3/envs/py37gpu/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal (Triggered internally at /opt/conda/conda-bld/pytorch_1607370128159/work/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
False
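A possible workaround (only a sketch, untested here; the index of the drained GPU below is an assumption) would be to hide the broken GPU from the CUDA runtime before torch initializes CUDA:

```python
import os

# These must be set before the first CUDA call, i.e. before anything in the
# process triggers torch's CUDA initialization.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # make ordinals match nvidia-smi
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3"    # assumption: GPU 2 is the drained one

import torch

print(torch.cuda.is_available())  # should be True if only healthy GPUs are visible
print(torch.cuda.device_count())  # should report 3
```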
Additional info: I tried to run pytorch in docker, but got an error that might point in the same direction:
```
$ docker run --rm -it --gpus device=2 pytorch/pytorch:1.8.0-cuda11.1-cudnn8-runtime
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
$ docker run --rm -it --gpus all pytorch/pytorch:1.8.0-cuda11.1-cudnn8-runtime
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
```