[SOLVED] Torch can't access cuda, RuntimeError: Unexpected error from cudaGetDeviceCount(). and Error 101: invalid device ordinal

[UPDATE]
The solution was to restart the machine. All 4 GPUs are visible again and torch is working. Docker is still not working, but that is no longer relevant.


Hi,

I am working on a machine with 4 GPUs, and one of them has started to burn out. As a short-term workaround, until I can physically remove the broken GPU, I disabled it with nvidia-smi drain mode, as described in this solution. Now when I call nvidia-smi I see 3 GPUs instead of 4.

I set up tensorflow experiments on GPUs 0 and 1, and they run and use the GPUs.

When I try to run torch training and call model.to('cuda:0'), I get:

 model.to('cuda:0')
  File "/ssd_sdb3/anaconda3/envs/py37gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 612, in to
    return self._apply(convert)
  File "/ssd_sdb3/anaconda3/envs/py37gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
    module._apply(fn)
  File "/ssd_sdb3/anaconda3/envs/py37gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
    module._apply(fn)
  File "/ssd_sdb3/anaconda3/envs/py37gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in _apply
    param_applied = fn(param)
  File "/ssd_sdb3/anaconda3/envs/py37gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 610, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/ssd_sdb3/anaconda3/envs/py37gpu/lib/python3.7/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal

I reinstalled pytorch and cuda with conda uninstall followed by conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch, the command given on the official torch website.

Nevertheless, I see the same behavior as before the reinstall:

python -c "import torch; print(torch.cuda.is_available())"
/ssd_sdb3/anaconda3/envs/py37gpu/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal (Triggered internally at  /opt/conda/conda-bld/pytorch_1607370128159/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
False
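
For anyone hitting the same mismatch (nvidia-smi lists GPUs, but torch cannot initialize CUDA), here is a rough diagnostic sketch that puts both views side by side. It is not part of my original setup: the helper names (parse_gpu_list, driver_visible_gpus, torch_cuda_report) are made up for illustration, and it assumes nvidia-smi is on the PATH. It only reports what each side sees; it does not fix anything.

```python
import subprocess
import warnings


def parse_gpu_list(csv_text):
    """Parse the output of
    `nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv,noheader`.
    (Hypothetical helper, named here for illustration.)"""
    gpus = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip anything that is not an "index, name, bus_id" row,
        # e.g. a "No devices were found" message.
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        gpus.append({"index": int(parts[0]), "name": parts[1], "bus_id": parts[2]})
    return gpus


def driver_visible_gpus():
    """Ask the driver (through nvidia-smi) which GPUs it currently exposes."""
    cmd = ["nvidia-smi", "--query-gpu=index,name,pci.bus_id", "--format=csv,noheader"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return []  # nvidia-smi missing or failing
    return parse_gpu_list(result.stdout)


def torch_cuda_report():
    """Return (available, device_count, messages) as seen by torch."""
    try:
        import torch
    except ImportError as exc:
        return False, 0, ["torch not importable: {}".format(exc)]
    # is_available() reports initialization problems as warnings, so record them.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        available = torch.cuda.is_available()
        count = torch.cuda.device_count() if available else 0
    return available, count, [str(w.message) for w in caught]


if __name__ == "__main__":
    for gpu in driver_visible_gpus():
        print("driver sees:", gpu)
    available, count, messages = torch_cuda_report()
    print("torch: available={}, device_count={}".format(available, count))
    for message in messages:
        print("torch warning:", message)
```

If the driver side lists GPUs while torch reports available=False with the cudaGetDeviceCount() warning, the problem is below the pytorch install, which would be consistent with the reinstall not changing anything.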

Additional info: I tried to run pytorch in docker, but got an error that might point in the same direction:

$ docker run --rm -it --gpus device=2 pytorch/pytorch:1.8.0-cuda11.1-cudnn8-runtime
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
$ docker run --rm -it --gpus all pytorch/pytorch:1.8.0-cuda11.1-cudnn8-runtime
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
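
A note on the docker error: this message is commonly reported when the NVIDIA Container Toolkit is not installed on the host, which would make it unrelated to the drained GPU. That is an assumption about this machine, not something I verified. A minimal sketch to check whether the toolkit's binaries are on the PATH (the function names are hypothetical, and the binary list reflects the names these packages usually install):

```python
import shutil

# Executables typically shipped with the NVIDIA Container Toolkit packages.
# Assumption: at least one of them is on PATH when the toolkit is installed.
TOOLKIT_BINARIES = (
    "nvidia-container-cli",
    "nvidia-container-runtime",
    "nvidia-container-runtime-hook",
)


def find_toolkit_binaries(names=TOOLKIT_BINARIES):
    """Map each expected binary name to its resolved path, or None if absent."""
    return {name: shutil.which(name) for name in names}


def toolkit_looks_installed(found):
    """True if at least one toolkit binary was located."""
    return any(path is not None for path in found.values())


if __name__ == "__main__":
    found = find_toolkit_binaries()
    for name, path in found.items():
        print("{}: {}".format(name, path or "not found"))
    print("toolkit looks installed:", toolkit_looks_installed(found))
```

If none of the binaries are found, installing the toolkit and restarting the docker daemon is the usual next step before looking at the GPUs themselves.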