Hi,
I built my Docker image from the PyTorch image pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel.
My server:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.04 Driver Version: 455.23.04 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:3B:00.0 Off | N/A |
| 0% 32C P0 101W / 350W | 0MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3090 Off | 00000000:AF:00.0 Off | N/A |
| 30% 32C P0 67W / 350W | 0MiB / 24268MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I’ve already installed the NVIDIA Container Toolkit and restarted Docker with sudo systemctl restart docker.
I can run nvidia-smi inside a Docker container, but when I try
sudo docker run --rm --gpus all khoa/pytorch:1.7 python -c 'import torch as t; print(t.cuda.is_available()); print(t.backends.cudnn.enabled)'
cuda.is_available() returns False,
while backends.cudnn.enabled returns True:
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
False
True
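For reference, a slightly fuller check than is_available() alone can show whether the installed PyTorch build includes CUDA support at all. This is just a minimal diagnostic sketch; it assumes nothing beyond a standard PyTorch install and degrades gracefully if torch is missing:

```python
# Minimal CUDA diagnostic sketch: collect what PyTorch reports
# about its CUDA build and the visible devices.
report = []
try:
    import torch
    report.append(f"torch version: {torch.__version__}")
    # torch.version.cuda is None if this build has no CUDA support
    report.append(f"built with CUDA: {torch.version.cuda}")
    report.append(f"cuda available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        report.append(f"device count: {torch.cuda.device_count()}")
except ImportError:
    report.append("torch not installed")
report.append("diagnostic done")
print("\n".join(report))
```

If "built with CUDA" shows a version but is_available() is False, the problem is on the driver/runtime side rather than in the PyTorch wheel.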
Can anyone help me?
I can’t run my code because of this.
The container is working for me, so I guess your Docker setup isn’t working properly.
Are you able to run any other container shipped with CUDA applications?
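For example, something like the following can isolate whether the NVIDIA Container Toolkit itself is passing the GPU through, independent of PyTorch. A sketch only: it assumes the stock nvidia/cuda:11.0-base image is pullable and skips gracefully if docker isn’t on PATH:

```shell
# Sanity-check GPU passthrough with a plain CUDA base image,
# independent of any PyTorch layer.
status="unknown"
if command -v docker >/dev/null 2>&1; then
  if docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi; then
    status="ok"
  else
    status="failed"
  fi
else
  status="no-docker"
fi
echo "cuda container check: $status"
```

If this fails too, the issue is in the host driver / container toolkit layer rather than in your PyTorch image.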
pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel also doesn’t work.
sudo docker run --rm --gpus all pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel python -c 'import torch as t; print(t.cuda.is_available())'
False
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
So weird: my other server (server B) has the same configuration and can run my image.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.04 Driver Version: 455.23.04 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:3B:00.0 Off | N/A |
| 0% 38C P8 34W / 350W | 0MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3090 Off | 00000000:AF:00.0 Off | N/A |
| 0% 32C P8 34W / 350W | 0MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Meanwhile, server A has this problem after I rebooted it.
I also think it’s because of Docker, but I’ve tried sudo systemctl restart docker
and it still doesn’t work.
Here is the difference in the sudo docker info output on server A:
Following this guide, and since I’m using nvidia-docker2, I removed the nvidia runtime entry by deleting /etc/docker/daemon.json and rebooted server A.
But it still doesn’t work.
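For anyone who wants to restore (rather than remove) the nvidia runtime, the /etc/docker/daemon.json that the standard nvidia-docker2 setup installs typically looks like this; Docker needs a restart (sudo systemctl restart docker) after editing it:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```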
This problem happened because:
while training, a bug caused a division by zero, so my server froze/crashed.
So I Ctrl+C’d the task and rebooted the server.
Now, cuda.is_available() always returns False.
My best guess would be that (unwanted) updates might have been executed, which wiped the NVIDIA driver on your system after the restart. If that’s the case, you would have to reinstall them and recheck your container.
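To check whether the driver was indeed wiped, a couple of quick host-side checks (a hedged sketch; it assumes a Linux host and only reads /proc/modules, so it is safe to run anywhere) can confirm before reinstalling:

```shell
# Check whether the NVIDIA userspace tool and kernel module survived the reboot.
driver_tool="missing"
command -v nvidia-smi >/dev/null 2>&1 && driver_tool="present"

module_loaded="no"
grep -q '^nvidia' /proc/modules 2>/dev/null && module_loaded="yes"

echo "nvidia-smi binary: $driver_tool"
echo "nvidia kernel module loaded: $module_loaded"
```

If the binary is present but the module is not loaded, the kernel was likely updated and the driver module needs to be rebuilt or reinstalled for the new kernel.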