Hi,
I built my Docker image from the PyTorch image pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel.
My server:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.04 Driver Version: 455.23.04 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:3B:00.0 Off | N/A |
| 0% 32C P0 101W / 350W | 0MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3090 Off | 00000000:AF:00.0 Off | N/A |
| 30% 32C P0 67W / 350W | 0MiB / 24268MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I’ve already installed the NVIDIA Container Toolkit and restarted Docker with sudo systemctl restart docker.
I can run nvidia-smi inside a Docker container, but when I try
sudo docker run --rm --gpus all khoa/pytorch:1.7 python -c 'import torch as t; print(t.cuda.is_available()); print(t.backends.cudnn.enabled)'
cuda.is_available() returns False,
while backends.cudnn.enabled returns True:
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
False
True
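For reference, a slightly fuller check than is_available() alone can show whether the installed PyTorch build includes CUDA support at all. This is just a minimal diagnostic sketch; it assumes nothing beyond a standard PyTorch install and degrades gracefully if torch is missing:

```python
# Minimal CUDA diagnostic sketch: collect what PyTorch reports
# about its CUDA build and the visible devices.
report = []
try:
    import torch
    report.append(f"torch version: {torch.__version__}")
    # torch.version.cuda is None if this build has no CUDA support
    report.append(f"built with CUDA: {torch.version.cuda}")
    report.append(f"cuda available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        report.append(f"device count: {torch.cuda.device_count()}")
except ImportError:
    report.append("torch not installed")
report.append("diagnostic done")
print("\n".join(report))
```

If "built with CUDA" shows a version but is_available() is False, the problem is on the driver/runtime side rather than in the PyTorch wheel.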
Can anyone help me?
I can’t run my code because of this.
The container is working for me, so I guess your Docker setup isn’t working properly.
Are you able to run any other container shipped with CUDA applications?
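For example, something like the following can isolate whether the NVIDIA Container Toolkit itself is passing the GPU through, independent of PyTorch. A sketch only: it assumes the stock nvidia/cuda:11.0-base image is pullable and skips gracefully if docker isn’t on PATH:

```shell
# Sanity-check GPU passthrough with a plain CUDA base image,
# independent of any PyTorch layer.
status="unknown"
if command -v docker >/dev/null 2>&1; then
  if docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi; then
    status="ok"
  else
    status="failed"
  fi
else
  status="no-docker"
fi
echo "cuda container check: $status"
```

If this fails too, the issue is in the host driver / container toolkit layer rather than in your PyTorch image.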
pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel also doesn’t work.
sudo docker run --rm --gpus all pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel python -c 'import torch as t; print(t.cuda.is_available())'
False
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
So weird: my other server (server B) has the same configuration and can run my image.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.04 Driver Version: 455.23.04 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:3B:00.0 Off | N/A |
| 0% 38C P8 34W / 350W | 0MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3090 Off | 00000000:AF:00.0 Off | N/A |
| 0% 32C P8 34W / 350W | 0MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Meanwhile, server A has this problem after I rebooted it.
I also think it’s because of Docker, but I’ve tried sudo systemctl restart docker
and it still doesn’t work.
Here is the difference in the sudo docker info output on server A:
Following this guide, and since I’m using nvidia-docker2, I removed the nvidia runtime entry by deleting /etc/docker/daemon.json and rebooted server A.
But it still doesn’t work.
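For anyone who wants to restore (rather than remove) the nvidia runtime, the /etc/docker/daemon.json that the standard nvidia-docker2 setup installs typically looks like this; Docker needs a restart (sudo systemctl restart docker) after editing it:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```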
This problem happened because:
while training, a bug caused a division by zero, so my server froze/crashed.
So I Ctrl+C’d the task and rebooted the server.
Now, cuda.is_available() always returns False.
My best guess would be that (unwanted) updates might have been executed, which wiped the NVIDIA driver on your system after the restart. If that’s the case, you would have to reinstall them and recheck your container.
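To check whether the driver was indeed wiped, a couple of quick host-side checks (a hedged sketch; it assumes a Linux host and only reads /proc/modules, so it is safe to run anywhere) can confirm before reinstalling:

```shell
# Check whether the NVIDIA userspace tool and kernel module survived the reboot.
driver_tool="missing"
command -v nvidia-smi >/dev/null 2>&1 && driver_tool="present"

module_loaded="no"
grep -q '^nvidia' /proc/modules 2>/dev/null && module_loaded="yes"

echo "nvidia-smi binary: $driver_tool"
echo "nvidia kernel module loaded: $module_loaded"
```

If the binary is present but the module is not loaded, the kernel was likely updated and the driver module needs to be rebuilt or reinstalled for the new kernel.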