Docker: torch.cuda.is_available() returns False

> python3 -c "import torch; print(torch.cuda.is_available())"

False

If it matters, I installed PyTorch in the container using pip3, following the instructions here: https://pytorch.org/get-started/locally/
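
For reference, a slightly more verbose version of the same check (a minimal sketch using only standard torch calls; nothing here is specific to my setup):

import torch

# Show which CUDA toolkit this torch build was compiled against, then
# force runtime initialization so any underlying error actually surfaces.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("is_available:", torch.cuda.is_available())
try:
    torch.cuda.init()
    print("device_count:", torch.cuda.device_count())
except Exception as e:
    print("CUDA init failed:", e)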

> nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

> nvidia-smi

Thu Jun  6 20:10:41 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   35C    P8    21W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   36C    P8    20W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The two commands above produce identical output on the host machine and in my custom-built container.

The only command whose output differs is deviceQuery from the CUDA samples. Yes, I rebooted the host; no luck. See below.

> ./bin/x86_64/linux/release/deviceQuery --------> [inside container]

./bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL                   

> ./bin/x86_64/linux/release/deviceQuery | tail -n 1 --------> [on host]

Result = PASS                 
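
For what it's worth, the failing call can be reproduced without building the samples. A sketch of the same check via ctypes (the libcudart path is an assumption, adjust it to wherever the toolkit lives in your container; error 30 is cudaErrorUnknown in CUDA 10.0):

import ctypes

# Load the CUDA runtime shipped with the toolkit and ask for the device
# count, i.e. the first call deviceQuery makes.
cudart = ctypes.CDLL("/usr/local/cuda-10.0/lib64/libcudart.so")
count = ctypes.c_int()
err = cudart.cudaGetDeviceCount(ctypes.byref(count))
print("cudaGetDeviceCount returned", err, "- devices:", count.value)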

I run docker as follows:
> docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -it -v /home/containers/pytorch/:/home/pytorch/ pytorch:custom_build
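
One thing also worth checking inside a container started this way is whether the nvidia runtime actually injected the driver's user-space libraries; without them every CUDA runtime call fails. A small sketch (the library paths are assumptions based on the usual Ubuntu layout):

import glob, os

# The nvidia runtime normally bind-mounts libcuda / libnvidia-ml from the
# host driver into the container; "missing" here would explain the failure.
for pattern in ("/usr/lib/x86_64-linux-gnu/libcuda.so*",
                "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so*"):
    print(pattern, "->", glob.glob(pattern) or "missing")
print("NVIDIA_VISIBLE_DEVICES =", os.environ.get("NVIDIA_VISIBLE_DEVICES"))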

The problem seems to be isolated to containers I build myself. If a CUDA image is pulled from the official NVIDIA Docker repository, everything works:

docker run --runtime=nvidia --rm -it nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04
apt-get update && apt-get install -y cuda-samples-10-0
cd /usr/local/cuda-10.0/samples/1_Utilities/deviceQuery
make
./deviceQuery

does in fact work.

If anyone wants to recreate my problem:

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all --rm -it ubuntu
apt update; apt install openssh-client
scp <username@ipaddress:/path/to/cuda10.0.deb/> .
dpkg -i <cuda10.0.deb>
apt install gnupg
apt-key add /var/<cuda10.0/key.pub>
apt update
apt install cuda
cd /usr/local/cuda-10.0/samples/1_Utilities/deviceQuery
make
./deviceQuery

Replace the placeholder filenames and paths in angle brackets with your own.

I used scp because I had the CUDA .deb file saved on a remote machine and didn't want to mount extra volumes here.