Nvidia-smi is OK. torch.cuda.is_available() FAILS in Docker

Poke · October 8, 2022, 10:31pm

Setup: Windows 11 with Docker, WSL2, and an Ubuntu 20.04 Docker image.
So nvidia-smi works:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.76.02    Driver Version: 517.48       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
and all the usual info here.

But torch fails with this error:
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:109.)
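
For anyone trying to reproduce this, here is a minimal sketch of the check that triggers the warning. The helper name `cuda_status` is hypothetical, and the script falls back gracefully when torch is not installed; it only wraps `torch.cuda.is_available()` and records whatever warning is raised during CUDA initialization.

```python
import importlib.util
import warnings


def cuda_status():
    """Return (available, warning_messages).

    `available` is None when torch is not installed in this environment.
    """
    if importlib.util.find_spec("torch") is None:
        return None, ["torch is not installed in this environment"]

    import torch

    with warnings.catch_warnings(record=True) as caught:
        # Record every warning so the CUDA initialization message is not filtered out.
        warnings.simplefilter("always")
        available = torch.cuda.is_available()

    return available, [str(w.message) for w in caught]


if __name__ == "__main__":
    available, messages = cuda_status()
    print("torch.cuda.is_available():", available)
    for message in messages:
        print("warning:", message)
```

Running it inside the container and again directly in the WSL2 distribution should show whether the failure is specific to Docker.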

I tried the latest nightly built with CUDA 11.7:
python collect_env.py
Collecting environment information…
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:88: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.14.0.dev20221008+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: 11.7.99
GPU models and configuration:
All cards listed here

Nvidia driver version: 517.48
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.3
[pip3] torch==1.14.0.dev20221008+cu117
[pip3] torchaudio==0.13.0.dev20221006+cu117
[pip3] torchvision==0.15.0.dev20221008+cu117
[conda] Could not collect

I also tried the latest PyTorch release built with CUDA 11.6:
python collect_env.py
Collecting environment information…
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: 11.7.99
GPU models and configuration:
All cards listed here

Nvidia driver version: 517.48
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.3
[pip3] torch==1.12.1+cu116
[pip3] torchaudio==0.12.1+cu116
[pip3] torchvision==0.13.1+cu116
[conda] Could not collect

I also reinstalled the drivers on the host, and I’m out of ideas.
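
One way to narrow this down might be to bypass PyTorch and call the CUDA driver API directly. The sketch below loads `libcuda.so.1` with ctypes and calls `cuInit` and `cuDeviceGetCount`; the helper name `driver_device_count` and the default library name are assumptions for illustration. If the driver itself returns a nonzero code here, the problem is likely below PyTorch, in the driver or container setup.

```python
import ctypes


def driver_device_count(lib_name="libcuda.so.1"):
    """Query the CUDA driver API directly, without PyTorch.

    Returns a dict with the raw CUresult codes. 0 is CUDA_SUCCESS and
    2 is CUDA_ERROR_OUT_OF_MEMORY.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        # The driver library is not visible in this environment.
        return {"loaded": False, "error": str(exc)}

    init_rc = libcuda.cuInit(0)  # cuInit takes a flags argument, which must be 0
    result = {"loaded": True, "cuInit": init_rc, "count": None}

    if init_rc == 0:
        count = ctypes.c_int(0)
        result["cuDeviceGetCount"] = libcuda.cuDeviceGetCount(ctypes.byref(count))
        result["count"] = count.value

    return result


if __name__ == "__main__":
    print(driver_device_count())
```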

ptrblck · October 24, 2022, 11:59pm

Could you check if the right drivers are installed for WSL2 and docker as described here, please?
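
As a rough first pass before going through the guide, a few things can be checked from inside the container. This is only a sketch under assumptions: the paths `/dev/dxg` and `/usr/lib/wsl/lib/libcuda.so.1` are where WSL2 typically exposes the GPU device and the driver library, `NVIDIA_VISIBLE_DEVICES` is the variable the NVIDIA container runtime uses, and the helper name `wsl_gpu_checks` is hypothetical. It does not replace the linked instructions.

```python
import os
import platform
import shutil


def wsl_gpu_checks():
    """Collect a few hints about GPU visibility in a WSL2-backed container."""
    return {
        # The kernel release contains "microsoft" on WSL2, as in the collect_env output.
        "WSL2 kernel": "microsoft" in platform.release().lower(),
        # Assumed location of the GPU paravirtualization device under WSL2.
        "/dev/dxg present": os.path.exists("/dev/dxg"),
        # Assumed location of the driver library that WSL2 mounts into the distribution.
        "WSL libcuda present": os.path.exists("/usr/lib/wsl/lib/libcuda.so.1"),
        "nvidia-smi on PATH": shutil.which("nvidia-smi") is not None,
        # None means the variable is not set in this environment.
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
    }


if __name__ == "__main__":
    for name, value in wsl_gpu_checks().items():
        print(f"{name}: {value}")
```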

Poke · October 25, 2022, 3:00pm

I reinstalled everything a few times, in these combinations:
Latest CUDA + latest drivers
Older CUDA + latest drivers
Older CUDA + older drivers
I am sure I am doing something wrong, but I don’t know what.