PyTorch does not detect CUDA when run inside a miniconda Docker container

What additional libraries/steps do I need to include in my Dockerfile so that CUDA is recognized inside the container?

I tested the following on an AWS g3.4xlarge EC2 instance, using AMI ami-0e06eafbb1f01c15a (which comes with CUDA, cuDNN, Docker, and nvidia-docker already set up).

PyTorch running inside the container cannot detect the CUDA libraries when the miniconda base image is used.

docker run --rm --gpus all -it continuumio/miniconda3 /bin/bash

# within the container
$ conda create -n pytorch
$ conda activate pytorch

# installation command from https://pytorch.org/get-started/locally/
$ conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

# CUDA is not available
$ python -m torch.utils.collect_env

Collecting environment information...
PyTorch version: 1.11.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.9.12 (main, Apr  5 2022, 06:56:58)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.181-99.354.amzn2.x86_64-x86_64-with-glibc2.31
Is CUDA available: False
~~~~~~~~~~~~~~~~~~~~~~~~~
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla M60
Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.5
[pip3] torch==1.11.0
[pip3] torchaudio==0.11.0
[pip3] torchvision==0.12.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.2.89              hfd86e86_1  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0            py39h7f8727e_0  
[conda] mkl_fft                   1.3.1            py39hd3c417c_0  
[conda] mkl_random                1.2.2            py39h51133e4_0  
[conda] numpy                     1.21.5           py39he7a7128_2  
[conda] numpy-base                1.21.5           py39hf524024_2  
[conda] pytorch                   1.11.0          py3.9_cuda10.2_cudnn7.6.5_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                0.11.0               py39_cu102    pytorch
[conda] torchvision               0.12.0               py39_cu102    pytorch
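
For a quicker check than the full collect_env dump, the same failure shows up with a one-liner (run inside the activated pytorch env):

$ python -c "import torch; print(torch.cuda.is_available())"
# prints False, matching the "Is CUDA available: False" line above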

nvidia-smi result from the host:

Wed May 11 22:00:20 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P8    16W / 150W |      0MiB /  7618MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvidia-smi run within the container shows the CUDA version as N/A:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P8    14W / 150W |      0MiB /  7618MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
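
Since nvidia-smi itself works here but reports the CUDA version as N/A, one diagnostic that seems worth running inside the miniconda container (just an idea for narrowing this down; I have not confirmed what it shows) is whether the driver's libcuda is visible to the linker at all:

# inside the miniconda container
$ ldconfig -p | grep libcuda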

On the other hand, if I use the official nvidia/cuda Docker image and install conda and PyTorch following the same steps as above, CUDA is detected normally:

$ docker pull nvidia/cuda:11.6.0-runtime-ubuntu20.04

$ docker run --rm --gpus all -it nvidia/cuda:11.6.0-runtime-ubuntu20.04 /bin/bash

# within the container

# install miniconda 
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/miniconda

$ /opt/miniconda/bin/conda init

$ exec bash
$ conda create -n pytorch
$ conda activate pytorch

$ conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

# check if cuda is available
$ python -m torch.utils.collect_env

Collecting environment information...
PyTorch version: 1.11.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.9.12 (main, Apr  5 2022, 06:56:58)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.181-99.354.amzn2.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
~~~~~~~~~~~~~~~~~~~~~~~~
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla M60
Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.5
[pip3] torch==1.11.0
[pip3] torchaudio==0.11.0
[pip3] torchvision==0.12.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.2.89              hfd86e86_1  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0            py39h7f8727e_0  
[conda] mkl_fft                   1.3.1            py39hd3c417c_0  
[conda] mkl_random                1.2.2            py39h51133e4_0  
[conda] numpy                     1.21.5           py39he7a7128_2  
[conda] numpy-base                1.21.5           py39hf524024_2  
[conda] pytorch                   1.11.0          py3.9_cuda10.2_cudnn7.6.5_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch

This time, nvidia-smi run within the container does report the CUDA version.
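
As an extra sanity check in this working container (assuming the pytorch env from above is still activated):

$ python -c "import torch; print(torch.cuda.get_device_name(0))"
# prints "Tesla M60", matching the collect_env output above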

However, my understanding is that when PyTorch is installed via conda it only requires a valid NVIDIA driver on the host, since conda installs cudatoolkit and cuDNN alongside PyTorch.
The only difference between the two scenarios above is whether the base image ships with a CUDA/cuDNN installation.
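
One way to look for other differences between the two base images (a diagnostic sketch; I have not analyzed the output in depth) is to compare the environment variables each image sets:

$ docker inspect --format '{{json .Config.Env}}' continuumio/miniconda3
$ docker inspect --format '{{json .Config.Env}}' nvidia/cuda:11.6.0-runtime-ubuntu20.04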

My question: if I want to use PyTorch with CUDA and write my own Dockerfile, do I have to extend NVIDIA's base image in order for CUDA to work?
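
For concreteness, this is the kind of Dockerfile I would write if extending NVIDIA's image turns out to be necessary. It is only a sketch of the manual steps above (it installs into conda's base env rather than a named env, and adds -y flags for non-interactive builds):

FROM nvidia/cuda:11.6.0-runtime-ubuntu20.04

# wget is needed to fetch the miniconda installer
RUN apt-get update && apt-get install -y wget && rm -rf /var/lib/apt/lists/*

# install miniconda, as in the interactive session above
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/miniconda && \
    rm Miniconda3-latest-Linux-x86_64.sh
ENV PATH=/opt/miniconda/bin:$PATH

# installation command from https://pytorch.org/get-started/locally/
RUN conda install -y pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch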