PyTorch doesn't recognize CUDA (CUDA 11.7)

Hi,

I am trying to install CUDA 11.7 inside a Docker container (based on the NVIDIA base image nvidia/cuda:11.7.0-devel-ubuntu20.04) and install PyTorch 1.12 on top of it.

The Docker build succeeds, but when I print torch.cuda.is_available() it returns False. As a result, packages I build afterwards tend to fail (e.g. the apex installation reports "No CUDA runtime found" even though CUDA 11.7 seems to be installed).

Here is some more diagnostic information I printed out during the docker build:

 ---> Running in 810601a714ed
print /usr/local dir:
total 4
drwxr-xr-x 1 root root 4096 Nov  7 17:11 bin
lrwxrwxrwx 1 root root   20 Nov  7 17:11 cuda -> /usr/local/cuda-11.7
lrwxrwxrwx 1 root root   25 Aug 12 00:21 cuda-11 -> /etc/alternatives/cuda-11
drwxr-xr-x 1 root root  130 Aug 12 00:32 cuda-11.7
drwxr-xr-x 2 root root    6 Aug  1 13:22 etc
drwxr-xr-x 2 root root    6 Aug  1 13:22 games
drwxr-xr-x 2 root root    6 Aug  1 13:22 include
drwxr-xr-x 1 root root   23 Nov  7 17:09 lib
lrwxrwxrwx 1 root root    9 Aug  1 13:22 man -> share/man
lrwxrwxrwx 1 root root   24 Nov  7 17:08 mpi -> /usr/local/openmpi-4.0.1
drwxr-xr-x 1 root root   17 Nov  7 17:07 openmpi-4.0.1
drwxr-xr-x 2 root root   24 Aug  1 13:25 sbin
drwxr-xr-x 1 root root   17 Nov  7 17:03 share
drwxr-xr-x 2 root root    6 Aug  1 13:22 src
Removing intermediate container 810601a714ed


Step 43/64 : RUN echo "nvcc version: " && nvcc --version
 ---> Running in 41fd47d4089a
nvcc version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0


Step 58/64 : RUN echo "print pytorch version: " && python3 -c "import torch; print(torch.__version__)"
 ---> Running in 63ae4acaf382
print pytorch version:
1.12.0a0+git664058f
Removing intermediate container 63ae4acaf382
 ---> 3099c4e3f35d
Step 59/64 : RUN echo "print pytorch nccl version: " && python3 -c "import torch; print(torch.cuda.nccl.version())"
 ---> Running in ab1d49d02685
print pytorch nccl version:
(2, 13, 4)
Removing intermediate container ab1d49d02685
 ---> 393a7ad5f55a
Step 60/64 : RUN echo "print pytorch cuda version: " && python3 -c "import torch; print(torch.version.cuda)"
 ---> Running in df47a3c0d236
print pytorch cuda version:
11.7
Removing intermediate container df47a3c0d236
 ---> ebe9529fce33
Step 61/64 : RUN echo "print torch.cuda.is_available? " && python3 -c "import torch; print(torch.cuda.is_available())"
 ---> Running in 6be9b3a6ff33
print torch.cuda.is_available?
False
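
For anyone who wants to reproduce this, the separate checks above can be collected into one script and run both as a build step and inside a running container. This is only a rough sketch: `cuda_report` is a name made up for illustration, and the `nvidia-smi` lookup is just a hint about whether the driver is visible, not a definitive test.

```python
import importlib.util
import shutil


def cuda_report():
    """Collect the CUDA-related facts printed in the build log into one dict."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    # nvidia-smi ships with the driver, not the toolkit, so its absence is
    # only a hint that the driver is not visible in this environment.
    report["nvidia_smi_on_path"] = shutil.which("nvidia-smi") is not None
    report["nvcc_on_path"] = shutil.which("nvcc") is not None
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    report["torch_cuda_version"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Comparing the output from a build step with the output from a started container should show whether the result depends on where the check runs.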

I install PyTorch as follows (and also add /usr/local/cuda/bin to PATH):

    cd ${STAGE_DIR}/pytorch && git checkout v${PYTORCH_VERSION} && \
    git submodule sync && git submodule update --init --recursive && \
    export CUDA_HOME=/usr/local/cuda-11.7 && \
    export PATH=/usr/local/cuda/bin${PATH:+:${PATH}} && \
    sudo rm -rf build && \
    sudo NCCL_ROOT=${STAGE_DIR}/nccl/build \
         NCCL_INCLUDE_DIR=${STAGE_DIR}/nccl/build/include \
         NCCL_LIB_DIR=${STAGE_DIR}/nccl/build/lib \
         CMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
         CMAKE_CUDA_ARCHITECTURES=all \
         USE_SYSTEM_NCCL=1 \
         python3 setup.py install
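
Since the command sets CUDA_HOME to /usr/local/cuda-11.7 but puts /usr/local/cuda/bin on PATH, a quick way to confirm that both point at the same toolkit might look like the sketch below. The helper name `nvcc_matches_cuda_home` is hypothetical, and it assumes nvcc is a regular file inside the toolkit's bin directory.

```python
import os
import shutil


def nvcc_matches_cuda_home(cuda_home, nvcc_path):
    """Return True if nvcc_path resolves to a file directly under cuda_home/bin."""
    if not cuda_home or not nvcc_path:
        return False
    # Resolve symlinks such as /usr/local/cuda -> /usr/local/cuda-11.7
    real_bin = os.path.realpath(os.path.join(cuda_home, "bin"))
    real_nvcc_dir = os.path.dirname(os.path.realpath(nvcc_path))
    return real_bin == real_nvcc_dir


if __name__ == "__main__":
    cuda_home = os.environ.get("CUDA_HOME")
    nvcc = shutil.which("nvcc")
    print("CUDA_HOME:", cuda_home)
    print("nvcc:", nvcc)
    print("consistent:", nvcc_matches_cuda_home(cuda_home, nvcc))
```

If the build runs under sudo, it is worth running this the same way, since sudo is often configured to reset environment variables.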

Has anybody faced this issue, or does anyone have suggestions on how to resolve it?

@ptrblck I saw you replied to some related issues in the past, any pointers?

I also have the same issue, but on Windows 10.

@params and @mohamed.samy.2248369, this NVIDIA Docker image is working for me:

FROM nvidia/cuda:11.0.3-base-ubuntu20.04

Can you try this and let me know?