Hi,
I am trying to build a Docker image with CUDA 11.7 (based on the NVIDIA base image nvidia/cuda:11.7.0-devel-ubuntu20.04) and install PyTorch 1.12 on top of it.
The docker build succeeds, but when I print torch.cuda.is_available() during the build, it returns False. As a result, packages I build later tend to fail (e.g. the apex installation reports "No CUDA runtime found" even though CUDA 11.7 seems to be installed).
Here is some more diagnostic information I printed out during the docker build:
---> Running in 810601a714ed
print /usr/local dir:
total 4
drwxr-xr-x 1 root root 4096 Nov 7 17:11 bin
lrwxrwxrwx 1 root root 20 Nov 7 17:11 cuda -> /usr/local/cuda-11.7
lrwxrwxrwx 1 root root 25 Aug 12 00:21 cuda-11 -> /etc/alternatives/cuda-11
drwxr-xr-x 1 root root 130 Aug 12 00:32 cuda-11.7
drwxr-xr-x 2 root root 6 Aug 1 13:22 etc
drwxr-xr-x 2 root root 6 Aug 1 13:22 games
drwxr-xr-x 2 root root 6 Aug 1 13:22 include
drwxr-xr-x 1 root root 23 Nov 7 17:09 lib
lrwxrwxrwx 1 root root 9 Aug 1 13:22 man -> share/man
lrwxrwxrwx 1 root root 24 Nov 7 17:08 mpi -> /usr/local/openmpi-4.0.1
drwxr-xr-x 1 root root 17 Nov 7 17:07 openmpi-4.0.1
drwxr-xr-x 2 root root 24 Aug 1 13:25 sbin
drwxr-xr-x 1 root root 17 Nov 7 17:03 share
drwxr-xr-x 2 root root 6 Aug 1 13:22 src
Removing intermediate container 810601a714ed
Step 43/64 : RUN echo "nvcc version: " && nvcc --version
---> Running in 41fd47d4089a
nvcc version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
Step 58/64 : RUN echo "print pytorch version: " && python3 -c "import torch; print(torch.__version__)"
---> Running in 63ae4acaf382
print pytorch version:
1.12.0a0+git664058f
Removing intermediate container 63ae4acaf382
---> 3099c4e3f35d
Step 59/64 : RUN echo "print pytorch nccl version: " && python3 -c "import torch; print(torch.cuda.nccl.version())"
---> Running in ab1d49d02685
print pytorch nccl version:
(2, 13, 4)
Removing intermediate container ab1d49d02685
---> 393a7ad5f55a
Step 60/64 : RUN echo "print pytorch cuda version: " && python3 -c "import torch; print(torch.version.cuda)"
---> Running in df47a3c0d236
print pytorch cuda version:
11.7
Removing intermediate container df47a3c0d236
---> ebe9529fce33
Step 61/64 : RUN echo "print torch.cuda.is_available? " && python3 -c "import torch; print(torch.cuda.is_available())"
---> Running in 6be9b3a6ff33
print torch.cuda.is_available?
False
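To narrow this down, the separate checks above could be gathered into one script that also reports whether any NVIDIA device nodes are visible to the build step. This is only a minimal sketch: the function name `cuda_diagnostics` and the choice of checks are my own assumptions, not part of the build shown here. The idea is that torch.version.cuda only reports the toolkit PyTorch was compiled against, while torch.cuda.is_available() additionally needs a driver and a visible GPU at the moment it runs.

```python
import glob
import importlib.util
import os
import shutil


def cuda_diagnostics():
    """Collect facts that may explain why torch.cuda.is_available() is False.

    Hypothetical helper for illustration; it is not part of PyTorch.
    """
    info = {
        # Device nodes normally exist only when the container has GPU access.
        "nvidia_device_nodes": sorted(glob.glob("/dev/nvidia*")),
        # Locations of the driver and compiler tools, or None if not on PATH.
        "nvidia_smi_on_path": shutil.which("nvidia-smi"),
        "nvcc_on_path": shutil.which("nvcc"),
        # Environment variables that influence CUDA discovery.
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }

    # Keep the script usable even where PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        info["torch"] = None
        return info

    import torch

    info["torch"] = torch.__version__
    # CUDA version PyTorch was compiled against (None for CPU-only builds).
    info["torch_built_for_cuda"] = torch.version.cuda
    # Runtime check: needs a working driver and at least one visible GPU.
    info["cuda_available"] = torch.cuda.is_available()
    info["device_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in cuda_diagnostics().items():
        print(f"{key}: {value}")
```

Running it once in a RUN step and once in a container started from the finished image would show whether the result differs between the two. If nvidia_device_nodes is empty and nvidia-smi is missing in the build step, that would suggest the step simply has no GPU access, rather than that the CUDA install is broken.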
I install PyTorch as follows (and also add /usr/local/cuda/bin to PATH):
cd ${STAGE_DIR}/pytorch && git checkout v${PYTORCH_VERSION} && \
git submodule sync && git submodule update --init --recursive && \
export CUDA_HOME=/usr/local/cuda-11.7 && \
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}} && \
sudo rm -rf build && \
sudo NCCL_ROOT=${STAGE_DIR}/nccl/build \
     NCCL_INCLUDE_DIR=${STAGE_DIR}/nccl/build/include \
     NCCL_LIB_DIR=${STAGE_DIR}/nccl/build/lib \
     CMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
     CMAKE_CUDA_ARCHITECTURES=all \
     USE_SYSTEM_NCCL=1 \
     python3 setup.py install
Has anybody faced this issue, or does anyone have suggestions on how to resolve it?