Hi all,
I can't get a minimal, bleeding-edge, CUDA-enabled PyTorch container working. I followed the build instructions in PyTorch's README. On the same machine, NVIDIA's PyTorch image works just fine, launched with the same docker run
arguments. The only difference I could spot is CUDA 11.2 in NVIDIA's image vs. CUDA 11.3 in mine. Is that combination known not to work? I am rebuilding with CUDA 11.2 right now, but it takes a while, so I figured I'd ask here in the meantime in case I'm missing something obvious and someone spots it.
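Side note on the rebuild time: limiting the target architectures speeds up the CUDA part of the compile a lot. `TORCH_CUDA_ARCH_LIST` and `MAX_JOBS` are picked up by PyTorch's setup.py; the arch values below match my two cards (GT 710 = sm_35, GTX 1080 = sm_61) and are my own choice, not something from the README:

```shell
# Compile CUDA kernels only for the GPUs actually in this machine,
# instead of the full default arch list.
export TORCH_CUDA_ARCH_LIST="3.5;6.1"
# Cap parallel compile jobs; the full build can exhaust RAM otherwise.
export MAX_JOBS=4
# ...then run `python3 setup.py install` as in the Dockerfile below.
```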
This is what I get (inside the container):
root@42afc3dc16b0:/# nvidia-smi
Fri May 28 15:00:14 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GT 710 On | 00000000:09:00.0 N/A | N/A |
| 50% 43C P8 N/A / N/A | 222MiB / 973MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 On | 00000000:0A:00.0 Off | N/A |
| 0% 29C P8 8W / 210W | 6MiB / 8119MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@42afc3dc16b0:/# python3 -c "import torch;print(torch.cuda.is_available())"
/opt/miniconda/lib/python3.9/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:115.)
return torch._C._cuda_getDeviceCount() > 0
False
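From what I have read, error 804 can come from the forward-compatibility package that ships in the nvidia/cuda images: its compat libcuda.so only works on data-center GPUs, and if it shadows the host driver's libcuda on a GeForce card, initialization fails with exactly this message. I am not certain that is what's happening here, but this is how I would check inside the container (paths assumed from the nvidia/cuda image layout):

```shell
# If a compat driver stub is present and shadows the host's libcuda,
# forward compatibility is attempted and fails on unsupported hardware.
ls /usr/local/cuda/compat/ 2>/dev/null || echo "no compat package installed"
ldconfig -p 2>/dev/null | grep libcuda || echo "libcuda not in the linker cache"
```

If the compat directory shows up, removing the cuda-compat package (or keeping it off LD_LIBRARY_PATH) would be the thing to try.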
This is my Dockerfile:
FROM nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04 AS condainstall
COPY Miniconda3-py39_4.9.2-Linux-x86_64.sh /root/
RUN sh /root/Miniconda3-py39_4.9.2-Linux-x86_64.sh -b -p /opt/miniconda && \
    /opt/miniconda/bin/conda install -y \
        astunparse \
        numpy \
        ninja \
        pyyaml \
        mkl \
        mkl-include \
        setuptools \
        cmake \
        cffi \
        typing_extensions \
        future \
        six \
        requests \
        dataclasses && \
    /opt/miniconda/bin/conda clean -y --all
FROM nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04 AS torchbuild
COPY --from=condainstall /opt/miniconda /opt/miniconda
# cloned from github
COPY pytorch /root/pytorch
WORKDIR /root/pytorch
RUN apt-get update && apt-get install -y gcc g++
RUN eval "$(/opt/miniconda/bin/conda shell.bash hook)" && \
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"} && \
python3 setup.py install
FROM nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
COPY --from=torchbuild /opt/miniconda /opt/miniconda
ENV PATH /opt/miniconda/bin:$PATH
ENTRYPOINT ["python3", "-c", "import torch;print(torch.cuda.is_available())"]
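For reference, this is how I build and run it; the image tag is just a placeholder of mine, and `--gpus all` assumes the NVIDIA Container Toolkit is set up (it must be, since NVIDIA's image works with the same arguments):

```shell
# Build the image from the Dockerfile above, then run the entrypoint,
# which prints torch.cuda.is_available().
docker build -t pytorch-cuda113-test .
docker run --rm --gpus all pytorch-cuda113-test
```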