Docker: Error 804: forward compatibility (...)

Hi all,

I can't get a minimal, bleeding-edge, CUDA-enabled PyTorch container working. I followed the build-from-source instructions in PyTorch's README. On the same machine, NVIDIA's PyTorch image works just fine when launched with the same docker run arguments. The only difference I could spot is CUDA 11.2 in NVIDIA's image vs. CUDA 11.3 in mine. Is that combination known not to work? I am rebuilding with CUDA 11.2 right now, but that takes a while, so I figured I'd ask here in the meantime in case someone spots something obvious that I am missing.
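
For reference, both containers are launched with something along these lines (the image name is a placeholder for whichever image I am testing, and --gpus all assumes the NVIDIA container toolkit is set up on the host):

docker run --rm -it --gpus all --entrypoint bash <image>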

This is what I get (inside the container):

root@42afc3dc16b0:/# nvidia-smi
Fri May 28 15:00:14 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 710      On   | 00000000:09:00.0 N/A |                  N/A |
| 50%   43C    P8    N/A /  N/A |    222MiB /   973MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    On   | 00000000:0A:00.0 Off |                  N/A |
|  0%   29C    P8     8W / 210W |      6MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@42afc3dc16b0:/# python3 -c "import torch;print(torch.cuda.is_available())"
/opt/miniconda/lib/python3.9/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:115.)
  return torch._C._cuda_getDeviceCount() > 0
False
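
A quick way to double-check which toolkit a torch build was compiled against is torch.version.cuda (part of the regular torch API), e.g.:

python3 -c "import torch; print(torch.version.cuda)"

In my image this should report the 11.3 toolkit from the base image, while the nvidia-smi output above comes from the 460.73.01 driver on the host.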

This is my Dockerfile:

FROM nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04 AS condainstall

COPY Miniconda3-py39_4.9.2-Linux-x86_64.sh /root/
RUN sh /root/Miniconda3-py39_4.9.2-Linux-x86_64.sh -b -p /opt/miniconda && \
    /opt/miniconda/bin/conda install -y \
        astunparse \
        numpy \
        ninja \
        pyyaml \
        mkl \
        mkl-include \
        setuptools \
        cmake \
        cffi \
        typing_extensions \
        future \
        six \
        requests \
        dataclasses && \
    /opt/miniconda/bin/conda clean --all

FROM nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04 AS torchbuild

COPY --from=condainstall /opt/miniconda /opt/miniconda

# cloned from github
COPY pytorch /root/pytorch

WORKDIR /root/pytorch

RUN apt-get update && apt-get install -y gcc g++

RUN eval "$(/opt/miniconda/bin/conda shell.bash hook)" && \
    export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"} && \
    python3 setup.py install

FROM nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04

COPY --from=torchbuild /opt/miniconda /opt/miniconda

ENV PATH /opt/miniconda/bin:$PATH

ENTRYPOINT ["python3", "-c", "import torch;print(torch.cuda.is_available())"]
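
For completeness, the image is built and run roughly like this (the tag name is arbitrary); with the ENTRYPOINT above, the run command just prints True or False:

docker build -t pytorch-cuda113 .
docker run --rm --gpus all pytorch-cuda113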

Based on the error message, I guess you might need to update the drivers on the bare-metal node.
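
To check the installed driver on the host, something like this works:

nvidia-smi --query-gpu=driver_version --format=csv,noheader

CUDA 11.3 generally expects an R465+ driver on Linux, whereas the 460.73.01 driver shown above only covers the toolkit up to 11.2.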

It turns out using CUDA 11.2 instead of 11.3 solved my problem.
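
Concretely, that just means swapping the base image tag in all three stages to an 11.2 one, e.g. something like:

FROM nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04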

Thanks for your reply @ptrblck. Updating the drivers on the bare-metal node might work too, but I am happy to just let Debian's APT handle the drivers there for now.