PyTorch fails to load


Context: I need to use an old CUDA version (10.0) on a recent RTX30XX GPU. I am trying to build a container image for this purpose, as the host system uses CUDA 11.7. Since PyTorch support for the newer GPUs has only been added in recent versions, I cannot find readily available images that combine CUDA 10.0 and PyTorch >=1.7.

So I am trying to build my own container image, using the Dockerfile PyTorch provides. Out-of-the-box this didn’t work (some dependencies were not being pulled in, I needed to revert a CMake version bump since the base image’s CMake is too old), but with some tweaking I successfully built an image.

Using this image I get an error, however, when trying to execute PyTorch’s mnist example:

/opt/conda/lib/python3.8/site-packages/torchvision/io/ UserWarning: Failed to load image Python extension: cannot open shared object file: No such file or directory

I don’t know how to fix this. The shared object file is present in the image, at the same location (differing only by the Python version) as in existing images such as pytorch/pytorch:1.2-cuda10.0-cudnn7-devel. That is, it is present in two places: /opt/conda/lib/python3.6/site-packages/torch/lib/ and /opt/conda/pkgs/pytorch-1.2.0-py3.6_cuda10.0.130_cudnn7.6.2_0/lib/python3.6/site-packages/torch/lib/. The solutions I found for similar problems usually rely on different builds of PyTorch, which unfortunately isn’t applicable to my use case. How could I solve this problem?
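One way to narrow down a warning like this is to ask the dynamic loader directly which file it cannot open. The following is only a minimal sketch: the helper names `try_load` and `check_directory` are hypothetical, and the directory in the `__main__` part is an assumption based on the paths mentioned above (adjust it to the Python version inside the image).

```python
import ctypes
import glob
import os


def try_load(path):
    """Return None if the dynamic loader can open `path`, else the error text."""
    try:
        ctypes.CDLL(path)
        return None
    except OSError as exc:
        # The message usually names the dependency that could not be found.
        return str(exc)


def check_directory(lib_dir):
    """Map every shared object in `lib_dir` to its load error (None means it loaded)."""
    pattern = os.path.join(lib_dir, "*.so*")
    return {path: try_load(path) for path in sorted(glob.glob(pattern))}


if __name__ == "__main__":
    # Assumed location, taken from the paths in the post; not verified.
    lib_dir = "/opt/conda/lib/python3.8/site-packages/torch/lib"
    for path, error in check_directory(lib_dir).items():
        print(path, "->", error or "ok")
```

If a library is on disk but still fails to load, the error text typically points at one of its own dependencies rather than at the file itself.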

The Dockerfile that I use to generate the image (I had to redact URLs due to this forum’s link limit):

# syntax = docker/dockerfile:experimental
# NOTE: To build this you will need a docker version > 18.06 with
#       experimental enabled and DOCKER_BUILDKIT=1
#       If you do not use buildkit you are not going to have a good time
ARG BASE_IMAGE=ubuntu:18.04

FROM ${BASE_IMAGE} as dev-base
RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt \
    apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        ca-certificates \
        ccache \
        cmake \
        curl \
        git \
        libjpeg-dev \
        libpng-dev && \
    rm -rf /var/lib/apt/lists/*
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
ENV PATH /opt/conda/bin:$PATH

FROM dev-base as conda
RUN curl -fsSL -v -o ~/ -O  # miniconda URL #  && \
    chmod +x ~/ && \
    ~/ -b -p /opt/conda && \
    rm ~/ && \
    /opt/conda/bin/conda install -y python=${PYTHON_VERSION} conda-build pyyaml numpy ipython typing-extensions && \
    /opt/conda/bin/conda clean -ya

FROM dev-base as submodule-update
WORKDIR /opt/pytorch
RUN git clone --recursive -b release/1.12 # PyTorch git URL #

FROM conda as build
WORKDIR /opt/pytorch
COPY --from=conda /opt/conda /opt/conda
COPY --from=submodule-update /opt/pytorch /opt/pytorch
RUN git revert -n 5cdf79fddc27368ebef0536db19cf6c64c4cf405  # Allow for CMake 3.10 instead of 3.13
RUN --mount=type=cache,target=/opt/ccache \
    TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX 8.0" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
    CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    python setup.py install

FROM conda as conda-installs
ARG INSTALL_CHANNEL=pytorch-nightly
RUN /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -c "${CUDA_CHANNEL}" -y python=${PYTHON_VERSION} pytorch torchvision torchtext "cudatoolkit=${CUDA_VERSION}" && \
    /opt/conda/bin/conda clean -ya
RUN /opt/conda/bin/pip install torchelastic

FROM ${BASE_IMAGE} as official
LABEL com.nvidia.volumes.needed="nvidia_driver"
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
    apt-get update && apt-get install -y --no-install-recommends \
        ca-certificates \
        libjpeg-dev \
        libpng-dev && \
    rm -rf /var/lib/apt/lists/*
COPY --from=conda-installs /opt/conda /opt/conda
ENV PATH /opt/conda/bin:$PATH
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
WORKDIR /workspace

FROM official as dev
# Should override the already installed version from the official-image stage
COPY --from=build /opt/conda /opt/conda

This won’t be possible since your Ampere GPU needs CUDA>=11.1.

I was hoping to work around this by using 11.7 on the host and 10.0 in the container. I figured the container runtime should be able to negotiate it, since it likely uses 11.7 internally anyway.

No, bare metal would use an 11.7 driver while the container would use a 10.0 user mode driver and will fail. Ampere GPUs need CUDA 11, so even if you fix your build it’ll fail.
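The version requirement can be written down as a small lookup. This is a sketch, not an official compatibility matrix: the table lists only a few compute capabilities with the first CUDA toolkit release that supports them, and the names `MIN_CUDA_FOR_CAPABILITY`, `parse_version` and `toolkit_supports` are made up for illustration.

```python
# First CUDA toolkit release that supports a given compute capability.
# Deliberately incomplete; extend it for other GPU generations.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere, e.g. A100
    (8, 6): (11, 1),  # Ampere, e.g. RTX 30XX
}


def parse_version(text):
    """Turn a version string such as '11.7' into the tuple (11, 7)."""
    return tuple(int(part) for part in text.split(".")[:2])


def toolkit_supports(capability, cuda_version):
    """Return True if the CUDA toolkit version is new enough for the GPU."""
    minimum = MIN_CUDA_FOR_CAPABILITY[capability]
    return parse_version(cuda_version) >= minimum


if __name__ == "__main__":
    # The case from this thread: an RTX 30XX (capability 8.6) with CUDA 10.0.
    print(toolkit_supports((8, 6), "10.0"))
    print(toolkit_supports((8, 6), "11.7"))
```

With a working PyTorch install, `torch.cuda.get_device_capability()` and `torch.version.cuda` should provide the two inputs for the actual machine.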

Okay, then my assumptions are void. Thanks a lot, that helps me avoid sinking more effort into this!

Sure! What’s your exact use case? Updating the code to be compatible with the latest PyTorch release (e.g. with CUDA 11.7) might be the better approach.
Are you stuck somewhere while porting the code?

I am trying to provide the environment to a colleague who gave me the requirements, so I’m not familiar with the exact code. So far they have avoided porting the code, but given that it seems incompatible with our GPUs, I now have a better case to argue for that. We’ll have to see how difficult this will be, but currently we are not (yet) stuck.