PyTorch fails to load libc10_cuda.so

Hi,

Context: I need to use an old CUDA version (10.0) on a recent RTX 30XX GPU. I am trying to build a container image for this purpose, as the system uses CUDA 11.7. Since PyTorch support for the newer GPUs was only added in recent releases, I cannot find readily available images that combine CUDA 10.0 and PyTorch >= 1.7.

So I am trying to build my own container image using the Dockerfile PyTorch provides. Out of the box this didn’t work (some dependencies were not being pulled in, and I needed to revert a CMake version bump since the base image’s CMake is too old), but with some tweaking I successfully built an image.

However, when trying to run PyTorch’s MNIST example with this image, I get an error:

/opt/conda/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libc10_cuda.so: cannot open shared object file: No such file or directory

I don’t know how to fix this. libc10_cuda.so is present in the image at the same location (differing only in the Python version) as in existing images such as pytorch/pytorch:1.2-cuda10.0-cudnn7-devel, i.e. it exists in two places: /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so and /opt/conda/pkgs/pytorch-1.2.0-py3.6_cuda10.0.130_cudnn7.6.2_0/lib/python3.6/site-packages/torch/lib/libc10_cuda.so. The solutions I found for similar problems usually amount to using a different build of PyTorch, which unfortunately isn’t applicable to my use case. How could I solve this problem?
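
For reference, this is roughly what I run inside the container to check where the library ends up and which builds are installed; it’s only a diagnostic sketch (torch and torchvision both import, the warning only concerns torchvision’s image extension), and the comment about a mismatched torchvision build is my assumption, not something I have confirmed:

import os
import torch

# Directory that ships libc10_cuda.so alongside the installed PyTorch
torch_lib = os.path.join(os.path.dirname(torch.__file__), "lib")
print("libc10_cuda.so present:", "libc10_cuda.so" in os.listdir(torch_lib))

# Version/build info; a torchvision binary built against a different
# PyTorch/CUDA combination than the torch in the image could explain
# the warning (assumption, not confirmed)
import torchvision
print(torch.__version__, torch.version.cuda, torchvision.__version__)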

The Dockerfile that I use to generate the image (I had to redact URLs due to this forum’s link limit):

# syntax = docker/dockerfile:experimental
#
# NOTE: To build this you will need a docker version > 18.06 with
#       experimental enabled and DOCKER_BUILDKIT=1
#
#       If you do not use buildkit you are not going to have a good time
#
ARG BASE_IMAGE=ubuntu:18.04
ARG PYTHON_VERSION=3.8

FROM ${BASE_IMAGE} as dev-base
RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt \
    apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        ca-certificates \
        ccache \
        cmake \
        curl \
        git \
        libjpeg-dev \
        libpng-dev && \
    rm -rf /var/lib/apt/lists/*
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
ENV PATH /opt/conda/bin:$PATH

FROM dev-base as conda
ARG PYTHON_VERSION=3.8
RUN curl -fsSL -v -o ~/miniconda.sh -O  # miniconda URL #  && \
    chmod +x ~/miniconda.sh && \
    ~/miniconda.sh -b -p /opt/conda && \
    rm ~/miniconda.sh && \
    /opt/conda/bin/conda install -y python=${PYTHON_VERSION} conda-build pyyaml numpy ipython typing-extensions && \
    /opt/conda/bin/conda clean -ya

FROM dev-base as submodule-update
WORKDIR /opt/pytorch
RUN git clone --recursive -b release/1.12 # PyTorch git URL #

FROM conda as build
WORKDIR /opt/pytorch
COPY --from=conda /opt/conda /opt/conda
COPY --from=submodule-update /opt/pytorch /opt/pytorch
RUN git revert -n 5cdf79fddc27368ebef0536db19cf6c64c4cf405  # Allow for CMake 3.10 instead of 3.13
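# TORCH_CUDA_ARCH_LIST below selects the GPU architectures the CUDA kernels
# are compiled for; the "+PTX" suffix additionally embeds PTX for JIT
# compilation on newer architectures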
RUN --mount=type=cache,target=/opt/ccache \
    TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX 8.0" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
    CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    python setup.py install

FROM conda as conda-installs
ARG PYTHON_VERSION=3.8
ARG CUDA_VERSION=11.3
ARG CUDA_CHANNEL=nvidia
ARG INSTALL_CHANNEL=pytorch-nightly
ENV CONDA_OVERRIDE_CUDA=${CUDA_VERSION}
RUN /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -c "${CUDA_CHANNEL}" -y python=${PYTHON_VERSION} pytorch torchvision torchtext "cudatoolkit=${CUDA_VERSION}" && \
    /opt/conda/bin/conda clean -ya
RUN /opt/conda/bin/pip install torchelastic

FROM ${BASE_IMAGE} as official
ARG PYTORCH_VERSION
LABEL com.nvidia.volumes.needed="nvidia_driver"
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
    apt-get update && apt-get install -y --no-install-recommends \
        ca-certificates \
        libjpeg-dev \
        libpng-dev && \
    rm -rf /var/lib/apt/lists/*
COPY --from=conda-installs /opt/conda /opt/conda
ENV PATH /opt/conda/bin:$PATH
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
ENV PYTORCH_VERSION ${PYTORCH_VERSION}
WORKDIR /workspace

FROM official as dev
# Should override the already installed version from the official-image stage
COPY --from=build /opt/conda /opt/conda

This won’t be possible, since your Ampere GPU needs CUDA >= 11.1.

I was hoping to work around this by using 11.7 on the host and 10.0 in the container; I figured the container runtime should be able to negotiate that, since it likely uses 11.7 internally anyway.

No, bare metal would use an 11.7 driver while the container would use a 10.0 user-mode driver (libcuda.so), and that will fail. Ampere GPUs need CUDA 11, so even if you fix your build it’ll fail.
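
You can see the mismatch directly on the node; a minimal sketch (assuming any CUDA-enabled PyTorch build is installed) would be:

import torch

# CUDA toolkit version this PyTorch binary was built with
print("built with CUDA:", torch.version.cuda)

if torch.cuda.is_available():
    # Ampere RTX 30XX cards report compute capability 8.6 (sm_86),
    # which a CUDA 10.0 build cannot target
    print("compute capability:", torch.cuda.get_device_capability(0))
    # architectures this binary was actually compiled for
    print("arch list:", torch.cuda.get_arch_list())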

Okay, then my assumptions are void. Thanks a lot, that helps me avoid sinking more effort into this!

Sure! What’s your exact use case? Maybe updating the code to be compatible with the latest PyTorch release (with e.g. CUDA 11.7) would be the better approach.
Are you stuck porting the code somewhere?

I am trying to provide the environment for a colleague who gave me the requirements, so I’m not familiar with the exact code. So far they have avoided porting it, but given that it seems incompatible with our GPUs, I now have a better case to argue for that than I did before. We’ll have to see how difficult it will be, but currently we are not (yet) stuck.