Hi,
Context: I need to use an old CUDA version (10.0) on a recent RTX 30XX GPU. Because the host system uses CUDA 11.7, I am trying to build a container image for this purpose. PyTorch support for the newer GPUs was only added in recent versions, so I cannot find readily available images that combine CUDA 10.0 and PyTorch >= 1.7.
So I am trying to build my own container image, using the Dockerfile PyTorch provides. Out of the box this didn’t work: some dependencies were not being pulled in, and I needed to revert a CMake version bump because the base image’s CMake is too old. With some tweaking, though, I successfully built an image.
However, when I try to run PyTorch’s MNIST example with this image, I get an error:
/opt/conda/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libc10_cuda.so: cannot open shared object file: No such file or directory
I don’t know how to fix this. The libc10_cuda.so is present in the image, at the same location (differing only by the Python version) as in existing images, e.g., pytorch/pytorch:1.2-cuda10.0-cudnn7-devel. That is, it is present in two places: /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so and /opt/conda/pkgs/pytorch-1.2.0-py3.6_cuda10.0.130_cudnn7.6.2_0/lib/python3.6/site-packages/torch/lib/libc10_cuda.so. The solutions I found for similar problems usually involve using a different build of PyTorch, which unfortunately isn’t an option for my use case. How could I solve this problem?
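To narrow down whether the file is merely present or can actually be loaded, a check along the following lines could be run inside the container. This is a minimal diagnostic sketch, not a fix: `find_shared_libs` and `try_load` are hypothetical helper names introduced here, and `/opt/conda` is the conda prefix from the Dockerfile. A loader message that names a different missing library would suggest that one of the library's own dependencies, rather than libc10_cuda.so itself, cannot be resolved.

```python
import ctypes
import os


def find_shared_libs(root, name):
    """Return the sorted paths of every file called `name` below `root`."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            hits.append(os.path.join(dirpath, name))
    return sorted(hits)


def try_load(path):
    """Try to load `path` with the dynamic loader.

    Returns None on success, otherwise the loader's error message.
    """
    try:
        ctypes.CDLL(path)
        return None
    except OSError as exc:
        return str(exc)


if __name__ == "__main__":
    # Assumption: /opt/conda is the prefix used in the image.
    for lib in find_shared_libs("/opt/conda", "libc10_cuda.so"):
        print(lib, "->", try_load(lib) or "loads fine")
```

Running the script with the image's own Python interpreter keeps the environment (including `LD_LIBRARY_PATH`) the same as for the failing example.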
The Dockerfile that I use to generate the image (I had to redact URLs due to this forum’s link limit):
# syntax = docker/dockerfile:experimental
#
# NOTE: To build this you will need a docker version > 18.06 with
# experimental enabled and DOCKER_BUILDKIT=1
#
# If you do not use buildkit you are not going to have a good time
#
ARG BASE_IMAGE=ubuntu:18.04
ARG PYTHON_VERSION=3.8
FROM ${BASE_IMAGE} as dev-base
RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt \
apt-get update && apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
ccache \
cmake \
curl \
git \
libjpeg-dev \
libpng-dev && \
rm -rf /var/lib/apt/lists/*
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
ENV PATH /opt/conda/bin:$PATH
FROM dev-base as conda
ARG PYTHON_VERSION=3.8
RUN curl -fsSL -v -o ~/miniconda.sh -O # miniconda URL # && \
chmod +x ~/miniconda.sh && \
~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh && \
/opt/conda/bin/conda install -y python=${PYTHON_VERSION} conda-build pyyaml numpy ipython typing-extensions && \
/opt/conda/bin/conda clean -ya
FROM dev-base as submodule-update
WORKDIR /opt/pytorch
RUN git clone --recursive -b release/1.12 # PyTorch git URL #
FROM conda as build
WORKDIR /opt/pytorch
COPY --from=conda /opt/conda /opt/conda
COPY --from=submodule-update /opt/pytorch /opt/pytorch
RUN git revert -n 5cdf79fddc27368ebef0536db19cf6c64c4cf405 # Allow for CMake 3.10 instead of 3.13
RUN --mount=type=cache,target=/opt/ccache \
TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX 8.0" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
python setup.py install
FROM conda as conda-installs
ARG PYTHON_VERSION=3.8
ARG CUDA_VERSION=11.3
ARG CUDA_CHANNEL=nvidia
ARG INSTALL_CHANNEL=pytorch-nightly
ENV CONDA_OVERRIDE_CUDA=${CUDA_VERSION}
RUN /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -c "${CUDA_CHANNEL}" -y python=${PYTHON_VERSION} pytorch torchvision torchtext "cudatoolkit=${CUDA_VERSION}" && \
/opt/conda/bin/conda clean -ya
RUN /opt/conda/bin/pip install torchelastic
FROM ${BASE_IMAGE} as official
ARG PYTORCH_VERSION
LABEL com.nvidia.volumes.needed="nvidia_driver"
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
apt-get update && apt-get install -y --no-install-recommends \
ca-certificates \
libjpeg-dev \
libpng-dev && \
rm -rf /var/lib/apt/lists/*
COPY --from=conda-installs /opt/conda /opt/conda
ENV PATH /opt/conda/bin:$PATH
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
ENV PYTORCH_VERSION ${PYTORCH_VERSION}
WORKDIR /workspace
FROM official as dev
# Should override the already installed version from the official-image stage
COPY --from=build /opt/conda /opt/conda
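
As a side check on the `TORCH_CUDA_ARCH_LIST` value used in the build stage, the sketch below tests whether an architecture list covers a given compute capability. It relies on the CUDA compatibility rules: native code runs on a device with the same major version and an equal or higher minor version, and embedded PTX can be JIT-compiled for any device of equal or higher capability. `parse_arch_list` and `covers_device` are hypothetical helpers, only numeric entries such as `7.0+PTX` are handled, and compute capability 8.6 for RTX 30XX cards is an assumption to verify against the actual device. The sketch does not check whether the installed CUDA toolkit can compile for each listed architecture.

```python
def parse_arch_list(arch_list):
    """Parse a TORCH_CUDA_ARCH_LIST string into [((major, minor), has_ptx), ...].

    Only numeric entries are supported; named architectures are not handled.
    """
    entries = []
    for token in arch_list.replace(";", " ").split():
        has_ptx = token.endswith("+PTX")
        version = token[:-4] if has_ptx else token
        major, minor = version.split(".")
        entries.append(((int(major), int(minor)), has_ptx))
    return entries


def covers_device(arch_list, capability):
    """Return True if a build with `arch_list` can run on a GPU of `capability`."""
    for arch, has_ptx in parse_arch_list(arch_list):
        # Native code: same major version, minor version not above the device's.
        native_ok = arch[0] == capability[0] and arch[1] <= capability[1]
        # PTX: forward compatible with any equal or newer device.
        ptx_ok = has_ptx and arch <= capability
        if native_ok or ptx_ok:
            return True
    return False


if __name__ == "__main__":
    # Arch list copied from the build stage; (8, 6) is the assumed capability.
    print(covers_device("3.5 5.2 6.0 6.1 7.0+PTX 8.0", (8, 6)))
```

With the list from the Dockerfile, `covers_device` returns `True` for `(8, 6)`, both through the `8.0` entry and through the PTX embedded for `7.0`.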