How to create a Docker image from PyTorch source

I want to build a Docker image with PyTorch 1.8 for an old GPU.
In this case, I have to build PyTorch from source.
So I referred to the official docs and tried building a Docker image.
But my Docker image can't detect the GPU (torch.cuda.is_available() returns False).

My system environment is as follows:
OS: Ubuntu 18.04
GPU: Tesla K40c
CUDA: 10.2
Driver: 440.118.02
Docker: 19.03.12
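
As a sanity check of the host setup (assuming the NVIDIA container toolkit is installed), running nvidia-smi through the base image works:

# should print the Tesla K40c if the driver and container runtime are set up correctly
docker run --rm --gpus all nvidia/cuda:10.2-cudnn8-runtime-ubuntu18.04 nvidia-smi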

The Dockerfile I used is as follows:

FROM nvidia/cuda:10.2-cudnn8-runtime-ubuntu18.04

RUN apt-get update \
    && apt-get install -y build-essential \
    && apt-get install -y ca-certificates \
    && apt-get install -y ccache \
    && apt-get install -y cmake \
    && apt-get install -y curl \
    && apt-get install -y file \
    && apt-get install -y sudo \
    && apt-get install -y git \
    && apt-get install -y locales \
    && locale-gen ja_JP.UTF-8
ENV LANG=ja_JP.UTF-8
ENV LANGUAGE=ja_JP:ja
ENV LC_ALL=ja_JP.UTF-8
RUN localedef -f UTF-8 -i ja_JP ja_JP.utf8

# Install nodejs
RUN curl -sL https://deb.nodesource.com/setup_14.x | sed 's|https://|http://|' | bash - \
    && sudo apt-get install -y nodejs

# Install MeCab, IPA, NEologd
RUN apt-get install -y mecab \
    && apt-get install -y libmecab-dev \
    && apt-get install -y mecab-ipadic \
    && apt-get install -y mecab-ipadic-utf8
RUN git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git \
    && mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -y \
    && cp /etc/mecabrc /usr/local/etc/

# remove files
RUN apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# install miniforge
COPY Miniforge3-Linux-x86_64.sh /usr/local/src
RUN bash /usr/local/src/Miniforge3-Linux-x86_64.sh -b
ENV PATH $PATH:/root/miniforge3/bin
RUN conda install python=3.8 \
    && conda install -c conda-forge jupyterlab

# install packages by pip
COPY requirements_pip.txt /tmp
COPY requirements.txt /tmp
RUN pip3 install --upgrade pip \
    && pip3 install --no-cache-dir -r /tmp/requirements_pip.txt \
    && rm -rf ~/.cache/pip

# install packages by conda
RUN conda install --file /tmp/requirements.txt \
    && conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses \
    && conda install -c pytorch magma-cuda102 \
    && conda clean --all

# install torch from source
RUN git clone --recursive https://github.com/pytorch/pytorch
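# note: this clones the default branch; PYTORCH_BUILD_VERSION below only sets the
# reported version string, so to pin the 1.8.2 sources you could also run, e.g.:
#   git checkout v1.8.2 && git submodule update --init --recursive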
WORKDIR pytorch
ENV PYTORCH_BUILD_VERSION=1.8.2
ENV PYTORCH_BUILD_NUMBER=1
ENV USE_CUDA=1 USE_CUDNN=1
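# Tesla K40c is Kepler (compute capability 3.5), which recent prebuilt binaries
# no longer target, so the source build must list it explicitly: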
ENV TORCH_CUDA_ARCH_LIST="3.5" TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"
ENV MAX_JOBS=2
RUN python setup.py clean \
    && python setup.py install

WORKDIR /home/work/

CMD ["jupyter-lab", "--ip=0.0.0.0","--port=8888" ,"--no-browser", "--allow-root", "--LabApp.token=''"]

I built the Docker image from the Dockerfile above and created a container from it (using --gpus all).
nvidia-smi works inside the container, but torch.cuda.is_available() returned False.
So I suspect something is wrong with how I build PyTorch from source.
How can I solve this problem?
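
For reference, the build and test commands looked like this (the image name is just a placeholder):

docker build -t pytorch-k40c .
docker run --rm --gpus all pytorch-k40c python -c "import torch; print(torch.cuda.is_available())"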

Could you check the build logs and see if the CUDA toolkit was detected and used? You should also see nvcc being invoked to compile the CUDA source files.
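
A sketch of one way to do that (on Docker 19.03, set DOCKER_BUILDKIT=1 to get --progress=plain; the classic builder prints full output anyway):

DOCKER_BUILDKIT=1 docker build --progress=plain -t pytorch-k40c . 2>&1 | tee build.log
# the CMake configure summary normally contains lines such as "USE_CUDA : ON" and the detected CUDA version
grep -inE "USE_CUDA|CUDA version|Found CUDA" build.log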

Thank you for your reply.

Do you mean I should check all the logs from the Docker image build?
(They are very long, so I can't find where to look.)

I checked nvcc -V both on the host system and in the Docker container.
On the host, the CUDA toolkit is detected.
In the Docker container, the command returns nvcc: command not found.

Another Docker container (one that can use the GPU) returns the same message.
Is there something wrong somewhere in my Dockerfile?

This would mean that the CUDA compiler cannot be used inside the container and thus PyTorch also isn’t built with CUDA support.
Make sure you can execute nvcc in nvidia/cuda:10.2-cudnn8-runtime-ubuntu18.04 by launching it with nvidia-docker or with the --gpus all argument in newer docker versions.
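
For example, a minimal check with the base image from the Dockerfile above:

docker run --rm --gpus all nvidia/cuda:10.2-cudnn8-runtime-ubuntu18.04 nvcc --version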

Sorry for the late reply.
I checked nvcc in nvidia/cuda:10.2-cudnn8-runtime-ubuntu18.04.
It returns nvcc: command not found.

I also checked another image (nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04).
It reports the CUDA version correctly, and torch.cuda.is_available() returned True.
So the cause of my problem seems to be the base Docker image.

Yes, you are right, and I missed the runtime tag in your initial container. To build from source inside the container you have to use the devel image. In any case, it's good to hear you've solved the issue.
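
In other words, the fix is a one-line change at the top of the Dockerfile; the devel image ships nvcc and the CUDA headers a source build needs, while the runtime image only contains the libraries:

FROM nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04

One more detail worth checking while rebuilding: Dockerfile ENV does not perform command substitution, so CMAKE_PREFIX_PATH currently holds the literal string $(dirname $(which conda))/../ instead of the conda prefix. A sketch of a fix, reusing the build step from the Dockerfile above:

RUN export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    && python setup.py clean \
    && python setup.py install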

Please try using the following template (COI: I am the author).

It can help greatly with building PyTorch from source on many Linux platforms.

@Sumac It also includes a Docker Compose file to make development easier.