Issue Building Torchvision from Source

I’m building torch and torchvision from source since my system is unfortunately pinned to CUDA 10.0 and I need torch 1.6+.

I am able to build and use torch fine, however I get an error when I add the torchvision build. A condensed version of my Dockerfile is below:

FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
ARG PYTHON_VERSION=3.6
RUN curl -o ~/miniconda.sh -O  https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh  && \
     chmod +x ~/miniconda.sh && \
     ~/miniconda.sh -b -p /opt/conda && \
     rm ~/miniconda.sh
# Use conda to install python and some packages
RUN /opt/conda/bin/conda install -y python=$PYTHON_VERSION

# Add conda python to the path
ENV PATH=/opt/conda/bin:$PATH

# Tools to build from source
RUN conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
# CUDA10
RUN conda install -c pytorch magma-cuda100  

# Compile torch (WORKS)
RUN cd / && \
        git clone --recursive https://github.com/pytorch/pytorch && \
        cd pytorch && \
        git submodule sync && \
        git submodule update --init --recursive && \
        export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"} && \
        python setup.py install

# Compile torchvision (DOES NOT WORK)
RUN cd / && \
        git clone --recursive https://github.com/pytorch/vision.git && \
        cd vision && \
        python setup.py install

And the error I get doesn’t seem super useful:

Edit: ah, I didn’t spot the real error at first …
Found no NVIDIA driver on your system

This is weird because torch installs fine (with CUDA) but torchvision doesn’t?

I googled and saw that some people install torchvision like so (whereas I didn’t specify TORCH_CUDA_ARCH_LIST):

ARG torchvision_tag='v0.5.0'
ARG torchvision_cuda='0'
RUN git clone --recursive https://github.com/pytorch/vision \
 && cd vision \
 && git checkout $torchvision_tag \
 && git submodule sync \
 && git submodule update --init --recursive
RUN cd vision \
 && . /opt/conda/bin/activate \
 && export TORCH_CUDA_ARCH_LIST="3.7;6.1;7.5" \
 && export FORCE_CUDA=$torchvision_cuda \
 && python setup.py install \
 && python setup.py bdist_wheel
RUN find vision -name '*.whl' \
 && cp vision/dist/*.whl /packages

So I can give that a go; I’m not sure about FORCE_CUDA but will try setting it to True first. I will also use the ARCH_LIST from the PyTorch Dockerfile, which is TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX" and thus a bit different.

Is it just the case that you list the CUDA compute capabilities you want, so if I wish to run on a V100 I need 7.0, and if also on a 2080 then 7.5 as well? So ideally it would be “7.0 7.5”?
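For reference, my (possibly wrong) understanding of why the driver error shows up: during docker build no GPU is visible inside the container, so when TORCH_CUDA_ARCH_LIST is unset the extension build tries to query the compute capability of an attached device and fails, and torchvision only force-builds its CUDA ops when FORCE_CUDA is the literal string "1". If that's right, something along these lines before the install step should be enough (sketch only):

# build torchvision's CUDA ops even though no GPU is visible during docker build
export FORCE_CUDA=1                      # setup.py seems to check for the string '1', not 'True'
export TORCH_CUDA_ARCH_LIST="7.0;7.5"    # 7.0 = V100, 7.5 = RTX 2080 (Ti)
python setup.py clean
python setup.py install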

Edit 2: Seems none of that helped.

Step 41/56 : RUN cd vision  && . /opt/conda/bin/activate  && export TORCH_CUDA_ARCH_LIST="3.7;6.1;7.0;7.5"  && export FORCE_CUDA=$torchvision_cuda  && python setup.py install
 ---> Running in ff2dd49d614a
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Building wheel torchvision-0.7.0a0+78ed10c

And then the full error:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "setup.py", line 255, in <module>
    'clean': clean,
  File "/opt/conda/lib/python3.6/site-packages/setuptools/__init__.py", line 163, in setup
    return distutils.core.setup(**attrs)
  File "/opt/conda/lib/python3.6/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/install.py", line 67, in run
    self.do_egg_install()
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/install.py", line 109, in do_egg_install
    self.run_command('bdist_egg')
  File "/opt/conda/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/bdist_egg.py", line 175, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/bdist_egg.py", line 161, in call_command
    self.run_command(cmdname)
  File "/opt/conda/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/opt/conda/lib/python3.6/distutils/command/install_lib.py", line 107, in build
    self.run_command('build_ext')
  File "/opt/conda/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/build_ext.py", line 87, in run
    _build_ext.run(self)
  File "/opt/conda/lib/python3.6/distutils/command/build_ext.py", line 339, in run
    self.build_extensions()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 649, in build_extensions
    build_ext.build_extensions(self)
  File "/opt/conda/lib/python3.6/distutils/command/build_ext.py", line 448, in build_extensions
    self._build_extensions_serial()
  File "/opt/conda/lib/python3.6/distutils/command/build_ext.py", line 473, in _build_extensions_serial
    self.build_extension(ext)
  File "/opt/conda/lib/python3.6/site-packages/setuptools/command/build_ext.py", line 208, in build_extension
    _build_ext.build_extension(self, ext)
  File "/opt/conda/lib/python3.6/distutils/command/build_ext.py", line 533, in build_extension
    depends=ext.depends)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 478, in unix_wrap_ninja_compile
    with_cuda=with_cuda)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1233, in _write_ninja_file_and_compile_objects
    error_prefix='Error compiling objects for extension')
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1529, in _run_ninja_build
    raise RuntimeError(message)
RuntimeError: Error compiling objects for extension

CUDA is detected in /usr/local/cuda, which is the default CUDA install location in the container, so this shouldn’t be the root cause of the issue.

Could you post the complete torchvision install log so that we can take another look at it?

Thanks Patrick! Here is the complete log for vision (too long to paste in the post):

Thanks for the log.
It seems the first error is:

/vision/torchvision/csrc/cpu/decoder/stream.h:52:30: error: ‘findCodec’ declared as a ‘virtual’ field

I haven’t seen this failure before. Could you pull the latest torchvision, clean the build and rebuild?
Let me know if you still get an error and I can try to reproduce it.
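Something along these lines should do for the cleanup (adjust the path to wherever you cloned it):

cd vision
git pull
git submodule update --init --recursive
python setup.py clean   # torchvision's setup.py provides a clean command
rm -rf build/
python setup.py install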

Hey Patrick, I tried again from scratch (my log). My Dockerfile:

FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
ARG PYTHON_VERSION=3.6

ENV PATH=$PATH:/usr/local/nvidia/bin:/usr/local/cuda/bin
ENV CUDA_HOME=/usr/local/cuda
ENV CUDNN_INSTALL_PATH=/usr/local/cuda
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64

RUN curl -o ~/miniconda.sh -O  https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh  && \
     chmod +x ~/miniconda.sh && \
     ~/miniconda.sh -b -p /opt/conda && \
     rm ~/miniconda.sh

RUN /opt/conda/bin/conda install -y python=$PYTHON_VERSION

# Add conda python to the path
ENV PATH=/opt/conda/bin:$PATH

# Tools to build from source
RUN conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
RUN conda install -c pytorch magma-cuda100  

# https://github.com/MoonVision/moonbox-docker/blob/master/docker/builders/pytorch/Dockerfile#L31
RUN git clone --recursive https://github.com/pytorch/pytorch \
 && cd pytorch \
 && git submodule sync && git submodule update --init --recursive \
 && export TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX" \
 && export TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
 && export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
 && python setup.py clean \
 && python setup.py install

RUN git clone --recursive https://github.com/pytorch/vision \
 && cd vision \
 && git submodule sync && git submodule update --init --recursive \
 && export TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX" \
 && export TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
 && export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
 && python setup.py clean \
 && python setup.py install

Thanks again for the help.

Edit: Just to say that if I comment out the torchvision step, torch does seem to work (at least the parts that don't need vision):

>>> torch.__version__
'1.8.0a0+f65ab89'
>>> torch.cuda.is_available()
True
>>>
>>> torch.cuda.get_device_name(0)
'GeForce RTX 2080 Ti'
>>> x = torch.randn(16,3,224,224).cuda()
>>> x2 = torch.nn.functional.avg_pool2d(x, (3,3))

My system details:

Driver Version: 410.104     
CUDA Version: 10.0  

Thanks for the Dockerfile.
After adding RUN apt-get install curl and the install command for git, I was able to build PyTorch as well as torchvision without any errors:

root@9c9d62ffdfcf:/# python
Python 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.8.0a0+4ab73c1'
>>> import torchvision
>>> torchvision.__version__
'0.8.0a0+217e26f'
>>>
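Once you run the image on a machine with a visible GPU, you could also exercise one of the compiled CUDA ops to make sure the extension was really built with CUDA support, e.g. (just a quick sketch):

>>> import torch, torchvision
>>> boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.]], device='cuda')
>>> scores = torch.tensor([0.9, 0.8], device='cuda')
>>> torchvision.ops.nms(boxes, scores, iou_threshold=0.5)  # runs the custom CUDA kernel
tensor([0], device='cuda:0')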

Thanks for checking Patrick, I’m not sure why it didn’t work for me.

However I started from scratch and the below works:

FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04

ARG PYTHON_VERSION=3.8
ARG WITH_TORCHVISION=1
ARG pytorch_tag='v1.6.0'
ARG torchvision_tag='v0.7.0'

RUN apt-get update && apt-get install -y --no-install-recommends \
         build-essential \
         cmake \
         git \
         curl \
         ca-certificates \
         libjpeg-dev \
         libpng-dev && \
     rm -rf /var/lib/apt/lists/*


RUN curl -o ~/miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
     chmod +x ~/miniconda.sh && \
     ~/miniconda.sh -b -p /opt/conda && \
     rm ~/miniconda.sh && \
     /opt/conda/bin/conda install -y python=$PYTHON_VERSION numpy pyyaml scipy ipython mkl mkl-include ninja cython typing && \
     /opt/conda/bin/conda install -y -c pytorch magma-cuda100 && \
     /opt/conda/bin/conda clean -ya

ENV PATH /opt/conda/bin:$PATH

# This must be done before pip so that requirements.txt is available

WORKDIR /opt

RUN git clone https://github.com/pytorch/pytorch.git && cd pytorch && git checkout $pytorch_tag 
WORKDIR /opt/pytorch
RUN git submodule sync && git submodule update --init --recursive
RUN TORCH_CUDA_ARCH_LIST="7.0 7.5" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
    CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    pip install -v .

RUN if [ "$WITH_TORCHVISION" = "1" ] ; then git clone https://github.com/pytorch/vision.git && cd vision && git checkout $torchvision_tag && pip install -v . ; else echo "building without torchvision" ; fi

WORKDIR /workspace
RUN chmod -R a+w .

Build: DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile --tag XXX/torch16:20201007 --build-arg pytorch_tag='v1.6.0' --build-arg torchvision_tag='v0.7.0' --build-arg PYTHON_VERSION=3.8 .
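To double-check that CUDA support actually made it into the image, I run something like this on the GPU host (the --gpus all flag assumes a recent Docker with the NVIDIA container toolkit; older setups would use --runtime=nvidia instead):

docker run --gpus all --rm XXX/torch16:20201007 \
    python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"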
