Install PyTorch from source with CUDA 10.2

Veril · January 16, 2020, 4:14pm

I have a compute capability 3.0 card, so I always have to install from source for it.

But there is no MAGMA build for CUDA 10.2, so what do people do if they want to install from source?

When installing CUDA on Fedora, it installs the latest version, and downgrading it would be a major hassle.

Edit: When proceeding without MAGMA, as suggested, the install fails after a few seconds:

-- Generating done
-- Build files have been written to: /home/aaa/000git/pytorch/build
cmake --build . --target install --config Release -- -j 8
[16/3522] Performing build step for 'nccl_external'
FAILED: nccl_external-prefix/src/nccl_external-stamp/nccl_external-build nccl/lib/libnccl_static.a 
cd /home/aaa/000git/pytorch/third_party/nccl/nccl && env CCACHE_DISABLE=1 SCCACHE_DISABLE=1 make CXX=/usr/bin/c++ CUDA_HOME=/usr/local/cuda NVCC=/usr/local/cuda/bin/nvcc NVCC_GENCODE=-gencode=arch=compute_30,code=sm_30 BUILDDIR=/home/aaa/000git/pytorch/build/nccl VERBOSE=0 -j && /home/aaa/anaconda3/bin/cmake -E touch /home/aaa/000git/pytorch/build/nccl_external-prefix/src/nccl_external-stamp/nccl_external-build
make -C src build BUILDDIR=/home/aaa/000git/pytorch/build/nccl
make[1]: Entering directory '/home/aaa/000git/pytorch/third_party/nccl/nccl/src'
Grabbing   include/nccl_net.h                  > /home/aaa/000git/pytorch/build/nccl/include/nccl_net.h
Compiling  init.cc                             > /home/aaa/000git/pytorch/build/nccl/obj/init.o
Generating nccl.h.in                           > /home/aaa/000git/pytorch/build/nccl/include/nccl.h
Compiling  channel.cc                          > /home/aaa/000git/pytorch/build/nccl/obj/channel.o
Compiling  bootstrap.cc                        > /home/aaa/000git/pytorch/build/nccl/obj/bootstrap.o
Compiling  transport.cc                        > /home/aaa/000git/pytorch/build/nccl/obj/transport.o
Compiling  enqueue.cc                          > /home/aaa/000git/pytorch/build/nccl/obj/enqueue.o
Compiling  misc/group.cc                       > /home/aaa/000git/pytorch/build/nccl/obj/misc/group.o
Compiling  misc/nvmlwrap.cc                    > /home/aaa/000git/pytorch/build/nccl/obj/misc/nvmlwrap.o
Compiling  misc/rings.cc                       > /home/aaa/000git/pytorch/build/nccl/obj/misc/rings.o
Compiling  misc/ibvwrap.cc                     > /home/aaa/000git/pytorch/build/nccl/obj/misc/ibvwrap.o
Compiling  misc/argcheck.cc                    > /home/aaa/000git/pytorch/build/nccl/obj/misc/argcheck.o
Compiling  misc/utils.cc                       > /home/aaa/000git/pytorch/build/nccl/obj/misc/utils.o
Compiling  misc/trees.cc                       > /home/aaa/000git/pytorch/build/nccl/obj/misc/trees.o
Compiling  misc/topo.cc                        > /home/aaa/000git/pytorch/build/nccl/obj/misc/topo.o
Compiling  transport/p2p.cc                    > /home/aaa/000git/pytorch/build/nccl/obj/transport/p2p.o
Compiling  transport/shm.cc                    > /home/aaa/000git/pytorch/build/nccl/obj/transport/shm.o
Compiling  transport/net.cc                    > /home/aaa/000git/pytorch/build/nccl/obj/transport/net.o
Compiling  transport/net_socket.cc             > /home/aaa/000git/pytorch/build/nccl/obj/transport/net_socket.o
Compiling  transport/net_ib.cc                 > /home/aaa/000git/pytorch/build/nccl/obj/transport/net_ib.o
Compiling  collectives/all_reduce.cc           > /home/aaa/000git/pytorch/build/nccl/obj/collectives/all_reduce.o
Compiling  collectives/all_gather.cc           > /home/aaa/000git/pytorch/build/nccl/obj/collectives/all_gather.o
Compiling  collectives/broadcast.cc            > /home/aaa/000git/pytorch/build/nccl/obj/collectives/broadcast.o
Compiling  collectives/reduce.cc               > /home/aaa/000git/pytorch/build/nccl/obj/collectives/reduce.o
Compiling  collectives/reduce_scatter.cc       > /home/aaa/000git/pytorch/build/nccl/obj/collectives/reduce_scatter.o
Generating nccl.pc.in                          > /home/aaa/000git/pytorch/build/nccl/lib/pkgconfig/nccl.pc
make[2]: Entering directory '/home/aaa/000git/pytorch/third_party/nccl/nccl/src/collectives/device'
Generating rules                               > /home/aaa/000git/pytorch/build/nccl/obj/collectives/device/Makefile.rules
In file included from bootstrap.cc:12:
include/socket.h: In function ‘ncclResult_t connectAddress(int*, socketAddress*)’:
include/socket.h:41:19: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   41 |   sprintf(buf, "%s<%s>", host, service);
      |                   ^
include/socket.h:41:10: note: ‘sprintf’ output between 3 and 1058 bytes into a destination of size 1024
   41 |   sprintf(buf, "%s<%s>", host, service);
      |   ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/socket.h: In function ‘int findInterfaceMatchSubnet(char*, socketAddress*, socketAddress, int, int)’:
include/socket.h:41:19: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   41 |   sprintf(buf, "%s<%s>", host, service);
      |                   ^
include/socket.h:41:10: note: ‘sprintf’ output between 3 and 1058 bytes into a destination of size 1024
   41 |   sprintf(buf, "%s<%s>", host, service);
      |   ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from transport/net_socket.cc:9:
include/socket.h: In function ‘ncclResult_t connectAddress(int*, socketAddress*)’:
include/socket.h:41:19: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   41 |   sprintf(buf, "%s<%s>", host, service);
      |                   ^
include/socket.h:41:10: note: ‘sprintf’ output between 3 and 1058 bytes into a destination of size 1024
   41 |   sprintf(buf, "%s<%s>", host, service);
      |   ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_runtime.h:83,
                 from <command-line>:
/usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
  138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
      |  ^~~~~
In file included from /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_runtime.h:83,
                 from <command-line>:
/usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
  138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
      |  ^~~~~
In file included from /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_runtime.h:83,
                 from <command-line>:
/usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
  138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
      |  ^~~~~
include/socket.h: In function ‘ncclResult_t ncclSocketInit(ncclDebugLogger_t)’:
include/socket.h:41:19: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   41 |   sprintf(buf, "%s<%s>", host, service);
      |                   ^
include/socket.h:41:10: note: ‘sprintf’ output between 3 and 1058 bytes into a destination of size 1024
   41 |   sprintf(buf, "%s<%s>", host, service);
      |   ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/socket.h:41:19: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   41 |   sprintf(buf, "%s<%s>", host, service);
      |                   ^
include/socket.h:41:10: note: ‘sprintf’ output between 3 and 1058 bytes into a destination of size 1024
   41 |   sprintf(buf, "%s<%s>", host, service);
      |   ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
transport/net_socket.cc:40:67: warning: ‘%s’ directive output may be truncated writing up to 1023 bytes into a region of size between 1017 and 1018 [-Wformat-truncation=]
   40 |           snprintf(line+strlen(line), 1023-strlen(line), " [%d]%s:%s", i, ncclNetIfNames+i*MAX_IF_NAME_SIZE,
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   41 |               socketToString(&ncclNetIfAddrs[i].sa, addrline));
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~     
transport/net_socket.cc:40:19: note: ‘snprintf’ output 6 or more bytes (assuming 1030) into a destination of size 1023
   40 |           snprintf(line+strlen(line), 1023-strlen(line), " [%d]%s:%s", i, ncclNetIfNames+i*MAX_IF_NAME_SIZE,
      |           ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   41 |               socketToString(&ncclNetIfAddrs[i].sa, addrline));
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_runtime.h:83,
                 from <command-line>:
/usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
  138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
      |  ^~~~~
In file included from /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_runtime.h:83,
                 from <command-line>:
/usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
  138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
      |  ^~~~~
In file included from /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_runtime.h:83,
                 from <command-line>:
/usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
  138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
      |  ^~~~~
In file included from transport/net_ib.cc:9:
include/socket.h: In function ‘int findInterfaces(const char*, char*, socketAddress*, int, int, int)’:
include/socket.h:108:14: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 16 equals destination size [-Wstringop-truncation]
  108 |       strncpy(names+found*maxIfNameSize, interface->ifa_name, maxIfNameSize);
      |       ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/socket.h:108:14: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 16 equals destination size [-Wstringop-truncation]
include/socket.h: In function ‘ncclResult_t bootstrapNetInit()’:
include/socket.h:41:19: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   41 |   sprintf(buf, "%s<%s>", host, service);
      |                   ^
include/socket.h:41:10: note: ‘sprintf’ output between 3 and 1058 bytes into a destination of size 1024
   41 |   sprintf(buf, "%s<%s>", host, service);
      |   ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bootstrap.cc:44:67: warning: ‘%s’ directive output may be truncated writing up to 1023 bytes into a region of size between 1017 and 1018 [-Wformat-truncation=]
   44 |           snprintf(line+strlen(line), 1023-strlen(line), " [%d]%s:%s", i, bootstrapNetIfNames+i*MAX_IF_NAME_SIZE,
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   45 |               socketToString(&bootstrapNetIfAddrs[i].sa, addrline));
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bootstrap.cc:44:19: note: ‘snprintf’ output 6 or more bytes (assuming 1030) into a destination of size 1023
   44 |           snprintf(line+strlen(line), 1023-strlen(line), " [%d]%s:%s", i, bootstrapNetIfNames+i*MAX_IF_NAME_SIZE,
      |           ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   45 |               socketToString(&bootstrapNetIfAddrs[i].sa, addrline));
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
make[2]: *** [Makefile:53: /home/aaa/000git/pytorch/build/nccl/obj/collectives/device/all_gather.dep] Error 1
make[2]: *** Waiting for unfinished jobs....
include/socket.h: In function ‘ncclResult_t ncclIbInit(ncclDebugLogger_t)’:
include/socket.h:41:19: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   41 |   sprintf(buf, "%s<%s>", host, service);
      |                   ^
make[2]: *** [Makefile:53: /home/aaa/000git/pytorch/build/nccl/obj/collectives/device/all_reduce.dep] Error 1
include/socket.h:41:10: note: ‘sprintf’ output between 3 and 1058 bytes into a destination of size 1024
   41 |   sprintf(buf, "%s<%s>", host, service);
      |   ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/socket.h:41:19: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   41 |   sprintf(buf, "%s<%s>", host, service);
      |                   ^
include/socket.h:41:10: note: ‘sprintf’ output between 3 and 1058 bytes into a destination of size 1024
   41 |   sprintf(buf, "%s<%s>", host, service);
      |   ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘int findInterfaces(const char*, char*, socketAddress*, int, int, int)’,
    inlined from ‘int findInterfaces(char*, socketAddress*, int, int)’ at include/socket.h:294:26,
    inlined from ‘ncclResult_t ncclIbInit(ncclDebugLogger_t)’ at transport/net_ib.cc:97:25:
include/socket.h:108:14: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 16 equals destination size [-Wstringop-truncation]
  108 |       strncpy(names+found*maxIfNameSize, interface->ifa_name, maxIfNameSize);
      |       ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/socket.h:108:14: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 16 equals destination size [-Wstringop-truncation]
In function ‘int findInterfaceMatchSubnet(char*, socketAddress*, socketAddress, int, int)’,
    inlined from ‘int findInterfaces(char*, socketAddress*, int, int)’ at include/socket.h:302:40,
    inlined from ‘ncclResult_t ncclIbInit(ncclDebugLogger_t)’ at transport/net_ib.cc:97:25:
include/socket.h:189:12: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 16 equals destination size [-Wstringop-truncation]
  189 |     strncpy(ifNames+found*ifNameMaxSize, interface->ifa_name, ifNameMaxSize);
      |     ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
make[2]: *** [Makefile:53: /home/aaa/000git/pytorch/build/nccl/obj/collectives/device/functions.dep] Error 1
make[2]: *** [Makefile:53: /home/aaa/000git/pytorch/build/nccl/obj/collectives/device/reduce_scatter.dep] Error 1
make[2]: *** [Makefile:53: /home/aaa/000git/pytorch/build/nccl/obj/collectives/device/broadcast.dep] Error 1
make[2]: *** [Makefile:53: /home/aaa/000git/pytorch/build/nccl/obj/collectives/device/reduce.dep] Error 1
make[2]: Leaving directory '/home/aaa/000git/pytorch/third_party/nccl/nccl/src/collectives/device'
make[1]: *** [Makefile:49: /home/aaa/000git/pytorch/build/nccl/obj/collectives/device/colldevice.a] Error 2
make[1]: *** Waiting for unfinished jobs....
include/socket.h: In function ‘ncclResult_t connectAddress(int*, socketAddress*)’:
include/socket.h:41:19: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   41 |   sprintf(buf, "%s<%s>", host, service);
      |                   ^
include/socket.h:41:10: note: ‘sprintf’ output between 3 and 1058 bytes into a destination of size 1024
   41 |   sprintf(buf, "%s<%s>", host, service);
      |   ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from init.cc:10:
include/param.h: In function ‘void setEnvFile(const char*)’:
include/param.h:37:12: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 1024 equals destination size [-Wstringop-truncation]
   37 |     strncpy(envValue, line+s, 1024);
      |     ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
make[1]: Leaving directory '/home/aaa/000git/pytorch/third_party/nccl/nccl/src'
make: *** [Makefile:25: src.build] Error 2
[23/3522] Building CXX object third_party/protobuf/cmake/CMakeFiles/libprotobuf-lite.dir/__/src/google/protobuf/extension_set.cc.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "setup.py", line 737, in <module>
    build_deps()
  File "setup.py", line 316, in build_deps
    cmake=cmake)
  File "/home/aaa/000git/pytorch/tools/build_pytorch_libs.py", line 62, in build_caffe2
    cmake.build(my_env)
  File "/home/aaa/000git/pytorch/tools/setup_helpers/cmake.py", line 339, in build
    self.run(build_args, my_env)
  File "/home/aaa/000git/pytorch/tools/setup_helpers/cmake.py", line 141, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/home/aaa/anaconda3/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '8']' returned non-zero exit status 1.

albanD · January 16, 2020, 4:17pm

Hi,

Do you use functions like inverse and eigenvalues? If not, you don’t need magma.
If you do need it, then you will have to recompile it if you cannot find an existing binary for it.
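
Once a build does succeed, a quick way to see whether MAGMA was linked in is to look at the `torch.cuda.has_magma` flag. This is only a minimal sketch: the helper name `magma_status` is made up for illustration, and the `getattr` fallback is there in case a given build does not expose the flag.

```python
def magma_status():
    """Return a short string saying whether the local PyTorch build has MAGMA."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    # has_magma is a plain bool on torch.cuda; default to False if it is missing.
    has_magma = getattr(torch.cuda, "has_magma", False)
    return "MAGMA available" if has_magma else "MAGMA not available"
```

If this reports "MAGMA not available", the build itself is still usable as long as you stay away from the linear algebra functions mentioned above on the GPU.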

Veril · January 16, 2020, 4:37pm

Perhaps unrelated, but the install fails completely.

Anaconda Python 3.7, Fedora 31 with CUDA 10.2
with libcudnn7 libcudnn7-devel libnccl libnccl-devel
installed from official repo:
https://developer.download.nvidia.com/compute/machine-learning/repos/rhel7/x86_64/nvidia-machine-learning-repo-rhel7-1.0.0-1.x86_64.rpm
as per:
https://rpmfusion.org/Howto/CUDA

albanD · January 16, 2020, 4:45pm

What fails completely? The cuda install? Or installing torch after?
Can you give some logs please?

Veril · January 16, 2020, 4:46pm

I edited the stdout from the torch install into my original post earlier.

I was following the instructions from the PyTorch GitHub main page.

albanD · January 16, 2020, 4:58pm

Hi,

If you look for the error in your log, you find: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
So I guess you need a lower version of gcc, or try and use clang?
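
To catch this before starting a long build, the host compiler version can be checked up front. The following is a rough sketch, not part of the PyTorch build: the limit of 8 is taken straight from the error message in the log, the function names are hypothetical, and it assumes the compiler accepts `-dumpversion` (gcc does).

```python
import re
import subprocess

# From the error in the log: CUDA 10.2 rejects gcc versions later than 8.
MAX_GCC_MAJOR = 8


def gcc_major(version_text):
    """Extract the major version from text such as '9.2.1' or '8'."""
    match = re.match(r"\s*(\d+)", version_text)
    if match is None:
        raise ValueError(f"cannot parse a version from {version_text!r}")
    return int(match.group(1))


def gcc_supported(version_text, max_major=MAX_GCC_MAJOR):
    """True if the given gcc version is not newer than the allowed major version."""
    return gcc_major(version_text) <= max_major


def describe_host_compiler(compiler="gcc"):
    """Run '<compiler> -dumpversion' and report whether nvcc should accept it."""
    try:
        result = subprocess.run(
            [compiler, "-dumpversion"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return f"{compiler}: could not determine version"
    version = result.stdout.strip()
    verdict = "ok" if gcc_supported(version) else "too new for CUDA 10.2"
    return f"{compiler} {version}: {verdict}"
```

Calling `describe_host_compiler("gcc")` on a stock Fedora 31 install would be expected to flag the compiler as too new, which matches the `#error` above.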

Veril · January 16, 2020, 5:01pm

How do I tell the setup to use clang?

albanD · January 16, 2020, 5:05pm

I’m afraid I’m not very familiar with Fedora. You can check online, but it is most likely just a package install.

Then you can specify the compiler with CC=/path/to/clang python setup.py install.
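
The `CC=...` prefix just sets an environment variable for that one command. CMake also reads `CXX` for the C++ compiler, so it is usually worth setting both. If you drive the build from a script, the same thing can be sketched in Python; the helper names `build_env` and `run_install` are hypothetical, and the compiler paths are placeholders.

```python
import os
import subprocess
import sys


def build_env(cc, cxx=None, base_env=None):
    """Return a copy of the environment with CC (and optionally CXX) set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CC"] = cc
    if cxx is not None:
        env["CXX"] = cxx
    return env


def run_install(source_dir, env):
    """Run 'python setup.py install' in source_dir with the given environment."""
    subprocess.check_call([sys.executable, "setup.py", "install"],
                          cwd=source_dir, env=env)
```

For example, `run_install("/path/to/pytorch", build_env("/path/to/clang", "/path/to/clang++"))` is the scripted equivalent of the one-liner above.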

Veril · January 16, 2020, 5:20pm

I guess I will have to downgrade gcc, since with clang the build can’t find OpenMP even though the library is installed.

Edit:
After installing libomp-devel the problems on that front ended.
But:
unsupported clang version! clang version must be less than 9 and greater than 3.2

Nvidia is truly cancer

Veril · January 16, 2020, 9:19pm

PyTorch is now successfully installed and functioning.

Solution:

dnf install https://negativo17.org/repos/nvidia/fedora-31/x86_64/cuda-gcc-8.3.0-1.fc31.x86_64.rpm
dnf install https://negativo17.org/repos/nvidia/fedora-31/x86_64/cuda-gcc-c++-8.3.0-1.fc31.x86_64.rpm
CC=cuda-gcc CXX=cuda-g++ python setup.py install