Error when building PyTorch from source

When I build PyTorch from source, the build fails with nvlink errors. I followed these instructions.

[386/3538] Performing build step for 'nccl_external'
FAILED: nccl_external-prefix/src/nccl_external-stamp/nccl_external-build nccl/lib/libnccl_static.a 
cd /codehub/external/pytorch/third_party/nccl/nccl && env CCACHE_DISABLE=1 SCCACHE_DISABLE=1 make CXX=/usr/bin/c++ CUDA_HOME=/usr/local/cuda NVCC=/usr/local/cuda/bin/nvcc "NVCC_GENCODE=-gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_37,code=sm_37 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_53,code=sm_53 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75" BUILDDIR=/codehub/external/pytorch/build/nccl VERBOSE=0 -j && /root/anaconda3/bin/cmake -E touch /codehub/external/pytorch/build/nccl_external-prefix/src/nccl_external-stamp/nccl_external-build
make -C src build BUILDDIR=/codehub/external/pytorch/build/nccl
make[1]: Entering directory '/codehub/external/pytorch/third_party/nccl/nccl/src'
Generating nccl.h.in                           > /codehub/external/pytorch/build/nccl/include/nccl.h
Grabbing   include/nccl_net.h                  > /codehub/external/pytorch/build/nccl/include/nccl_net.h
Compiling  init.cc                             > /codehub/external/pytorch/build/nccl/obj/init.o
Compiling  channel.cc                          > /codehub/external/pytorch/build/nccl/obj/channel.o
Compiling  bootstrap.cc                        > /codehub/external/pytorch/build/nccl/obj/bootstrap.o
Compiling  transport.cc                        > /codehub/external/pytorch/build/nccl/obj/transport.o
Compiling  enqueue.cc                          > /codehub/external/pytorch/build/nccl/obj/enqueue.o
Compiling  misc/group.cc                       > /codehub/external/pytorch/build/nccl/obj/misc/group.o
Compiling  misc/nvmlwrap.cc                    > /codehub/external/pytorch/build/nccl/obj/misc/nvmlwrap.o
Compiling  misc/ibvwrap.cc                     > /codehub/external/pytorch/build/nccl/obj/misc/ibvwrap.o
Compiling  misc/rings.cc                       > /codehub/external/pytorch/build/nccl/obj/misc/rings.o
Compiling  misc/utils.cc                       > /codehub/external/pytorch/build/nccl/obj/misc/utils.o
Compiling  misc/argcheck.cc                    > /codehub/external/pytorch/build/nccl/obj/misc/argcheck.o
Compiling  misc/trees.cc                       > /codehub/external/pytorch/build/nccl/obj/misc/trees.o
Compiling  misc/topo.cc                        > /codehub/external/pytorch/build/nccl/obj/misc/topo.o
Compiling  transport/p2p.cc                    > /codehub/external/pytorch/build/nccl/obj/transport/p2p.o
Compiling  transport/shm.cc                    > /codehub/external/pytorch/build/nccl/obj/transport/shm.o
Compiling  transport/net.cc                    > /codehub/external/pytorch/build/nccl/obj/transport/net.o
Compiling  transport/net_socket.cc             > /codehub/external/pytorch/build/nccl/obj/transport/net_socket.o
In file included from bootstrap.cc:12:
include/socket.h: In function ‘ncclResult_t connectAddress(int*, socketAddress*)’:
include/socket.h:41:16: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   sprintf(buf, "%s<%s>", host, service);
                ^~~~~~~~
In file included from /usr/include/stdio.h:862,
                 from include/debug.h:11,
                 from include/core.h:13,
                 from bootstrap.cc:8:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:33:34: note: ‘__builtin___sprintf_chk’ output between 3 and 1058 bytes into a destination of size 1024
   return __builtin___sprintf_chk (__s, __USE_FORTIFY_LEVEL - 1,
          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       __bos (__s), __fmt, __va_arg_pack ());
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compiling  transport/net_ib.cc                 > /codehub/external/pytorch/build/nccl/obj/transport/net_ib.o
Compiling  collectives/all_reduce.cc           > /codehub/external/pytorch/build/nccl/obj/collectives/all_reduce.o
In file included from bootstrap.cc:12:
include/socket.h: In function ‘int findInterfaceMatchSubnet(char*, socketAddress*, socketAddress, int, int)’:
include/socket.h:41:16: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   sprintf(buf, "%s<%s>", host, service);
                ^~~~~~~~
In file included from /usr/include/stdio.h:862,
                 from include/debug.h:11,
                 from include/core.h:13,
                 from bootstrap.cc:8:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:33:34: note: ‘__builtin___sprintf_chk’ output between 3 and 1058 bytes into a destination of size 1024
   return __builtin___sprintf_chk (__s, __USE_FORTIFY_LEVEL - 1,
          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       __bos (__s), __fmt, __va_arg_pack ());
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compiling  collectives/all_gather.cc           > /codehub/external/pytorch/build/nccl/obj/collectives/all_gather.o
Compiling  collectives/broadcast.cc            > /codehub/external/pytorch/build/nccl/obj/collectives/broadcast.o
Compiling  collectives/reduce.cc               > /codehub/external/pytorch/build/nccl/obj/collectives/reduce.o
Compiling  collectives/reduce_scatter.cc       > /codehub/external/pytorch/build/nccl/obj/collectives/reduce_scatter.o
make[2]: Entering directory '/codehub/external/pytorch/third_party/nccl/nccl/src/collectives/device'
Generating nccl.pc.in                          > /codehub/external/pytorch/build/nccl/lib/pkgconfig/nccl.pc
Generating rules                               > /codehub/external/pytorch/build/nccl/obj/collectives/device/Makefile.rules
In file included from bootstrap.cc:12:
include/socket.h: In function ‘ncclResult_t bootstrapNetInit()’:
include/socket.h:41:16: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   sprintf(buf, "%s<%s>", host, service);
                ^~~~~~~~
In file included from /usr/include/stdio.h:862,
                 from include/debug.h:11,
                 from include/core.h:13,
                 from bootstrap.cc:8:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:33:34: note: ‘__builtin___sprintf_chk’ output between 3 and 1058 bytes into a destination of size 1024
   return __builtin___sprintf_chk (__s, __USE_FORTIFY_LEVEL - 1,
          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       __bos (__s), __fmt, __va_arg_pack ());
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bootstrap.cc:44:58: warning: ‘%s’ directive output may be truncated writing up to 1023 bytes into a region of size between 1017 and 1018 [-Wformat-truncation=]
           snprintf(line+strlen(line), 1023-strlen(line), " [%d]%s:%s", i, bootstrapNetIfNames+i*MAX_IF_NAME_SIZE,
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
               socketToString(&bootstrapNetIfAddrs[i].sa, addrline));
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/include/stdio.h:862,
                 from include/debug.h:11,
                 from include/core.h:13,
                 from bootstrap.cc:8:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:64:35: note: ‘__builtin___snprintf_chk’ output 6 or more bytes (assuming 1030) into a destination of size 1023
   return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1,
          ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        __bos (__s), __fmt, __va_arg_pack ());
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from transport/net_socket.cc:9:
include/socket.h: In function ‘ncclResult_t connectAddress(int*, socketAddress*)’:
include/socket.h:41:16: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   sprintf(buf, "%s<%s>", host, service);
                ^~~~~~~~
In file included from /usr/include/stdio.h:862,
                 from include/debug.h:11,
                 from include/core.h:13,
                 from transport/net_socket.cc:8:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:33:34: note: ‘__builtin___sprintf_chk’ output between 3 and 1058 bytes into a destination of size 1024
   return __builtin___sprintf_chk (__s, __USE_FORTIFY_LEVEL - 1,
          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       __bos (__s), __fmt, __va_arg_pack ());
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from transport/net_ib.cc:9:
include/socket.h: In function ‘ncclResult_t ncclIbInit(ncclDebugLogger_t)’:
include/socket.h:41:16: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   sprintf(buf, "%s<%s>", host, service);
                ^~~~~~~~
In file included from /usr/include/stdio.h:862,
                 from include/debug.h:11,
                 from include/core.h:13,
                 from transport/net_ib.cc:8:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:33:34: note: ‘__builtin___sprintf_chk’ output between 3 and 1058 bytes into a destination of size 1024
   return __builtin___sprintf_chk (__s, __USE_FORTIFY_LEVEL - 1,
          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       __bos (__s), __fmt, __va_arg_pack ());
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from transport/net_ib.cc:9:
include/socket.h:41:16: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   sprintf(buf, "%s<%s>", host, service);
                ^~~~~~~~
In file included from /usr/include/stdio.h:862,
                 from include/debug.h:11,
                 from include/core.h:13,
                 from transport/net_ib.cc:8:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:33:34: note: ‘__builtin___sprintf_chk’ output between 3 and 1058 bytes into a destination of size 1024
   return __builtin___sprintf_chk (__s, __USE_FORTIFY_LEVEL - 1,
          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       __bos (__s), __fmt, __va_arg_pack ());
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from transport/net_socket.cc:9:
include/socket.h: In function ‘ncclResult_t ncclSocketInit(ncclDebugLogger_t)’:
include/socket.h:41:16: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   sprintf(buf, "%s<%s>", host, service);
                ^~~~~~~~
In file included from /usr/include/stdio.h:862,
                 from include/debug.h:11,
                 from include/core.h:13,
                 from transport/net_socket.cc:8:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:33:34: note: ‘__builtin___sprintf_chk’ output between 3 and 1058 bytes into a destination of size 1024
   return __builtin___sprintf_chk (__s, __USE_FORTIFY_LEVEL - 1,
          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       __bos (__s), __fmt, __va_arg_pack ());
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from transport/net_socket.cc:9:
include/socket.h:41:16: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   sprintf(buf, "%s<%s>", host, service);
                ^~~~~~~~
In file included from /usr/include/stdio.h:862,
                 from include/debug.h:11,
                 from include/core.h:13,
                 from transport/net_socket.cc:8:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:33:34: note: ‘__builtin___sprintf_chk’ output between 3 and 1058 bytes into a destination of size 1024
   return __builtin___sprintf_chk (__s, __USE_FORTIFY_LEVEL - 1,
          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       __bos (__s), __fmt, __va_arg_pack ());
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
transport/net_socket.cc:40:58: warning: ‘%s’ directive output may be truncated writing up to 1023 bytes into a region of size between 1017 and 1018 [-Wformat-truncation=]
           snprintf(line+strlen(line), 1023-strlen(line), " [%d]%s:%s", i, ncclNetIfNames+i*MAX_IF_NAME_SIZE,
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
               socketToString(&ncclNetIfAddrs[i].sa, addrline));
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/include/stdio.h:862,
                 from include/debug.h:11,
                 from include/core.h:13,
                 from transport/net_socket.cc:8:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:64:35: note: ‘__builtin___snprintf_chk’ output 6 or more bytes (assuming 1030) into a destination of size 1023
   return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1,
          ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        __bos (__s), __fmt, __va_arg_pack ());
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from transport/net_ib.cc:9:
include/socket.h: In function ‘ncclResult_t ncclIbConnect(int, void*, void**)’:
include/socket.h:41:16: warning: ‘<’ directive writing 1 byte into a region of size between 0 and 1024 [-Wformat-overflow=]
   sprintf(buf, "%s<%s>", host, service);
                ^~~~~~~~
In file included from /usr/include/stdio.h:862,
                 from include/debug.h:11,
                 from include/core.h:13,
                 from transport/net_ib.cc:8:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:33:34: note: ‘__builtin___sprintf_chk’ output between 3 and 1058 bytes into a destination of size 1024
   return __builtin___sprintf_chk (__s, __USE_FORTIFY_LEVEL - 1,
          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       __bos (__s), __fmt, __va_arg_pack ());
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_sum_i8.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_sum_u8.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_sum_i32.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_sum_u32.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_sum_i64.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_sum_u64.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_sum_f16.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_sum_f32.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_sum_f64.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_prod_i8.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_prod_u8.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_prod_i32.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_prod_u32.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_prod_i64.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_prod_u64.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_prod_f16.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_prod_f32.o
Compiling  all_gather.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_gather_prod_f64.o
.
.
.
.
.
Compiling  all_reduce.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_reduce_min_f16.o
Compiling  all_reduce.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_reduce_min_f32.o
Compiling  all_reduce.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_reduce_min_f64.o
Compiling  all_reduce.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_reduce_max_i8.o
Compiling  all_reduce.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_reduce_max_u8.o
Compiling  all_reduce.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_reduce_max_i32.o
Compiling  all_reduce.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_reduce_max_u32.o
Compiling  all_reduce.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_reduce_max_i64.o
Compiling  all_reduce.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_reduce_max_u64.o
Compiling  all_reduce.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_reduce_max_f16.o
Compiling  all_reduce.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_reduce_max_f32.o
Compiling  all_reduce.cu                       > /codehub/external/pytorch/build/nccl/obj/collectives/device/all_reduce_max_f64.o
nvlink error   : entry function '_Z37ncclReduceScatterTreeLLKernel_sum_f648ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z37ncclReduceScatterRingLLKernel_sum_f648ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z37ncclReduceScatterTreeLLKernel_sum_f328ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z37ncclReduceScatterRingLLKernel_sum_f328ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z37ncclReduceScatterTreeLLKernel_sum_f168ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z37ncclReduceScatterRingLLKernel_sum_f168ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z37ncclReduceScatterTreeLLKernel_sum_u648ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z37ncclReduceScatterRingLLKernel_sum_u648ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z37ncclReduceScatterTreeLLKernel_sum_i648ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z37ncclReduceScatterRingLLKernel_sum_i648ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z37ncclReduceScatterTreeLLKernel_sum_u328ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z37ncclReduceScatterRingLLKernel_sum_u328ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z37ncclReduceScatterTreeLLKernel_sum_i328ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z37ncclReduceScatterRingLLKernel_sum_i328ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z36ncclReduceScatterTreeLLKernel_sum_u88ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z36ncclReduceScatterRingLLKernel_sum_u88ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z36ncclReduceScatterTreeLLKernel_sum_i88ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z36ncclReduceScatterRingLLKernel_sum_i88ncclColl' with max regcount of 80 calls function '_Z24ncclAllReduceRing_sum_i8P14CollectiveArgs' with regcount of 96 (target: sm_53)
.
.
.
.
.
some 10,000 more lines of nvlink errors
.
.
.
.
.
.
nvlink error   : entry function '_Z33ncclAllReduceTreeLLKernel_sum_u328ncclColl' with max regcount of 80 calls function '_Z29ncclReduceScatterRing_max_f64P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z33ncclAllReduceRingLLKernel_sum_u328ncclColl' with max regcount of 80 calls function '_Z29ncclReduceScatterRing_max_f64P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z33ncclAllReduceTreeLLKernel_sum_i328ncclColl' with max regcount of 80 calls function '_Z29ncclReduceScatterRing_max_f64P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z33ncclAllReduceRingLLKernel_sum_i328ncclColl' with max regcount of 80 calls function '_Z29ncclReduceScatterRing_max_f64P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z32ncclAllReduceTreeLLKernel_sum_u88ncclColl' with max regcount of 80 calls function '_Z29ncclReduceScatterRing_max_f64P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z32ncclAllReduceRingLLKernel_sum_u88ncclColl' with max regcount of 80 calls function '_Z29ncclReduceScatterRing_max_f64P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z32ncclAllReduceTreeLLKernel_sum_i88ncclColl' with max regcount of 80 calls function '_Z29ncclReduceScatterRing_max_f64P14CollectiveArgs' with regcount of 96 (target: sm_53)
nvlink error   : entry function '_Z32ncclAllReduceRingLLKernel_sum_i88ncclColl' with max regcount of 80 calls function '_Z29ncclReduceScatterRing_max_f64P14CollectiveArgs' with regcount of 96 (target: sm_53)
Makefile:68: recipe for target '/codehub/external/pytorch/build/nccl/obj/collectives/device/devlink.o' failed
make[2]: *** [/codehub/external/pytorch/build/nccl/obj/collectives/device/devlink.o] Error 255
make[2]: Leaving directory '/codehub/external/pytorch/third_party/nccl/nccl/src/collectives/device'
Makefile:49: recipe for target '/codehub/external/pytorch/build/nccl/obj/collectives/device/colldevice.a' failed
make[1]: *** [/codehub/external/pytorch/build/nccl/obj/collectives/device/colldevice.a] Error 2
make[1]: Leaving directory '/codehub/external/pytorch/third_party/nccl/nccl/src'
Makefile:25: recipe for target 'src.build' failed
make: *** [src.build] Error 2
[387/3538] Generating src/x86_64-fma/blas/shdotxf.py.o
[388/3538] Building CXX object third_party/fbgemm/CMakeFiles/fbgemm_generic.dir/src/EmbeddingSpMDM.cc.o
[389/3538] Building CXX object third_party/fbgemm/CMakeFiles/fbgemm_generic.dir/src/EmbeddingSpMDMNBit.cc.o
[390/3538] Building CXX object third_party/fbgemm/CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8Depthwise3DAvx2.cc.o
[391/3538] Building CXX object third_party/fbgemm/CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8DepthwisePerChannelQuantAvx2.cc.o
[392/3538] Building CXX object third_party/fbgemm/CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8Depthwise3x3Avx2.cc.o
[393/3538] Building CXX object third_party/fbgemm/CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8DepthwiseAvx2.cc.o
ninja: build stopped: subcommand failed.
Building wheel torch-1.5.0a0+9857d9b
-- Building version 1.5.0a0+9857d9b
cmake -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/codehub/external/pytorch/torch -DCMAKE_PREFIX_PATH=/root/anaconda3 -DNUMPY_INCLUDE_DIR=/root/anaconda3/lib/python3.7/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/root/anaconda3/bin/python -DPYTHON_INCLUDE_DIR=/root/anaconda3/include/python3.7m -DPYTHON_LIBRARY=/root/anaconda3/lib/libpython3.7m.so.1.0 -DTORCH_BUILD_VERSION=1.5.0a0+9857d9b -DUSE_NUMPY=True /codehub/external/pytorch
cmake --build . --target install --config Release -- -j 8
Traceback (most recent call last):
  File "setup.py", line 737, in <module>
    build_deps()
  File "setup.py", line 316, in build_deps
    cmake=cmake)
  File "/codehub/external/pytorch/tools/build_pytorch_libs.py", line 62, in build_caffe2
    cmake.build(my_env)
  File "/codehub/external/pytorch/tools/setup_helpers/cmake.py", line 341, in build
    self.run(build_args, my_env)
  File "/codehub/external/pytorch/tools/setup_helpers/cmake.py", line 141, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/root/anaconda3/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '8']' returned non-zero exit status 1.

This summary is from the build log.

******** Summary ********
-- General:
--   CMake version         : 3.14.0
--   CMake command         : /root/anaconda3/bin/cmake
--   System                : Linux
--   C++ compiler          : /usr/bin/c++
--   C++ compiler id       : GNU
--   C++ compiler version  : 8.3.0
--   BLAS                  : MKL
--   CXX flags             :  -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow
--   Build type            : Release
--   Compile definitions   : TH_BLAS_MKL;ONNX_ML=1;ONNX_NAMESPACE=onnx_torch;MAGMA_V2;IDEEP_USE_MKL;HAVE_MMAP=1;_FILE_OFFSET_BITS=64;HAVE_SHM_OPEN=1;HAVE_SHM_UNLINK=1;HAVE_MALLOC_USABLE_SIZE=1;MINIZ_DISABLE_ZIP_READER_CRC32_CHECKS
--   CMAKE_PREFIX_PATH     : /root/anaconda3;/usr/local/cuda
--   CMAKE_INSTALL_PREFIX  : /codehub/external/pytorch/torch
--   TORCH_VERSION         : 1.5.0
--   CAFFE2_VERSION        : 1.5.0
--   BUILD_CAFFE2_MOBILE   : ON
--   USE_STATIC_DISPATCH   : OFF
--   BUILD_BINARY          : OFF
--   BUILD_CUSTOM_PROTOBUF : ON
--     Link local protobuf : ON
--   BUILD_DOCS            : OFF
--   BUILD_PYTHON          : True
--     Python version      : 3.7.4
--     Python executable   : /root/anaconda3/bin/python
--     Pythonlibs version  : 3.7.4
--     Python library      : /root/anaconda3/lib/libpython3.7m.so.1.0
--     Python includes     : /root/anaconda3/include/python3.7m
--     Python site-packages: lib/python3.7/site-packages
--   BUILD_CAFFE2_OPS      : ON
--   BUILD_SHARED_LIBS     : ON
--   BUILD_TEST            : True
--   BUILD_JNI             : OFF
--   INTERN_BUILD_MOBILE   : 
--   USE_ASAN              : OFF
--   USE_CUDA              : ON
--     CUDA static link    : OFF
--     USE_CUDNN           : ON
--     CUDA version        : 10.1
--     cuDNN version       : 7.6.5
--     CUDA root directory : /usr/local/cuda
--     CUDA library        : /usr/local/cuda/lib64/stubs/libcuda.so
--     cudart library      : /usr/local/cuda/lib64/libcudart.so
--     cublas library      : /usr/lib/x86_64-linux-gnu/libcublas.so
--     cufft library       : /usr/local/cuda/lib64/libcufft.so
--     curand library      : /usr/local/cuda/lib64/libcurand.so
--     cuDNN library       : /usr/lib/x86_64-linux-gnu/libcudnn.so
--     nvrtc               : /usr/local/cuda/lib64/libnvrtc.so
--     CUDA include path   : /usr/local/cuda/include
--     NVCC executable     : /usr/local/cuda/bin/nvcc
--     NVCC flags          : -DONNX_NAMESPACE=onnx_torch;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_53,code=sm_53;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-Xcudafe;--diag_suppress=cc_clobber_ignored;-Xcudafe;--diag_suppress=integer_sign_change;-Xcudafe;--diag_suppress=useless_using_declaration;-Xcudafe;--diag_suppress=set_but_not_used;-std=c++14;-Xcompiler;-fPIC;--expt-relaxed-constexpr;--expt-extended-lambda;-Wno-deprecated-gpu-targets;--expt-extended-lambda;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_53,code=sm_53;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-Xcompiler -fPIC;-DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__
--     CUDA host compiler  : /usr/bin/cc
--     USE_TENSORRT        : OFF
--   USE_ROCM              : OFF
--   USE_EIGEN_FOR_BLAS    : 
--   USE_FBGEMM            : ON
--   USE_FFMPEG            : OFF
--   USE_GFLAGS            : OFF
--   USE_GLOG              : OFF
--   USE_LEVELDB           : OFF
--   USE_LITE_PROTO        : OFF
--   USE_LMDB              : OFF
--   USE_METAL             : OFF
--   USE_MKL               : ON
--   USE_MKLDNN            : ON
--   USE_MKLDNN_CBLAS      : OFF
--   USE_NCCL              : ON
--     USE_SYSTEM_NCCL     : OFF
--   USE_NNPACK            : ON
--   USE_NUMPY             : ON
--   USE_OBSERVERS         : ON
--   USE_OPENCL            : OFF
--   USE_OPENCV            : OFF
--   USE_OPENMP            : ON
--   USE_TBB               : OFF
--   USE_PROF              : OFF
--   USE_QNNPACK           : ON
--   USE_REDIS             : OFF
--   USE_ROCKSDB           : OFF
--   USE_ZMQ               : OFF
--   USE_DISTRIBUTED       : ON
--     USE_MPI             : OFF
--     USE_GLOO            : ON
--   Public Dependencies  : Threads::Threads;caffe2::mkl;caffe2::mkldnn
--   Private Dependencies : qnnpack;pytorch_qnnpack;nnpack;cpuinfo;fbgemm;fp16;gloo;aten_op_header_gen;foxi_loader;rt;gcc_s;gcc;dl
-- Configuring done
CMake Warning at caffe2/CMakeLists.txt:622 (add_library):
  Cannot generate a safe runtime search path for target torch_cpu because
  files in some directories may conflict with libraries in implicit
  directories:

    runtime library [libgomp.so.1] in /usr/lib/gcc/x86_64-linux-gnu/8 may be hidden by files in:
      /root/anaconda3/lib

  Some of these libraries may not be found correctly.


CMake Warning at cmake/Modules_CUDA_fix/upstream/FindCUDA.cmake:1847 (add_library):
  Cannot generate a safe linker search path for target
  caffe2_detectron_ops_gpu because files in some directories may conflict
  with libraries in implicit directories:

    link library [libgomp.so] in /usr/lib/gcc/x86_64-linux-gnu/8 may be hidden by files in:
      /root/anaconda3/lib

  Some of these libraries may not be found correctly.
Call Stack (most recent call first):
  modules/detectron/CMakeLists.txt:13 (CUDA_ADD_LIBRARY)


CMake Warning at cmake/Modules_CUDA_fix/upstream/FindCUDA.cmake:1847 (add_library):
  Cannot generate a safe runtime search path for target
  caffe2_detectron_ops_gpu because files in some directories may conflict
  with libraries in implicit directories:

    runtime library [libgomp.so.1] in /usr/lib/gcc/x86_64-linux-gnu/8 may be hidden by files in:
      /root/anaconda3/lib

  Some of these libraries may not be found correctly.
Call Stack (most recent call first):
  modules/detectron/CMakeLists.txt:13 (CUDA_ADD_LIBRARY)


-- Generating done
-- Build files have been written to: /codehub/external/pytorch/build

Compute capability 5.3 is not supported in NCCL, as it refers to the Tegra and Jetson platforms (related issue).

Did you specify the architectures before building PyTorch?

This is the output of deviceQuery.
The GPU has compute capability 5.0.

Device 0: "GeForce MX130"
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2004 MBytes (2101870592 bytes)
  ( 3) Multiprocessors, (128) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            1189 MHz (1.19 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1

I did not specify the architectures before building PyTorch.

That’s strange. Somehow -gencode=arch=compute_53,code=sm_53 is generated.
Could you try to build it via:

TORCH_CUDA_ARCH_LIST="5.0" python setup.py install
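To illustrate why restricting the architecture list drops the unsupported sm_53 target, here is a minimal sketch of how a list such as "5.0" could map to -gencode flags like the ones in the log above. The helper name gencode_flags is hypothetical, and this is not PyTorch's actual CMake logic, which may handle more cases than plain "X.Y" entries.

```python
from typing import List


def gencode_flags(arch_list: str) -> List[str]:
    """Map a TORCH_CUDA_ARCH_LIST-style string to nvcc -gencode flags.

    Hypothetical helper for illustration only: it accepts plain "X.Y"
    entries separated by spaces or semicolons and nothing else.
    """
    flags = []
    for entry in arch_list.replace(";", " ").split():
        major, minor = entry.split(".")
        num = f"{int(major)}{int(minor)}"
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
    return flags


if __name__ == "__main__":
    # With only "5.0" requested, no compute_53/sm_53 flag is produced.
    print(" ".join(gencode_flags("5.0")))
```

With the default (unrestricted) list, the log shows one such flag per architecture from 3.5 to 7.5, including 5.3, which is the one NCCL rejects.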

Sure, I’ll try that.

This solves the nvlink error.

But a different error showed up later in the build.


[3064/3538] Building CXX object caffe2/CMakeFiles/torch_cuda.dir/__/torch/csrc/cuda/nccl.cpp.o
../torch/csrc/cuda/nccl.cpp: In function ‘ncclDataType_t torch::cuda::nccl::detail::get_data_type(const at::Tensor&)’:
../torch/csrc/cuda/nccl.cpp:85:14: warning: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Wdeprecated-declarations]
   if (t.type().backend() != Backend::CUDA) {
              ^
In file included from ../aten/src/ATen/Tensor.h:11,
                 from ../aten/src/ATen/Context.h:4,
                 from ../aten/src/ATen/ATen.h:5,
                 from ../torch/csrc/cuda/nccl.h:3,
                 from ../torch/csrc/cuda/nccl.cpp:1:
aten/src/ATen/core/TensorBody.h:244:30: note: declared here
   DeprecatedTypeProperties & type() const {
                              ^~~~
../torch/csrc/cuda/nccl.cpp: In function ‘void torch::cuda::nccl::detail::check_inputs(at::TensorList, at::TensorList, int, int)’:
../torch/csrc/cuda/nccl.cpp:129:30: warning: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Wdeprecated-declarations]
   auto type = inputs[0].type();
                              ^
In file included from ../aten/src/ATen/Tensor.h:11,
                 from ../aten/src/ATen/Context.h:4,
                 from ../aten/src/ATen/ATen.h:5,
                 from ../torch/csrc/cuda/nccl.h:3,
                 from ../torch/csrc/cuda/nccl.cpp:1:
aten/src/ATen/core/TensorBody.h:244:30: note: declared here
   DeprecatedTypeProperties & type() const {
                              ^~~~
../torch/csrc/cuda/nccl.cpp:141:30: warning: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Wdeprecated-declarations]
     if (!(type == input.type() && type == output.type())) {
                              ^
In file included from ../aten/src/ATen/Tensor.h:11,
                 from ../aten/src/ATen/Context.h:4,
                 from ../aten/src/ATen/ATen.h:5,
                 from ../torch/csrc/cuda/nccl.h:3,
                 from ../torch/csrc/cuda/nccl.cpp:1:
aten/src/ATen/core/TensorBody.h:244:30: note: declared here
   DeprecatedTypeProperties & type() const {
                              ^~~~
../torch/csrc/cuda/nccl.cpp:141:55: warning: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Wdeprecated-declarations]
     if (!(type == input.type() && type == output.type())) {
                                                       ^
In file included from ../aten/src/ATen/Tensor.h:11,
                 from ../aten/src/ATen/Context.h:4,
                 from ../aten/src/ATen/ATen.h:5,
                 from ../torch/csrc/cuda/nccl.h:3,
                 from ../torch/csrc/cuda/nccl.cpp:1:
aten/src/ATen/core/TensorBody.h:244:30: note: declared here
   DeprecatedTypeProperties & type() const {
                              ^~~~
../torch/csrc/cuda/nccl.cpp: In function ‘bool torch::cuda::nccl::is_available(at::TensorList)’:
../torch/csrc/cuda/nccl.cpp:181:29: warning: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Wdeprecated-declarations]
     auto type = tensor.type();
                             ^
In file included from ../aten/src/ATen/Tensor.h:11,
                 from ../aten/src/ATen/Context.h:4,
                 from ../aten/src/ATen/ATen.h:5,
                 from ../torch/csrc/cuda/nccl.h:3,
                 from ../torch/csrc/cuda/nccl.cpp:1:
aten/src/ATen/core/TensorBody.h:244:30: note: declared here
   DeprecatedTypeProperties & type() const {
                              ^~~~
[3078/3538] Linking CXX executable bin/utility_ops_gpu_test
FAILED: bin/utility_ops_gpu_test 
: && /usr/bin/c++  -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3  -rdynamic    -rdynamic caffe2/CMakeFiles/utility_ops_gpu_test.dir/operators/utility_ops_gpu_test.cc.o  -o bin/utility_ops_gpu_test -L/root/anaconda3/lib -Wl,-rpath,/usr/local/cuda/lib64:/root/anaconda3/lib:/codehub/external/pytorch/build/lib: /usr/local/cuda/lib64/libcudart.so lib/libgtest_main.a -Wl,--no-as-needed,/codehub/external/pytorch/build/lib/libtorch.so -Wl,--as-needed -Wl,--no-as-needed,/codehub/external/pytorch/build/lib/libtorch_cpu.so -Wl,--as-needed lib/libprotobuf.a -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -fopenmp -lpthread -lm -ldl lib/libmkldnn.a -Wl,--no-as-needed,/codehub/external/pytorch/build/lib/libtorch_cuda.so -Wl,--as-needed lib/libc10_cuda.so lib/libc10.so /usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcudart.so /usr/local/cuda/lib64/libcufft.so /usr/local/cuda/lib64/libcurand.so -lcublas /usr/lib/x86_64-linux-gnu/libcudnn.so lib/libgtest.a -lpthread && :
/codehub/external/pytorch/build/lib/libtorch_cuda.so: undefined reference to `cudaSetupArgument'
/codehub/external/pytorch/build/lib/libtorch_cuda.so: undefined reference to `cudaConfigureCall'
/codehub/external/pytorch/build/lib/libtorch_cuda.so: undefined reference to `cudaLaunch'
collect2: error: ld returned 1 exit status
[3085/3538] Building CXX object caffe2/CMakeFiles/net_dag_utils_test.dir/core/net_dag_utils_test.cc.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "setup.py", line 737, in <module>
    build_deps()
  File "setup.py", line 316, in build_deps
    cmake=cmake)
  File "/codehub/external/pytorch/tools/build_pytorch_libs.py", line 62, in build_caffe2
    cmake.build(my_env)
  File "/codehub/external/pytorch/tools/setup_helpers/cmake.py", line 341, in build
    self.run(build_args, my_env)
  File "/codehub/external/pytorch/tools/setup_helpers/cmake.py", line 141, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/root/anaconda3/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '8']' returned non-zero exit status 1.

Could you clean the last failing build via python setup.py clean and rebuild?
Also, which nvcc version are you using (nvcc --version)?

Output of nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
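
If the toolkit version needs to be checked from a script (for example, to compare it against the runtime version reported by deviceQuery), the release number can be pulled out of this output. A small sketch follows; the helper name parse_nvcc_release is hypothetical, and it assumes the "release X.Y" wording shown above.

```python
import re
from typing import Optional, Tuple


def parse_nvcc_release(output: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The output pasted in this thread, used as sample input.
SAMPLE_OUTPUT = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
"""

if __name__ == "__main__":
    print(parse_nvcc_release(SAMPLE_OUTPUT))
```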

I had already cleaned the last failing build via python setup.py clean before the rebuild that produced the previous error.

I cleaned the build once more and rebuilt, and this time it worked. I'm not sure what the problem was.

Thanks.

@ptrblck How should I go about installing torchvision after building PyTorch from source?
If I install it via conda install torchvision -c pytorch or pip install torchvision, it would downgrade the PyTorch version.
Do I also have to build torchvision from source?

Yes, you would have to build torchvision from source, which should be easier than building PyTorch.
Running python setup.py install in the torchvision directory should do the job.

Hi,
I got a similar error while building for compute capability 3.0 (GPU: NVIDIA Quadro K4200).
I am building in a conda environment with Python 3.7.6 and gcc-7 (the default gcc is 11),
CUDA 10.2, cuDNN 7.6.5, and torch version 1.6.0.
I also tried to build the latest version: it succeeded, but without CUDA.
After following all the steps carefully from GitHub - pytorch/pytorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration
I get this error:

2 errors detected in the compilation of "/tmp/tmpxft_000945f5_00000000-6_reduce_scatter.cpp1.ii".
2 errors detected in the compilation of "/tmp/tmpxft_0009460d_00000000-6_reduce_scatter.cpp1.ii".
2 errors detected in the compilation of "/tmp/tmpxft_0009460a_00000000-6_reduce_scatter.cpp1.ii".
2 errors detected in the compilation of "/tmp/tmpxft_00094622_00000000-6_reduce_scatter.cpp1.ii".
make[2]: *** [/home/jaypatel/pytorch/build/nccl/obj/collectives/device/Makefile.rules:1329: /home/jaypatel/pytorch/build/nccl/obj/collectives/device/reduce_scatter_avg_i64.o] Error 1
make[2]: *** [/home/jaypatel/pytorch/build/nccl/obj/collectives/device/Makefile.rules:1319: /home/jaypatel/pytorch/build/nccl/obj/collectives/device/reduce_scatter_avg_i32.o] Error 1
2 errors detected in the compilation of "/tmp/tmpxft_00094624_00000000-6_reduce_scatter.cpp1.ii".
make[2]: *** [/home/jaypatel/pytorch/build/nccl/obj/collectives/device/Makefile.rules:1339: /home/jaypatel/pytorch/build/nccl/obj/collectives/device/reduce_scatter_avg_f16.o] Error 1
make[2]: *** [/home/jaypatel/pytorch/build/nccl/obj/collectives/device/Makefile.rules:1324: /home/jaypatel/pytorch/build/nccl/obj/collectives/device/reduce_scatter_avg_u32.o] Error 1
make[2]: *** [/home/jaypatel/pytorch/build/nccl/obj/collectives/device/Makefile.rules:1344: /home/jaypatel/pytorch/build/nccl/obj/collectives/device/reduce_scatter_avg_f32.o] Error 1
2 errors detected in the compilation of "/tmp/tmpxft_00094628_00000000-6_reduce_scatter.cpp1.ii".
make[2]: *** [/home/jaypatel/pytorch/build/nccl/obj/collectives/device/Makefile.rules:1349: /home/jaypatel/pytorch/build/nccl/obj/collectives/device/reduce_scatter_avg_f64.o] Error 1
make[2]: Leaving directory '/home/jaypatel/pytorch/third_party/nccl/nccl/src/collectives/device'
make[1]: *** [Makefile:50: /home/jaypatel/pytorch/build/nccl/obj/collectives/device/colldevice.a] Error 2
make[1]: Leaving directory '/home/jaypatel/pytorch/third_party/nccl/nccl/src'
make: *** [Makefile:25: src.build] Error 2
[16/4351] Building C object confu-deps/XNNPACK/CMakeFiles/XNNPACK.dir/src/f32-vbinary/gen/vaddc-minmax-avx-x16.c.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "setup.py", line 732, in <module>
    build_deps()
  File "setup.py", line 316, in build_deps
    cmake=cmake)
  File "/home/jaypatel/pytorch/tools/build_pytorch_libs.py", line 62, in build_caffe2
    cmake.build(my_env)
  File "/home/jaypatel/pytorch/tools/setup_helpers/cmake.py", line 345, in build
    self.run(build_args, my_env)
  File "/home/jaypatel/pytorch/tools/setup_helpers/cmake.py", line 141, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/usr/local/anaconda/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '4']' returned non-zero exit status 1.

I have also tried the steps mentioned above, but they didn't help; I got the same error.
Please help.
Thanks in advance.

You might need to disable the NCCL build as the min. compute capability seems to be 3.5 based on this definition. Once this is done, you would then have to check if the current PyTorch source has the same min. compute capability.
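
As a rough sketch of that decision, the build environment could be assembled like this. The helper name build_env is hypothetical, the 3.5 minimum is simply the value quoted above, and USE_NCCL=0 is assumed here to be the switch for the USE_NCCL option shown in the CMake summary; please check it against the setup.py of your checkout.

```python
from typing import Dict, Tuple

# Minimum compute capability for NCCL as quoted in this thread (not verified here).
NCCL_MIN_CAPABILITY = (3, 5)


def build_env(capability: Tuple[int, int], base_env: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of base_env with build variables set for one GPU.

    Hypothetical helper: it pins TORCH_CUDA_ARCH_LIST to the given
    capability and sets USE_NCCL=0 (assumed variable name) when the GPU
    is below the NCCL minimum.
    """
    env = dict(base_env)
    env["TORCH_CUDA_ARCH_LIST"] = f"{capability[0]}.{capability[1]}"
    if capability < NCCL_MIN_CAPABILITY:
        env["USE_NCCL"] = "0"
    return env


if __name__ == "__main__":
    # Compute capability 3.0: NCCL is switched off.
    print(build_env((3, 0), {}))
```

The resulting dictionary could then be merged with os.environ and passed as env= to subprocess.run when invoking python setup.py install.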

Can I use CUDA if I disable NCCL?
I am not sure, but I think I tried it and it didn't work; I will try again though.
How do I check the minimum compute capability of the PyTorch source?
Sorry, I am new to PyTorch. I have used TensorFlow (an older version, of course) and built it from source successfully with the same GPU.
Should I be using an older torch version to build? I have read threads here where some people were able to build v1.6.0 (using git checkout). I have tried some older versions too, but I get the same error.

Yes, you can use CUDA without NCCL, as NCCL is used for distributed workloads.

An easy way would be to try to build it for the specified architecture and see if the source code uses functions from a later compute capability. Another way would be to check for known functions introduced for specific architectures.

You could try an older version, as I don’t know which version supported 3.0. The current binaries ship with compute capabilities 3.7 to 8.6.

After disabling NCCL, I got this error:

[4162/5950] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/onnx/backend.cc.o
FAILED: caffe2/CMakeFiles/torch_cpu.dir/onnx/backend.cc.o 
/usr/bin/g++-7  -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMAGMA_V2 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -Iaten/src -I../aten/src -I. -I../ -I../cmake/../third_party/benchmark/include -Icaffe2/contrib/aten -I../third_party/onnx -Ithird_party/onnx -I../third_party/foxi -Ithird_party/foxi -I../caffe2/../torch/csrc/api -I../caffe2/../torch/csrc/api/include -I../caffe2/aten/src/TH -Icaffe2/aten/src/TH -I../caffe2/../torch/../aten/src -Icaffe2/aten/src -Icaffe2/../aten/src -Icaffe2/../aten/src/ATen -I../caffe2/../torch/csrc -I../caffe2/../torch/../third_party/miniz-2.0.8 -I../aten/src/TH -I../aten/../third_party/catch/single_include -I../aten/src/ATen/.. -Icaffe2/aten/src/ATen -I../third_party/miniz-2.0.8 -I../caffe2/core/nomnigraph/include -I../third_party/FXdiv/include -I../c10/.. 
-I../third_party/pthreadpool/include -I../third_party/cpuinfo/include -I../third_party/QNNPACK/include -I../aten/src/ATen/native/quantized/cpu/qnnpack/include -I../aten/src/ATen/native/quantized/cpu/qnnpack/src -I../third_party/cpuinfo/deps/clog/include -I../third_party/NNPACK/include -I../third_party/fbgemm/include -I../third_party/fbgemm -I../third_party/fbgemm/third_party/asmjit/src -I../third_party/FP16/include -I../third_party/tensorpipe -Ithird_party/tensorpipe -I../third_party/tensorpipe/third_party/libnop/include -I../third_party/fmt/include -isystem third_party/gloo -isystem ../cmake/../third_party/gloo -isystem ../cmake/../third_party/googletest/googlemock/include -isystem ../cmake/../third_party/googletest/googletest/include -isystem ../third_party/protobuf/src -isystem /usr/local/anaconda/include -isystem ../third_party/gemmlowp -isystem ../third_party/neon2sse -isystem ../third_party/XNNPACK/include -isystem ../third_party -isystem ../cmake/../third_party/eigen -isystem /usr/local/anaconda/include/python3.7m -isystem /usr/local/anaconda/lib/python3.7/site-packages/numpy/core/include -isystem ../cmake/../third_party/pybind11/include -isystem ../cmake/../third_party/cub -isystem include -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow 
-DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -fPIC   -DCAFFE2_USE_GLOO -DCUDA_HAS_FP16=1 -DHAVE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD -std=c++14 -Wall -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas -Wno-missing-braces -Wno-maybe-uninitialized -fvisibility=hidden -O2 -fopenmp -DCAFFE2_BUILD_MAIN_LIB -pthread -DASMJIT_STATIC -std=gnu++14 -MD -MT caffe2/CMakeFiles/torch_cpu.dir/onnx/backend.cc.o -MF caffe2/CMakeFiles/torch_cpu.dir/onnx/backend.cc.o.d -o caffe2/CMakeFiles/torch_cpu.dir/onnx/backend.cc.o -c ../caffe2/onnx/backend.cc
../caffe2/onnx/backend.cc:11:10: fatal error: onnx/optimizer/optimize.h: No such file or directory
 #include "onnx/optimizer/optimize.h"
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
[4165/5950] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/contrib/aten/aten_op.cc.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "setup.py", line 732, in <module>
    build_deps()
  File "setup.py", line 316, in build_deps
    cmake=cmake)
  File "/home/jaypatel/pytorch/tools/build_pytorch_libs.py", line 62, in build_caffe2
    cmake.build(my_env)
  File "/home/jaypatel/pytorch/tools/setup_helpers/cmake.py", line 345, in build
    self.run(build_args, my_env)
  File "/home/jaypatel/pytorch/tools/setup_helpers/cmake.py", line 141, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/usr/local/anaconda/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '4']' returned non-zero exit status 1.

It seems that this file is missing in your build:

fatal error: onnx/optimizer/optimize.h: No such file or directory

so make sure to update all submodules before trying to build.
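
A minimal sketch of that step, assuming the usual git submodule commands and a PyTorch checkout at repo_dir. The helper names are hypothetical, and the header location is inferred from the #include line together with the -I../third_party/onnx flag in the log, so treat it as an assumption.

```python
import os
import subprocess
from typing import List

# Standard git commands for syncing and fetching all submodules.
SUBMODULE_COMMANDS: List[List[str]] = [
    ["git", "submodule", "sync"],
    ["git", "submodule", "update", "--init", "--recursive"],
]

# Assumed location of the header reported missing in the log above.
MISSING_HEADER = os.path.join("third_party", "onnx", "onnx", "optimizer", "optimize.h")


def update_submodules(repo_dir: str, dry_run: bool = False) -> List[str]:
    """Run the submodule commands in repo_dir and return them as strings.

    With dry_run=True nothing is executed; the commands are only listed.
    """
    issued = []
    for cmd in SUBMODULE_COMMANDS:
        issued.append(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, cwd=repo_dir, check=True)
    return issued


def header_present(repo_dir: str) -> bool:
    """Check whether the header from the error message exists in the checkout."""
    return os.path.isfile(os.path.join(repo_dir, MISSING_HEADER))


if __name__ == "__main__":
    # Only list the commands here; drop dry_run to actually run them.
    for line in update_submodules(".", dry_run=True):
        print(line)
```

Running header_present on the checkout before starting the long build is a cheap way to confirm the submodule state matches the commit being built.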

I did notice that and restarted the build from scratch. It's building now with no errors so far, though it is taking a long time.
I will update once the build is complete.
Thanks for the help, I really appreciate your time and effort.

I was finally able to build torch version 1.10.1 with compute capability 3.0.
I also tested it, and it works like a charm. Thanks for all the help.
