Compiling PyTorch from source with CUDA enabled fails with errors

I’m using python setup.py develop to install PyTorch from the latest source code and I’m getting tons of CUDA-related error messages. The same build worked fine previously. Can anyone give me a hint on how to make it work? Thanks!
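
For reference, the build steps are roughly the standard ones from the PyTorch README (a rough sketch; the exact environment variables used on this machine may differ):

$ git clone --recursive https://github.com/pytorch/pytorch
$ cd pytorch
$ git submodule sync
$ git submodule update --init --recursive
$ pip install -r requirements.txt
$ python setup.py develop

The errors look like this: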

/tmp/tmpxft_00059808_00000000-6_ActivationGeluKernel.cudafe1.stub.c:549: error: template argument 7 is invalid
/tmp/tmpxft_00059808_00000000-6_ActivationGeluKernel.cudafe1.stub.c:549: error: insufficient contextual information to determine type
  549 | static void __nv_cudaEntityRegisterCallback(void **__T2693){__nv_dummy_param_ref(__T2693);__nv_save_fatbinhandle_for_managed_rt(__T2693);__cudaRegisterEntry(__T2693, ((void ( *)(int,  ...

nvcc_internal_extended_lambda_implementation:217:8: note: provided for ‘template<bool IsMutable, bool HasFuncPtrConv, bool NeverThrows, class Tag, class OpFunc, class ... CapturedVarTypePack> struct __nv_hdl_wrapper_t’
In file included from /tmp/tmpxft_00059808_00000000-6_ActivationGeluKernel.cudafe1.stub.c:8,
                 from tmpxft_00059808_00000000-6_ActivationGeluKernel.cudafe1.stub.c:1:
/tmp/tmpxft_00059808_00000000-6_ActivationGeluKernel.cudafe1.stub.c:549: error: insufficient contextual information to determine type
  549 | static void __nv_cudaEntityRegisterCallback(void **__T2693){__nv_dummy_param_ref(__T2693);__nv_save_fatbinhandle_for_managed_rt(__T2693);__cudaRegisterEntry(__T2693, ((void ( *)(int,  ...

/tmp/tmpxft_00059808_00000000-6_ActivationGeluKernel.cudafe1.stub.c:549: error: insufficient contextual information to determine type
  549 | static void __nv_cudaEntityRegisterCallback(void **__T2693){__nv_dummy_param_ref(__T2693);__nv_save_fatbinhandle_for_managed_rt(__T2693);__cudaRegisterEntry(__T2693, ((void ( *)(int,  ....

[7923/8804] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/torch/csrc/distributed/c10d/Utils.cu.o
[7924/8804] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/cuda/cub-RadixSortKeys.cu.o

[7925/8804] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/cuda/cub-RadixSortPairs.cu.o
ninja: build stopped: subcommand failed.

Here’s the environment info:

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.5
Libc version: glibc-2.35

Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-92-generic-x86_64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: Tesla T4
GPU 1: Tesla T4

Nvidia driver version: 545.23.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             16
On-line CPU(s) list:                0-15
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Gold 6151 CPU @ 3.00GHz
CPU family:                         6
Model:                              85
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
Stepping:                           4
BogoMIPS:                           6000.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat md_clear flush_l1d
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          256 KiB (8 instances)
L1i cache:                          256 KiB (8 instances)
L2 cache:                           8 MiB (8 instances)
L3 cache:                           24.8 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-15
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                 Mitigation; PTE Inversion
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:             Vulnerable
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Retbleed:             Vulnerable
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Mitigation; Clear CPU buffers; SMT Host state unknown

Versions of relevant libraries:
[pip3] flake8==6.1.0
[pip3] flake8-bugbear==23.3.23
[pip3] flake8-comprehensions==3.15.0
[pip3] flake8-executable==2.1.3
[pip3] flake8-logging-format==0.9.0
[pip3] flake8-pyi==23.3.1
[pip3] flake8-simplify==0.19.3
[pip3] mypy==1.10.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.0
[pip3] optree==0.12.1
[pip3] pytorch-triton==3.0.0+45fff310c8
[conda] magma-cuda110             2.5.2                         1    pytorch
[conda] mkl-include               2024.1.0              intel_691    intel
[conda] mkl-static                2024.1.0              intel_691    intel
[conda] numpy                     1.26.0                   pypi_0    pypi
[conda] optree                    0.12.1                   pypi_0    pypi
[conda] pytorch-triton            3.0.0+45fff310c8          pypi_0    pypi
[conda] torchfix                  0.4.0                    pypi_0    pypi

Which PyTorch commit/version and CUDA version are you trying to use?
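
You can grab both with something like:

$ cd pytorch && git rev-parse HEAD
$ nvcc --version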

The CUDA version is:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
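
That is whatever nvcc is first on PATH; as a sanity check it can be compared against the toolkit under /usr/local/cuda (default install paths assumed):

$ which -a nvcc
$ readlink -f "$(which nvcc)"
$ /usr/local/cuda/bin/nvcc --version
$ echo $CUDA_HOME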

The PyTorch commit I’m building is the latest on the main branch (08/13):

$ python setup.py develop
Building wheel torch-2.5.0a0+git4d11a9b
-- Building version 2.5.0a0+git4d11a9b
cmake --build . --target install --config Release
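
In case stale artifacts from the previously working build matter, a full clean before rebuilding looks like this (standard steps, nothing specific to this failure; rm -rf build just wipes the CMake build tree):

$ python setup.py clean
$ rm -rf build
$ python setup.py develop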

Thanks for your help!

Here’s more output:

[1/896] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu.o
FAILED: caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu.o
/usr/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/zongzesheng/code/origin/pytorch/build/aten/src -I/home/zongzesheng/code/origin/pytorch/aten/src -I/home/zongzesheng/code/origin/pytorch/build -I/home/zongzesheng/code/origin/pytorch -I/home/zongzesheng/code/origin/pytorch/cmake/../third_party/benchmark/include -I/home/zongzesheng/code/origin/pytorch/third_party/onnx -I/home/zongzesheng/code/origin/pytorch/build/third_party/onnx -I/home/zongzesheng/code/origin/pytorch/nlohmann -I/home/zongzesheng/code/origin/pytorch/aten/src/THC -I/home/zongzesheng/code/origin/pytorch/aten/src/ATen/cuda -I/home/zongzesheng/code/origin/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/zongzesheng/code/origin/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/zongzesheng/code/origin/pytorch/build/caffe2/aten/src -I/home/zongzesheng/code/origin/pytorch/aten/src/ATen/.. -I/home/zongzesheng/code/origin/pytorch/build/nccl/include -I/home/zongzesheng/code/origin/pytorch/c10/cuda/../.. -I/home/zongzesheng/code/origin/pytorch/c10/.. -I/usr/local/cuda/include -I/home/zongzesheng/code/origin/pytorch/third_party/tensorpipe -I/home/zongzesheng/code/origin/pytorch/build/third_party/tensorpipe -I/home/zongzesheng/code/origin/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/zongzesheng/code/origin/pytorch/torch/csrc/api -I/home/zongzesheng/code/origin/pytorch/torch/csrc/api/include -isystem /home/zongzesheng/code/origin/pytorch/build/third_party/gloo -isystem /home/zongzesheng/code/origin/pytorch/cmake/../third_party/gloo -isystem /home/zongzesheng/code/origin/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/zongzesheng/code/origin/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/zongzesheng/code/origin/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/zongzesheng/code/origin/pytorch/third_party/protobuf/src -isystem /home/zongzesheng/software/miniconda3/include -isystem /home/zongzesheng/code/origin/pytorch/third_party/XNNPACK/include -isystem /home/zongzesheng/code/origin/pytorch/third_party/ittapi/include -isystem /home/zongzesheng/code/origin/pytorch/cmake/../third_party/eigen -isystem /home/zongzesheng/code/origin/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/zongzesheng/code/origin/pytorch/third_party/ideep/include -isystem /home/zongzesheng/code/origin/pytorch/INTERFACE -isystem /home/zongzesheng/code/origin/pytorch/third_party/nlohmann/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_75,code=sm_75 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets 
--expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DMKL_HAS_SBGEMM -DMKL_HAS_SHGEMM -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu.o.d -x cu -c /home/zongzesheng/code/origin/pytorch/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu.o
In file included from tmpxft_0007fa02_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:1:
/tmp/tmpxft_0007fa02_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:12:202: error: type/value mismatch at argument 3 in template parameter list for ‘template<bool IsMutable, bool HasFuncPtrConv, bool NeverThrows, class Tag, class OpFunc, class ... CapturedVarTypePack> struct __nv_hdl_wrapper_t’
   12 | typedef __nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)(   ::at::TensorIteratorBase &, const  ::c10::Scalar &),(&    ::at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double> _ZZZZN67_INTERNAL_8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_5229632at6native69_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_52296317hardshrink_kernelERNS0_18TensorIteratorBaseERKN3c106ScalarEENKUlvE_clEvENKUlvE_clEvEUldE_;
      |                                                                                                                                                                                                          ^
/tmp/tmpxft_0007fa02_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:12:202: note:   expected a constant of type ‘bool’, got ‘__nv_dl_tag<void (*)(at::TensorIteratorBase&, const c10::Scalar&), at::native::_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_522963::hardshrink_kernel, 1>’
/tmp/tmpxft_0007fa02_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:13:288: error: type/value mismatch at argument 3 in template parameter list for ‘template<bool IsMutable, bool HasFuncPtrConv, bool NeverThrows, class Tag, class OpFunc, class ... CapturedVarTypePack> struct __nv_hdl_wrapper_t’
   13 | typedef __nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double> &),(& :: at::native::gpu_kernel_impl< ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double>> ), 1> ,void (int),    ::OffsetCalculator<(int)2, unsigned int, (bool)0> , ::at::detail::Array<char *, (int)2> ,const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double>, ::at::detail::Array< ::c10::ScalarType, (int)2> > _ZZN2at6native15gpu_kernel_implIZZZNS0_69_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_52296317hardshrink_kernelERNS_18TensorIteratorBaseERKN3c106ScalarEENKUlvE_clEvENKUlvE_clEvEUldE_EEvS4_RKT_EUliE_;
      |                                                                                                                                                                                                                                                                                                ^
/tmp/tmpxft_0007fa02_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:13:288: note:   expected a constant of type ‘bool’, got ‘__nv_dl_tag<void (*)(at::TensorIteratorBase&, const c10::Scalar&), at::native::_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_522963::hardshrink_kernel, 1>’
In file included from tmpxft_0007fa02_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:1:
/tmp/tmpxft_0007fa02_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:13:513: error: type/value mismatch at argument 3 in template parameter list for ‘template<bool IsMutable, bool HasFuncPtrConv, bool NeverThrows, class Tag, class OpFunc, class ... CapturedVarTypePack> struct __nv_hdl_wrapper_t’
   13 | typedef __nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double> &),(& :: at::native::gpu_kernel_impl< ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double>> ), 1> ,void (int),    ::OffsetCalculator<(int)2, unsigned int, (bool)0> , ::at::detail::Array<char *, (int)2> ,const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double>, ::at::detail::Array< ::c10::ScalarType, (int)2> > _ZZN2at6native15gpu_kernel_implIZZZNS0_69_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_52296317hardshrink_kernelERNS_18TensorIteratorBaseERKN3c106ScalarEENKUlvE_clEvENKUlvE_clEvEUldE_EEvS4_RKT_EUliE_;
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ^~~~~~
/tmp/tmpxft_0007fa02_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:13:513: note:   expected a constant of type ‘bool’, got ‘__nv_dl_tag<void (*)(at::TensorIteratorBase&, const c10::Scalar&), at::native::_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_522963::hardshrink_kernel, 1>’
/tmp/tmpxft_0007fa02_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:13:526: error: no matches converting function ‘gpu_kernel_impl’ to type ‘void (*)(struct at::TensorIteratorBase&, const int&)’
   13 | typedef __nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double> &),(& :: at::native::gpu_kernel_impl< ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double>> ), 1> ,void (int),    ::OffsetCalculator<(int)2, unsigned int, (bool)0> , ::at::detail::Array<char *, (int)2> ,const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double>, ::at::detail::Array< ::c10::ScalarType, (int)2> > _ZZN2at6native15gpu_kernel_implIZZZNS0_69_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_52296317hardshrink_kernelERNS_18TensorIteratorBaseERKN3c106ScalarEENKUlvE_clEvENKUlvE_clEvEUldE_EEvS4_RKT_EUliE_;
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ^
/home/zongzesheng/code/origin/pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh:283:1: note: candidate is: ‘template<class func_t> void at::native::gpu_kernel_impl(at::TensorIteratorBase&, const func_t&)’
  283 | void gpu_kernel_impl(TensorIteratorBase& iter, const func_t& f) {
      | ^~~~~~~~~~~~~~~
In file included from tmpxft_0007fa02_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:1:
/tmp/tmpxft_0007fa02_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:13:831: error: type/value mismatch at argument 3 in template parameter list for ‘template<bool IsMutable, bool HasFuncPtrConv, bool NeverThrows, class Tag, class OpFunc, class ... CapturedVarTypePack> struct __nv_hdl_wrapper_t’
   13 | typedef __nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double> &),(& :: at::native::gpu_kernel_impl< ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double>> ), 1> ,void (int),    ::OffsetCalculator<(int)2, unsigned int, (bool)0> , ::at::detail::Array<char *, (int)2> ,const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double>, ::at::detail::Array< ::c10::ScalarType, (int)2> > _ZZN2at6native15gpu_kernel_implIZZZNS0_69_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_52296317hardshrink_kernelERNS_18TensorIteratorBaseERKN3c106ScalarEENKUlvE_clEvENKUlvE_clEvEUldE_EEvS4_RKT_EUliE_;
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               ^
/tmp/tmpxft_0007fa02_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:13:831: note:   expected a constant of type ‘bool’, got ‘__nv_dl_tag<void (*)(at::TensorIteratorBase&, const c10::Scalar&), at::native::_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_522963::hardshrink_kernel, 1>’

Executing the nvcc command manually with --verbose:

/usr/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/zongzesheng/code/origin/pytorch/build/aten/src -I/home/zongzesheng/code/origin/pytorch/aten/src -I/home/zongzesheng/code/origin/pytorch/build -I/home/zongzesheng/code/origin/pytorch -I/home/zongzesheng/code/origin/pytorch/cmake/../third_party/benchmark/include -I/home/zongzesheng/code/origin/pytorch/third_party/onnx -I/home/zongzesheng/code/origin/pytorch/build/third_party/onnx -I/home/zongzesheng/code/origin/pytorch/nlohmann -I/home/zongzesheng/code/origin/pytorch/aten/src/THC -I/home/zongzesheng/code/origin/pytorch/aten/src/ATen/cuda -I/home/zongzesheng/code/origin/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/zongzesheng/code/origin/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/zongzesheng/code/origin/pytorch/build/caffe2/aten/src -I/home/zongzesheng/code/origin/pytorch/aten/src/ATen/.. -I/home/zongzesheng/code/origin/pytorch/build/nccl/include -I/home/zongzesheng/code/origin/pytorch/c10/cuda/../.. -I/home/zongzesheng/code/origin/pytorch/c10/.. -I/usr/local/cuda/include -I/home/zongzesheng/code/origin/pytorch/third_party/tensorpipe -I/home/zongzesheng/code/origin/pytorch/build/third_party/tensorpipe -I/home/zongzesheng/code/origin/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/zongzesheng/code/origin/pytorch/torch/csrc/api -I/home/zongzesheng/code/origin/pytorch/torch/csrc/api/include -isystem /home/zongzesheng/code/origin/pytorch/build/third_party/gloo -isystem /home/zongzesheng/code/origin/pytorch/cmake/../third_party/gloo -isystem /home/zongzesheng/code/origin/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/zongzesheng/code/origin/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/zongzesheng/code/origin/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/zongzesheng/code/origin/pytorch/third_party/protobuf/src -isystem /home/zongzesheng/software/miniconda3/include -isystem /home/zongzesheng/code/origin/pytorch/third_party/XNNPACK/include -isystem /home/zongzesheng/code/origin/pytorch/third_party/ittapi/include -isystem /home/zongzesheng/code/origin/pytorch/cmake/../third_party/eigen -isystem /home/zongzesheng/code/origin/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/zongzesheng/code/origin/pytorch/third_party/ideep/include -isystem /home/zongzesheng/code/origin/pytorch/INTERFACE -isystem /home/zongzesheng/code/origin/pytorch/third_party/nlohmann/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_75,code=sm_75 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets 
--expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DMKL_HAS_SBGEMM -DMKL_HAS_SHGEMM -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu.o.d -x cu -c /home/zongzesheng/code/origin/pytorch/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu.o --verbose
#$ _NVVM_BRANCH_=nvvm
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/usr/lib/nvidia-cuda-toolkit/bin
#$ _THERE_=/usr/lib/nvidia-cuda-toolkit/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_SIZE_=64
#$ NVVMIR_LIBRARY_DIR=/usr/lib/nvidia-cuda-toolkit/libdevice
#$ PATH=/usr/lib/nvidia-cuda-toolkit/bin:/usr/local/cuda/bin:/home/zongzesheng/software/cmake/bin:/home/zongzesheng/software/act:/home/zongzesheng/software/miniconda3/bin:/home/zongzesheng/software/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
#$ LIBRARIES=  -L/usr/lib/x86_64-linux-gnu/stubs -L/usr/lib/x86_64-linux-gnu
#$ gcc -std=c++17 -D__CUDA_ARCH__=750 -D__CUDA_ARCH_LIST__=750 -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ -D__CUDACC_EXTENDED_LAMBDA__ -D__CUDACC_RELAXED_CONSTEXPR__  -fPIC -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -O3 -I"/home/zongzesheng/code/origin/pytorch/build/aten/src" -I"/home/zongzesheng/code/origin/pytorch/aten/src" -I"/home/zongzesheng/code/origin/pytorch/build" -I"/home/zongzesheng/code/origin/pytorch" -I"/home/zongzesheng/code/origin/pytorch/cmake/../third_party/benchmark/include" -I"/home/zongzesheng/code/origin/pytorch/third_party/onnx" -I"/home/zongzesheng/code/origin/pytorch/build/third_party/onnx" -I"/home/zongzesheng/code/origin/pytorch/nlohmann" -I"/home/zongzesheng/code/origin/pytorch/aten/src/THC" -I"/home/zongzesheng/code/origin/pytorch/aten/src/ATen/cuda" -I"/home/zongzesheng/code/origin/pytorch/aten/src/ATen/../../../third_party/cutlass/include" -I"/home/zongzesheng/code/origin/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include" -I"/home/zongzesheng/code/origin/pytorch/build/caffe2/aten/src" -I"/home/zongzesheng/code/origin/pytorch/aten/src/ATen/.." -I"/home/zongzesheng/code/origin/pytorch/build/nccl/include" -I"/home/zongzesheng/code/origin/pytorch/c10/cuda/../.." -I"/home/zongzesheng/code/origin/pytorch/c10/.." -I"/usr/local/cuda/include" -I"/home/zongzesheng/code/origin/pytorch/third_party/tensorpipe" -I"/home/zongzesheng/code/origin/pytorch/build/third_party/tensorpipe" -I"/home/zongzesheng/code/origin/pytorch/third_party/tensorpipe/third_party/libnop/include" -I"/home/zongzesheng/code/origin/pytorch/torch/csrc/api" -I"/home/zongzesheng/code/origin/pytorch/torch/csrc/api/include" -isystem "/home/zongzesheng/code/origin/pytorch/build/third_party/gloo" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/gloo" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/googletest/googlemock/include" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/googletest/googletest/include" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/protobuf/src" -isystem "/home/zongzesheng/software/miniconda3/include" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/XNNPACK/include" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/ittapi/include" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/eigen" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/ideep/include" -isystem "/home/zongzesheng/code/origin/pytorch/INTERFACE" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/nlohmann/include"  -D "AT_PER_OPERATOR_HEADERS" -D "FLASHATTENTION_DISABLE_ALIBI" -D "HAVE_MALLOC_USABLE_SIZE=1" -D "HAVE_MMAP=1" -D "HAVE_SHM_OPEN=1" -D "HAVE_SHM_UNLINK=1" -D "IDEEP_USE_MKL" -D "MINIZ_DISABLE_ZIP_READER_CRC32_CHECKS" -D "ONNXIFI_ENABLE_EXT=1" -D "ONNX_ML=1" -D "ONNX_NAMESPACE=onnx_torch" -D "TORCH_CUDA_BUILD_MAIN_LIB" -D "USE_C10D_GLOO" -D "USE_C10D_NCCL" -D "USE_CUDA" -D "USE_DISTRIBUTED" -D "USE_EXTERNAL_MZCRC" -D "USE_FLASH_ATTENTION" -D "USE_MEM_EFF_ATTENTION" -D "USE_NCCL" -D "USE_RPC" -D "USE_TENSORPIPE" -D 
"_FILE_OFFSET_BITS=64" -D "torch_cuda_EXPORTS" -D "LIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS" -D "_GLIBCXX_USE_CXX11_ABI=1" -D "ONNX_NAMESPACE=onnx_torch" -D "CUB_WRAPPED_NAMESPACE=at_cuda_detail" -D "CUDA_HAS_FP16=1" -D "__CUDA_NO_HALF_OPERATORS__" -D "__CUDA_NO_HALF_CONVERSIONS__" -D "__CUDA_NO_HALF2_OPERATORS__" -D "__CUDA_NO_BFLOAT16_CONVERSIONS__" -D "NDEBUG" -D "MKL_HAS_SBGEMM" -D "MKL_HAS_SHGEMM" -D "TORCH_USE_LIBUV" -D "CAFFE2_USE_GLOO" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=119 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/home/zongzesheng/code/origin/pytorch/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu" -o "/tmp/tmpxft_000826c0_00000000-7_ActivationHardshrinkKernel.cpp1.ii"
#$ gcc -std=c++17 -D__CUDA_ARCH_LIST__=750 -E -x c++ -D__CUDACC__ -D__NVCC__ -D__CUDACC_EXTENDED_LAMBDA__ -D__CUDACC_RELAXED_CONSTEXPR__  -fPIC -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -O3 -I"/home/zongzesheng/code/origin/pytorch/build/aten/src" -I"/home/zongzesheng/code/origin/pytorch/aten/src" -I"/home/zongzesheng/code/origin/pytorch/build" -I"/home/zongzesheng/code/origin/pytorch" -I"/home/zongzesheng/code/origin/pytorch/cmake/../third_party/benchmark/include" -I"/home/zongzesheng/code/origin/pytorch/third_party/onnx" -I"/home/zongzesheng/code/origin/pytorch/build/third_party/onnx" -I"/home/zongzesheng/code/origin/pytorch/nlohmann" -I"/home/zongzesheng/code/origin/pytorch/aten/src/THC" -I"/home/zongzesheng/code/origin/pytorch/aten/src/ATen/cuda" -I"/home/zongzesheng/code/origin/pytorch/aten/src/ATen/../../../third_party/cutlass/include" -I"/home/zongzesheng/code/origin/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include" -I"/home/zongzesheng/code/origin/pytorch/build/caffe2/aten/src" -I"/home/zongzesheng/code/origin/pytorch/aten/src/ATen/.." -I"/home/zongzesheng/code/origin/pytorch/build/nccl/include" -I"/home/zongzesheng/code/origin/pytorch/c10/cuda/../.." -I"/home/zongzesheng/code/origin/pytorch/c10/.." -I"/usr/local/cuda/include" -I"/home/zongzesheng/code/origin/pytorch/third_party/tensorpipe" -I"/home/zongzesheng/code/origin/pytorch/build/third_party/tensorpipe" -I"/home/zongzesheng/code/origin/pytorch/third_party/tensorpipe/third_party/libnop/include" -I"/home/zongzesheng/code/origin/pytorch/torch/csrc/api" -I"/home/zongzesheng/code/origin/pytorch/torch/csrc/api/include" -isystem "/home/zongzesheng/code/origin/pytorch/build/third_party/gloo" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/gloo" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/googletest/googlemock/include" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/googletest/googletest/include" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/protobuf/src" -isystem "/home/zongzesheng/software/miniconda3/include" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/XNNPACK/include" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/ittapi/include" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/eigen" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/ideep/include" -isystem "/home/zongzesheng/code/origin/pytorch/INTERFACE" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/nlohmann/include"  -D "AT_PER_OPERATOR_HEADERS" -D "FLASHATTENTION_DISABLE_ALIBI" -D "HAVE_MALLOC_USABLE_SIZE=1" -D "HAVE_MMAP=1" -D "HAVE_SHM_OPEN=1" -D "HAVE_SHM_UNLINK=1" -D "IDEEP_USE_MKL" -D "MINIZ_DISABLE_ZIP_READER_CRC32_CHECKS" -D "ONNXIFI_ENABLE_EXT=1" -D "ONNX_ML=1" -D "ONNX_NAMESPACE=onnx_torch" -D "TORCH_CUDA_BUILD_MAIN_LIB" -D "USE_C10D_GLOO" -D "USE_C10D_NCCL" -D "USE_CUDA" -D "USE_DISTRIBUTED" -D "USE_EXTERNAL_MZCRC" -D "USE_FLASH_ATTENTION" -D "USE_MEM_EFF_ATTENTION" -D "USE_NCCL" -D "USE_RPC" -D "USE_TENSORPIPE" -D "_FILE_OFFSET_BITS=64" -D "torch_cuda_EXPORTS" -D 
"LIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS" -D "_GLIBCXX_USE_CXX11_ABI=1" -D "ONNX_NAMESPACE=onnx_torch" -D "CUB_WRAPPED_NAMESPACE=at_cuda_detail" -D "CUDA_HAS_FP16=1" -D "__CUDA_NO_HALF_OPERATORS__" -D "__CUDA_NO_HALF_CONVERSIONS__" -D "__CUDA_NO_HALF2_OPERATORS__" -D "__CUDA_NO_BFLOAT16_CONVERSIONS__" -D "NDEBUG" -D "MKL_HAS_SBGEMM" -D "MKL_HAS_SHGEMM" -D "TORCH_USE_LIBUV" -D "CAFFE2_USE_GLOO" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=5 -D__CUDACC_VER_BUILD__=119 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/home/zongzesheng/code/origin/pytorch/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu" -o "/tmp/tmpxft_000826c0_00000000-5_ActivationHardshrinkKernel.cpp4.ii"
#$ -- Filter Dependencies -- > caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu.o.d
#$ cicc --c++17 --gnu_version=110400 --display_error_number --orig_src_file_name "/home/zongzesheng/code/origin/pytorch/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu" --orig_src_path_name "/home/zongzesheng/code/origin/pytorch/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu" --allow_managed --extended-lambda --relaxed_constexpr --diag_suppress=cc_clobber_ignored --diag_suppress=field_without_dll_interface --diag_suppress=base_class_has_different_dll_interface --diag_suppress=dll_interface_conflict_none_assumed --diag_suppress=dll_interface_conflict_dllexport_assumed --diag_suppress=bad_friend_decl  -arch compute_75 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_000826c0_00000000-3_ActivationHardshrinkKernel.fatbin.c" -tused --gen_module_id_file --module_id_file_name "/tmp/tmpxft_000826c0_00000000-4_ActivationHardshrinkKernel.module_id" --gen_c_file_name "/tmp/tmpxft_000826c0_00000000-6_ActivationHardshrinkKernel.cudafe1.c" --stub_file_name "/tmp/tmpxft_000826c0_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_000826c0_00000000-6_ActivationHardshrinkKernel.cudafe1.gpu"  "/tmp/tmpxft_000826c0_00000000-7_ActivationHardshrinkKernel.cpp1.ii" -o "/tmp/tmpxft_000826c0_00000000-6_ActivationHardshrinkKernel.ptx"
#$ ptxas -arch=sm_75 -m64 "/tmp/tmpxft_000826c0_00000000-6_ActivationHardshrinkKernel.ptx"  -o "/tmp/tmpxft_000826c0_00000000-8_ActivationHardshrinkKernel.cubin"
#$ fatbinary -64 -compress-all --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=75,file=/tmp/tmpxft_000826c0_00000000-8_ActivationHardshrinkKernel.cubin" --embedded-fatbin="/tmp/tmpxft_000826c0_00000000-3_ActivationHardshrinkKernel.fatbin.c"
#$ rm /tmp/tmpxft_000826c0_00000000-3_ActivationHardshrinkKernel.fatbin
#$ cudafe++ --c++17 --gnu_version=110400 --display_error_number --orig_src_file_name "/home/zongzesheng/code/origin/pytorch/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu" --orig_src_path_name "/home/zongzesheng/code/origin/pytorch/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu" --allow_managed --extended-lambda --relaxed_constexpr --diag_suppress=cc_clobber_ignored --diag_suppress=field_without_dll_interface --diag_suppress=base_class_has_different_dll_interface --diag_suppress=dll_interface_conflict_none_assumed --diag_suppress=dll_interface_conflict_dllexport_assumed --diag_suppress=bad_friend_decl --m64 --parse_templates --gen_c_file_name "/tmp/tmpxft_000826c0_00000000-6_ActivationHardshrinkKernel.cudafe1.cpp" --stub_file_name "tmpxft_000826c0_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c" --module_id_file_name "/tmp/tmpxft_000826c0_00000000-4_ActivationHardshrinkKernel.module_id" "/tmp/tmpxft_000826c0_00000000-5_ActivationHardshrinkKernel.cpp4.ii"
#$ gcc -std=c++17 -D__CUDA_ARCH__=750 -D__CUDA_ARCH_LIST__=750 -c -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -fPIC -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -O3 -I"/home/zongzesheng/code/origin/pytorch/build/aten/src" -I"/home/zongzesheng/code/origin/pytorch/aten/src" -I"/home/zongzesheng/code/origin/pytorch/build" -I"/home/zongzesheng/code/origin/pytorch" -I"/home/zongzesheng/code/origin/pytorch/cmake/../third_party/benchmark/include" -I"/home/zongzesheng/code/origin/pytorch/third_party/onnx" -I"/home/zongzesheng/code/origin/pytorch/build/third_party/onnx" -I"/home/zongzesheng/code/origin/pytorch/nlohmann" -I"/home/zongzesheng/code/origin/pytorch/aten/src/THC" -I"/home/zongzesheng/code/origin/pytorch/aten/src/ATen/cuda" -I"/home/zongzesheng/code/origin/pytorch/aten/src/ATen/../../../third_party/cutlass/include" -I"/home/zongzesheng/code/origin/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include" -I"/home/zongzesheng/code/origin/pytorch/build/caffe2/aten/src" -I"/home/zongzesheng/code/origin/pytorch/aten/src/ATen/.." -I"/home/zongzesheng/code/origin/pytorch/build/nccl/include" -I"/home/zongzesheng/code/origin/pytorch/c10/cuda/../.." -I"/home/zongzesheng/code/origin/pytorch/c10/.." -I"/usr/local/cuda/include" -I"/home/zongzesheng/code/origin/pytorch/third_party/tensorpipe" -I"/home/zongzesheng/code/origin/pytorch/build/third_party/tensorpipe" -I"/home/zongzesheng/code/origin/pytorch/third_party/tensorpipe/third_party/libnop/include" -I"/home/zongzesheng/code/origin/pytorch/torch/csrc/api" -I"/home/zongzesheng/code/origin/pytorch/torch/csrc/api/include" -isystem "/home/zongzesheng/code/origin/pytorch/build/third_party/gloo" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/gloo" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/googletest/googlemock/include" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/googletest/googletest/include" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/protobuf/src" -isystem "/home/zongzesheng/software/miniconda3/include" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/XNNPACK/include" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/ittapi/include" -isystem "/home/zongzesheng/code/origin/pytorch/cmake/../third_party/eigen" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/ideep/include" -isystem "/home/zongzesheng/code/origin/pytorch/INTERFACE" -isystem "/home/zongzesheng/code/origin/pytorch/third_party/nlohmann/include" -m64 "/tmp/tmpxft_000826c0_00000000-6_ActivationHardshrinkKernel.cudafe1.cpp" -o "caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/ActivationHardshrinkKernel.cu.o"
In file included from tmpxft_000826c0_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:1:
/tmp/tmpxft_000826c0_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:12:202: error: type/value mismatch at argument 3 in template parameter list for ‘template<bool IsMutable, bool HasFuncPtrConv, bool NeverThrows, class Tag, class OpFunc, class ... CapturedVarTypePack> struct __nv_hdl_wrapper_t’
   12 | dl_wrapper_t<false,false,__nv_dl_tag<void (*)(   ::at::TensorIteratorBase &, const  ::c10::Scalar &),(&    ::at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double> _ZZZZN67_INTERNAL_8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_5342252at6native69_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_53422517hardshrink_kernelERNS0_18TensorIteratorBaseERKN3c106ScalarEENKUlvE_clEvENKUlvE_clEvEUldE_;
      |                                                                                                                                                                                            ^

/tmp/tmpxft_000826c0_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:12:202: note:   expected a constant of type ‘bool’, got ‘__nv_dl_tag<void (*)(at::TensorIteratorBase&, const c10::Scalar&), at::native::_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_534225::hardshrink_kernel, 1>’
/tmp/tmpxft_000826c0_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:13:288: error: type/value mismatch at argument 3 in template parameter list for ‘template<bool IsMutable, bool HasFuncPtrConv, bool NeverThrows, class Tag, class OpFunc, class ... CapturedVarTypePack> struct __nv_hdl_wrapper_t’

Here’s part of /tmp/tmpxft_000826c0_00000000-6_ActivationHardshrinkKernel.cudafe1.stub.c:

  1 #pragma GCC diagnostic push
  2 #pragma GCC diagnostic ignored "-Wunused-function"
  3 #pragma GCC diagnostic ignored "-Wcast-qual"
  4 #define __NV_CUBIN_HANDLE_STORAGE__ static
  5 #if !defined(__CUDA_INCLUDE_COMPILER_INTERNAL_HEADERS__)
  6 #define __CUDA_INCLUDE_COMPILER_INTERNAL_HEADERS__
  7 #endif
  8 #include "crt/host_runtime.h"
  9 #include "tmpxft_000826c0_00000000-3_ActivationHardshrinkKernel.fatbin.c"
 10 typedef TrivialOffsetCalculator<(int)1, unsigned int>  _Z23TrivialOffsetCalculatorILi1EjE;
 11 typedef at::detail::Array<char *, (int)2>  _ZN2at6detail5ArrayIPcLi2EEE;
 12 typedef __nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)(   ::at::TensorIteratorBase &, const  ::c10::Scalar &),(&    ::at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double> _ZZZZN67_INTERNAL_8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_5370802at6native69_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_53708017hardshrink_kernelERNS0_18TensorIteratorBaseERKN3c106ScalarEENKUlvE_clEvENKUlvE_clEvEUldE_;
 13 typedef __nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double> &),(& :: at::native::gpu_kernel_impl< ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double>> ), 1> ,void (int),    ::OffsetCalculator<(int)2, unsigned int, (bool)0> , ::at::detail::Array<char *, (int)2> ,const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double>, ::at::detail::Array< ::c10::ScalarType, (int)2> > _ZZN2at6native15gpu_kernel_implIZZZNS0_69_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_53708017hardshrink_kernelERNS_18TensorIteratorBaseERKN3c106ScalarEENKUlvE_clEvENKUlvE_clEvEUldE_EEvS4_RKT_EUliE_;
 14 typedef __nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double> &),(& :: at::native::gpu_kernel_impl_nocast< ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double>> ), 1> ,void (int),    ::OffsetCalculator<(int)2, unsigned int, (bool)0> , ::at::detail::Array<char *, (int)2> ,const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,double (double),double>> _ZZN2at6native22gpu_kernel_impl_nocastIZZZNS0_69_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_53708017hardshrink_kernelERNS_18TensorIteratorBaseERKN3c106ScalarEENKUlvE_clEvENKUlvE_clEvEUldE_EEvS4_RKT_EUliE_;
 15 typedef __nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)(   ::at::TensorIteratorBase &, const  ::c10::Scalar &),(&    ::at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 2> ,float (float),float> _ZZZZN67_INTERNAL_8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_5370802at6native69_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_53708017hardshrink_kernelERNS0_18TensorIteratorBaseERKN3c106ScalarEENKUlvE_clEvENKUlvE0_clEvEUlfE_;
 16 typedef __nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 2> ,float (float),float> &),(& :: at::native::gpu_kernel_impl< ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 2> ,float (float),float>> ), 1> ,void (int),    ::OffsetCalculator<(int)2, unsigned int, (bool)0> , ::at::detail::Array<char *, (int)2> ,const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 2> ,float (float),float>, ::at::detail::Array< ::c10::ScalarType, (int)2> > _ZZN2at6native15gpu_kernel_implIZZZNS0_69_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_53708017hardshrink_kernelERNS_18TensorIteratorBaseERKN3c106ScalarEENKUlvE_clEvENKUlvE0_clEvEUlfE_EEvS4_RKT_EUliE_;
 17 typedef __nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 2> ,float (float),float> &),(& :: at::native::gpu_kernel_impl_nocast< ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 2> ,float (float),float>> ), 1> ,void (int),    ::OffsetCalculator<(int)2, unsigned int, (bool)0> , ::at::detail::Array<char *, (int)2> ,const  ::__nv_hdl_wrapper_t<false,false,__nv_dl_tag<void (*)( ::at::TensorIteratorBase &, const  ::c10::Scalar &),(& :: at::native::_NV_ANON_NAMESPACE::hardshrink_kernel), 1> ,float (float),float>> _ZZN2at6native22gpu_kernel_impl_nocastIZZZNS0_69_GLOBAL__N__8206546f_29_ActivationHardshrinkKernel_cu_841a3ccc_53708017hardshrink_kernelERNS_18TensorIteratorBaseERKN3c106ScalarEENKUlvE_clEvENKUlvE0_clEvEUlfE_EEvS4_RKT_EUliE_;
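
For completeness: the verbose log above defines __CUDACC_VER_MAJOR__=11 and __CUDACC_VER_MINOR__=5 and resolves its helper binaries under /usr/lib/nvidia-cuda-toolkit/bin, while nvcc --version earlier reported release 12.3, so more than one CUDA toolkit may be visible on this machine. I’m not sure whether that is related, but the installed compilers can be listed like this (paths taken from the logs above):

$ which -a nvcc
$ /usr/bin/nvcc --version
$ /usr/local/cuda/bin/nvcc --version
$ dpkg -l | grep -i cuda-toolkit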

I’m not able to reproduce the build issue in a clean Ubuntu 22.04 container using the same CUDA toolkit and PyTorch commit:

git status
HEAD detached at 4d11a9b783
nothing to commit, working tree clean
...
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
...
python setup.py develop
-- The CXX compiler identification is GNU 11.4.0
-- The C compiler identification is GNU 11.4.0
-- Detecting CXX compiler ABI info
...
-- Found CUDA: /usr/local/cuda (found version "12.3")
-- The CUDA compiler identification is NVIDIA 12.3.107
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.3.107")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Caffe2: CUDA detected: 12.3
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 12.3
...
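
Roughly what the container setup looked like (a sketch; the image tag is an assumption, any CUDA 12.3 devel image on Ubuntu 22.04 should be equivalent):

$ docker run --gpus all -it nvidia/cuda:12.3.2-devel-ubuntu22.04 bash
# inside the container:
$ apt-get update && apt-get install -y git python3 python3-pip cmake ninja-build
$ git clone --recursive https://github.com/pytorch/pytorch && cd pytorch
$ git checkout 4d11a9b783 && git submodule update --init --recursive
$ pip3 install -r requirements.txt
$ python3 setup.py develop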

I also think it’s not a PyTorch problem; I just don’t have any clue how to fix it. Maybe I’ll reinstall the VM environment and try again. Thanks for your help!

Which gcc version are you using? I guess yours might be too new or too old (I’m using 11.4.0).
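
You can check with something like:

$ gcc --version
$ g++ --version
$ update-alternatives --display gcc    # in case multiple gcc versions are managed via alternatives
$ echo $CC $CXX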

I’m not sure whether somebody changed the gcc version, since it’s a dev environment used by multiple developers. I’ll update with results next Monday. Have a nice weekend, thanks! :smiley:

The problem was resolved after reinstalling the environment, but I can’t track what had been changed in the environment. Thanks for your time!