Lazy tensor branch (lazy_tensor_staging) build fails with FAILED: caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/cuda/

Hi everyone,
I am getting errors when compiling the lazy_tensor_staging branch, and after a few days of trying I still have not been able to solve them. The build fails with:
FAILED: caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/cuda/xxxxx
and
cc1plus: some warnings being treated as errors
However, compiling the master branch with the same environment variables and other settings succeeds.
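In case it is useful, the workaround I plan to try is rebuilding with -Wno-error appended to the compiler flags so the warnings are no longer fatal. This is only a sketch, assuming the lazy_tensor_staging build honors CFLAGS/CXXFLAGS the same way master does:

import os
import subprocess

# Sketch: append -Wno-error so GCC 7.5 warnings stop aborting the build.
# Assumption: setup.py forwards CFLAGS/CXXFLAGS to CMake as on master.
env = os.environ.copy()
env["CFLAGS"] = (env.get("CFLAGS", "") + " -Wno-error").strip()
env["CXXFLAGS"] = (env.get("CXXFLAGS", "") + " -Wno-error").strip()
subprocess.run(["python", "setup.py", "develop"], check=True, env=env)

This would only mask the failure, though; whatever warning the branch triggers under GCC 7.5 is probably still worth fixing.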
My CMake configuration summary is as follows:

-- ******** Summary ********
--   CMake version             : 3.22.1
--   CMake command             : /mnt/lustre/zhangyuchang.vendor/.conda/envs/torch_xla/bin/cmake
--   System                    : Linux
--   C++ compiler              : /mnt/lustre/share/gcc/gcc-7.5/bin/g++
--   C++ compiler version      : 7.5.0
--   CXX flags                 :  -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -Wnon-virtual-dtor
--   Build type                : Release
--   Compile definitions       : ONNX_ML=1;ONNXIFI_ENABLE_EXT=1;__STDC_FORMAT_MACROS
--   CMAKE_PREFIX_PATH         : /mnt/lustre/zhangyuchang.vendor/.conda/envs/torch_xla/lib/python3.7/site-packages;/mnt/lustre/zhangyuchang.vendor/.conda/envs/torch_xla;/mnt/cache/share/platform/dep/cuda11.0-cudnn8.0
--   CMAKE_INSTALL_PREFIX      : /mnt/cache/zhangyuchang.vendor/newtest/pytorch/torch
--   CMAKE_MODULE_PATH         : /mnt/cache/zhangyuchang.vendor/newtest/pytorch/cmake/Modules;/mnt/cache/zhangyuchang.vendor/newtest/pytorch/cmake/public/../Modules_CUDA_fix
-- 
--   ONNX version              : 1.11.0
--   ONNX NAMESPACE            : onnx_torch
--   ONNX_USE_LITE_PROTO       : OFF
--   USE_PROTOBUF_SHARED_LIBS  : OFF
--   Protobuf_USE_STATIC_LIBS  : ON
--   ONNX_DISABLE_EXCEPTIONS   : OFF
--   ONNX_WERROR               : OFF
--   ONNX_BUILD_TESTS          : OFF
--   ONNX_BUILD_BENCHMARKS     : OFF
--   ONNXIFI_DUMMY_BACKEND     : OFF
--   ONNXIFI_ENABLE_EXT        : OFF
-- 
--   Protobuf compiler         : 
--   Protobuf includes         : 
--   Protobuf libraries        : 
--   BUILD_ONNX_PYTHON         : OFF
-- 
-- ******** Summary ********
--   CMake version         : 3.22.1
--   CMake command         : /mnt/lustre/zhangyuchang.vendor/.conda/envs/torch_xla/bin/cmake
--   System                : Linux
--   C++ compiler          : /mnt/lustre/share/gcc/gcc-7.5/bin/g++
--   C++ compiler version  : 7.5.0
--   CXX flags             :  -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -Wnon-virtual-dtor
--   Build type            : Release
--   Compile definitions   : ONNX_ML=1;ONNXIFI_ENABLE_EXT=1
--   CMAKE_PREFIX_PATH     : /mnt/lustre/zhangyuchang.vendor/.conda/envs/torch_xla/lib/python3.7/site-packages;/mnt/lustre/zhangyuchang.vendor/.conda/envs/torch_xla;/mnt/cache/share/platform/dep/cuda11.0-cudnn8.0
--   CMAKE_INSTALL_PREFIX  : /mnt/cache/zhangyuchang.vendor/newtest/pytorch/torch
--   CMAKE_MODULE_PATH     : /mnt/cache/zhangyuchang.vendor/newtest/pytorch/cmake/Modules;/mnt/cache/zhangyuchang.vendor/newtest/pytorch/cmake/public/../Modules_CUDA_fix
-- 
--   ONNX version          : 1.4.1
--   ONNX NAMESPACE        : onnx_torch
--   ONNX_BUILD_TESTS      : OFF
--   ONNX_BUILD_BENCHMARKS : OFF
--   ONNX_USE_LITE_PROTO   : OFF
--   ONNXIFI_DUMMY_BACKEND : OFF
-- 
--   Protobuf compiler     : 
--   Protobuf includes     : 
--   Protobuf libraries    : 
--   BUILD_ONNX_PYTHON     : OFF
-- Found CUDA with FP16 support, compiling with torch.cuda.HalfTensor
-- Adding -DNDEBUG to compile flags
-- Compiling with MAGMA support
-- MAGMA INCLUDE DIRECTORIES: /mnt/lustre/zhangyuchang.vendor/.conda/envs/torch_xla/include
-- MAGMA LIBRARIES: /mnt/lustre/zhangyuchang.vendor/.conda/envs/torch_xla/lib/libmagma.a
-- MAGMA V2 check: 0
-- Could not find hardware support for NEON on this machine.
-- No OMAP3 processor on this machine.
-- No OMAP4 processor on this machine.
-- Found a library with LAPACK API (mkl).
disabling ROCM because NOT USE_ROCM is set
-- MIOpen not found. Compiling without MIOpen support
disabling MKLDNN because USE_MKLDNN is not set
-- Version: 7.0.3
-- Build type: Release
-- CXX_STANDARD: 14
-- Required features: cxx_variadic_templates
-- Using Kineto with CUPTI support
-- Configuring Kineto dependency:
--   KINETO_SOURCE_DIR = /mnt/cache/zhangyuchang.vendor/newtest/pytorch/third_party/kineto/libkineto
--   KINETO_BUILD_TESTS = OFF
--   KINETO_LIBRARY_TYPE = static
--   CUDA_SOURCE_DIR = /mnt/cache/share/platform/dep/cuda11.0-cudnn8.0
--   CUDA_INCLUDE_DIRS = /mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/include
--   CUPTI_INCLUDE_DIR = /mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/extras/CUPTI/include
--   CUDA_cupti_LIBRARY = /mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/extras/CUPTI/lib64/libcupti.so
-- Found CUPTI
INFO ROCM_SOURCE_DIR = 
-- Kineto: FMT_SOURCE_DIR = /mnt/cache/zhangyuchang.vendor/newtest/pytorch/third_party/fmt
-- Kineto: FMT_INCLUDE_DIR = /mnt/cache/zhangyuchang.vendor/newtest/pytorch/third_party/fmt/include
INFO CUPTI_INCLUDE_DIR = /mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/extras/CUPTI/include
INFO ROCTRACER_INCLUDE_DIR = /roctracer/include
-- Configured Kineto
-- GCC 7.5.0: Adding gcc and gcc_s libs to link line
-- NUMA paths:
-- /usr/include
-- /usr/lib64/libnuma.so
-- headers outputs: 
-- sources outputs: 
-- declarations_yaml outputs: 
-- Using ATen parallel backend: OMP

At the moment my guess is that either the CUDA version is too high or there is a problem in the code on this branch. Any comments would be appreciated.
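In the meantime I will try to extract the exact diagnostic that is being promoted to an error. A minimal sketch for pulling it out of a captured build log (assuming the output was saved with python setup.py develop 2>&1 | tee build.log; build.log is just a placeholder name):

# Sketch: find the first warning that GCC promoted to an error.
# With -Werror, GCC reports these as "error: ... [-Werror=<flag>]".
with open("build.log") as f:  # hypothetical captured build output
    for line in f:
        if "[-Werror" in line:
            print(line.rstrip())
            break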