Error when compiling a CUDA program as a PyTorch extension

I hit this error while building a CUDA file as a PyTorch extension, together with a series of CUB-related errors. Compiling the same file directly with nvcc succeeds, so I'd like to know how to resolve the extension build.
Command that compiles without error: nvcc -o main xxx.cu

setup.py

from setuptools import setup, find_packages
from torch.utils.cpp_extension import BuildExtension, CUDAExtension, CppExtension

setup(
    name="xxx",
    include_dirs=["."],
    ext_modules=[
        CUDAExtension(
            "xxx",
            sources=[
                "xxx.cu", "xxx1.cu", "xxx2.cu", "xxx3.cpp",
            ],
            extra_compile_args={
                'cxx': ['-std=c++14', '-g',
                        '-fPIC',
                        '-Ofast',
                        '-DSXN_REVISED',
                        '-Wall', '-fopenmp', '-march=native'],
                'nvcc': ['-std=c++14',
                         '-g',
                         '-DSXN_REVISED',
                         '--compiler-options', "'-fPIC'",
                         ],
            },
        )
    ],
    cmdclass={
        "build_ext": BuildExtension,
    },
)
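One difference between the two builds that might matter is the GPU architecture flags: a bare nvcc invocation uses its default -arch, while BuildExtension generates its own -gencode list. As an experiment (the compute capability 7.0 below is only an example, not something from my setup), the nvcc argument list could be pinned to one explicit architecture:

```python
# Sketch: pinning an explicit GPU architecture in the extension build.
# The compute capability (7.0) is a placeholder; it should match the target GPU.
nvcc_args = [
    '-std=c++14',
    '-g',
    '-DSXN_REVISED',
    '--compiler-options', "'-fPIC'",
    # Hypothetical addition: force a single, explicit architecture so the
    # extension build targets the same hardware as the direct nvcc build.
    '-gencode=arch=compute_70,code=sm_70',
]
```

This list would replace the 'nvcc' entry in extra_compile_args above.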

Relevant file (excerpt)

#include <iostream>
#include <curand_kernel.h>
#include <vector>
#include <chrono>
#include <numeric>
#include <fstream>
#include <algorithm>
#include <map>
#include <sstream>
#include <cassert>
#include <cuda_runtime.h>
#include <stdint.h>
#include <cub/cub.cuh>
#include <torch/extension.h>

// Wrapper needed because CUDA's atomicCAS has no int64_t overload; the value
// is reinterpreted as unsigned long long int, which has the same width.
inline __device__ int64_t
AtomicCAS(int64_t* const address, const int64_t compare, const int64_t val) {
  using Type = unsigned long long int;  // NOLINT

  static_assert(sizeof(Type) == sizeof(*address), "Type width must match");

  return atomicCAS(
      reinterpret_cast<Type*>(address), static_cast<Type>(compare),
      static_cast<Type>(val));
}

The key part of the error message is: error: ‘atomicCAS’ was not declared in this scope; did you mean ‘AtomicCAS’?