Working example of CUDA accessor

I would like to know how to integrate the CUDA accessor example given here

in a CMake project. Just copy-pasting the code into a cpp file does not compile.

I don’t know what exactly is failing as you didn’t give any information about the issues, but maybe the right includes are missing?

Just pasting the code into a .cpp file and compiling it with Clang or GCC does not work because of the __global__ keyword. I assume that I need to place it in a separate .cu file and compile it with nvcc into a library. How is this done for the CUDA kernels in the C++ API's libraries?

You might need to add CUDA as a language to your CMakeLists.txt e.g. via:

project(custom_ops LANGUAGES CXX CUDA)

Also make sure the file extension is .cu for your custom CUDA code.

I am still trying to reverse-engineer how Torch does it internally. Assume that I have a C++ application written in the file app.cxx. Depending on whether the tensors are on the CPU or the GPU, I need to dispatch between a CPU accessor and a GPU accessor variant, e.g.

auto a = torch::zeros({10});
if (a.is_cuda()) {
  // GPU path: hand a packed accessor to the CUDA kernel
  auto a_accessor = a.packed_accessor64<float,1>();
  packed_accessor_kernel<<<1, 10>>>(a_accessor); // does not compile with g++!!!
} else {
  // CPU path: plain accessor, filled in a host loop
  auto a_accessor = a.accessor<float,1>();
  for (int64_t i=0; i<10; ++i)
    a_accessor[i] = i;
}

The corresponding CUDA kernel would read as follows:

__global__ void packed_accessor_kernel(
    torch::PackedTensorAccessor64<float, 1> a) {
  int i = threadIdx.x;  // one thread per element
  a[i] = i;
}

I could put the CUDA kernel into a .cu file and put the entire dispatch routine into a library that I link to my application. But this would introduce some overhead as I would have to call the dispatch routine instead of implementing it inline (so far all my code is header-only). Any help on how this is done efficiently in Torch is appreciated.
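One way to keep the dispatch header-only is to declare only a plain C++ launcher function in the header and to define it in the .cu file, so that the kernel and the <<<...>>> launch are never seen by g++. The sketch below illustrates that split without LibTorch: Tensor1D, fill_iota, launch_fill_iota_cuda and cuda_launch_count are hypothetical names, and the launcher body is a host-side placeholder for what would be a kernel launch in a real .cu file. It is not a description of how Torch itself is organized.

```cpp
// Sketch only: Tensor1D, fill_iota and launch_fill_iota_cuda are hypothetical
// stand-ins for illustration, not LibTorch types or functions.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for a 1-D float tensor; on_gpu plays the role of Tensor::is_cuda().
struct Tensor1D {
  std::vector<float> data;
  bool on_gpu;
};

// Declaration only. In a real project this would be defined in a .cu file
// compiled by nvcc; the header never contains __global__ or <<<...>>>.
void launch_fill_iota_cuda(Tensor1D& t);

// Header-only dispatch: the CPU path stays inline, the GPU path is a single
// call into the separately compiled launcher.
inline void fill_iota(Tensor1D& t) {
  if (t.on_gpu) {
    launch_fill_iota_cuda(t);
  } else {
    for (std::size_t i = 0; i < t.data.size(); ++i)
      t.data[i] = static_cast<float>(i);
  }
}

// Placeholder definition so the sketch links without nvcc. In a real build
// this body would live in the .cu file and launch the kernel instead of
// looping on the host.
static int cuda_launch_count = 0;
void launch_fill_iota_cuda(Tensor1D& t) {
  ++cuda_launch_count;  // records that the GPU branch was taken
  for (std::size_t i = 0; i < t.data.size(); ++i)
    t.data[i] = static_cast<float>(i);
}
```

With this layout the only non-inline part is one ordinary function call per kernel launch, which is likely to be small next to the cost of the launch itself; the rest of the application code can remain header-only.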