Using GPU or CPU kernel in C++ Extensions


I am currently working on porting parts of my functionality to C++.

In one of my functions I am iterating over an input tensor to produce two new output tensors.
For this I am using TensorIterator.
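To illustrate the pattern I mean, here is a minimal sketch in plain C++ without any ATen dependency. The helper apply_multiple_outputs and the operation split_by_ten are hypothetical names made up for this post. This is not how TensorIterator is implemented, only the per-element contract as I understand it: the lambda receives one input value and returns a tuple with one value per output.

```cpp
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-in for a multiple-output elementwise kernel: calls `op`
// once per input element and writes the two returned values into two outputs.
template <typename Op>
std::pair<std::vector<int64_t>, std::vector<int64_t>>
apply_multiple_outputs(const std::vector<int64_t>& input, Op op) {
    std::vector<int64_t> out0(input.size());
    std::vector<int64_t> out1(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        // Unpack the tuple returned by the lambda into the two output slots.
        std::tie(out0[i], out1[i]) = op(input[i]);
    }
    return {std::move(out0), std::move(out1)};
}

// Illustrative operation only (an assumption, not my real logic):
// quotient and remainder of a division by 10.
inline std::tuple<int64_t, int64_t> split_by_ten(int64_t a) {
    return std::make_tuple(a / 10, a % 10);
}
```

Calling apply_multiple_outputs(values, split_by_ten) on a vector of values returns the quotients in the first vector and the remainders in the second, which mirrors the shape of what I want from the real kernel.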

For the iteration I am using cpu_kernel_multiple_outputs, which works fine.
Now I am trying to add the logic for running on the GPU, for which I am using gpu_kernel_multiple_outputs.

Here is a small code snippet:

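// Check whether to run on CPU or GPU, based on the input tensor's device
if (self.device().type() == at::kCUDA) {
        // GPU path: the lambda needs GPU_LAMBDA so it can run on the device
        gpu_kernel_multiple_outputs(
                iter, [=] GPU_LAMBDA (int64_t a) -> std::tuple<int64_t, int64_t> { ... });
} else {
        // CPU path: a plain host lambda
        cpu_kernel_multiple_outputs(
                iter, [=](int64_t a) -> std::tuple<int64_t, int64_t> { ... });
}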

To make this run I had to include #include <ATen/native/cpu/Loops.h> for the CPU kernel
and #include <ATen/native/cuda/Loops.h> for the GPU kernel.

First of all, is this the right approach or am I including internals I should not be including?

Second, including the GPU kernel header does not work because the <thrust/tuple.h> header is missing, which looks like an external library. Do I have to add it manually to build the extension?

For completeness, here are all the includes I am using:

#include <ATen/ATen.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cpu/Loops.h>
// #include <ATen/native/cuda/Loops.h>

#include <torch/extension.h>
#include <tuple>
#include <vector>

I am not a C++ developer, so please excuse any mistakes.

With best regards,