Hello,
I am currently porting part of my functionality to C++.
In one of my functions I iterate over an input tensor to produce two new output tensors. For this I am using TensorIterator.
For the iteration on CPU I am using cpu_kernel_multiple_outputs, which works fine.
Now I am trying to add the logic for running on GPU, so I am using gpu_kernel_multiple_outputs.
Here is a little code snippet:
// Check whether we are running on CPU or GPU, based on the input tensor
if (self.device().type() == at::kCUDA) {
  gpu_kernel_multiple_outputs(
    iter, [=] GPU_LAMBDA (int64_t a) -> std::tuple<int64_t, int64_t> { ... }
  );
} else {
  cpu_kernel_multiple_outputs(
    iter, [=](int64_t a) -> std::tuple<int64_t, int64_t> { ... }
  );
}
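For context, this is roughly how I set up the iterator with one input and two outputs. This is only a sketch; the tensor names are placeholders and the exact TensorIteratorConfig options may differ from what your op needs:

```cpp
// Sketch: a TensorIterator with one input and two outputs.
// out1/out2 are illustrative names; here they are pre-allocated,
// but the iterator can also allocate undefined outputs itself.
at::Tensor out1 = at::empty_like(self);
at::Tensor out2 = at::empty_like(self);

auto iter = at::TensorIteratorConfig()
    .add_output(out1)
    .add_output(out2)
    .add_input(self)
    .build();
```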
To make this compile I had to include #include <ATen/native/cpu/Loops.h> for the CPU kernel and #include <ATen/native/cuda/Loops.h> for the GPU kernel.
First of all, is this the right approach or am I including internals I should not be including?
Second, the include of the GPU kernel does not work, due to the missing <thrust/tuple.h> header, which looks like an external library. Do I have to include it manually to build the extension?
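From what I understand, thrust ships with the CUDA toolkit rather than being a separate install, and <ATen/native/cuda/Loops.h> has to be compiled by nvcc, i.e. it belongs in a .cu source file rather than a plain .cpp one. A minimal build sketch using CUDAExtension, where the file names are placeholders and not my actual files:

```python
# setup.py -- sketch for building a mixed CPU/CUDA extension.
# my_ext.cpp holds the CPU path, my_ext_cuda.cu the GPU kernel;
# CUDAExtension routes the .cu file through nvcc, which finds thrust.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="my_ext",
    ext_modules=[
        CUDAExtension(
            name="my_ext",
            sources=["my_ext.cpp", "my_ext_cuda.cu"],
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)
```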
To be sure, here are all includes I am using:
#include <ATen/ATen.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cpu/Loops.h>
// #include <ATen/native/cuda/Loops.h>
#include <torch/extension.h>
#include <tuple>
#include <vector>
I am not a C++ developer, so please excuse me.
With best regards,
Patrick