I want to write a custom point-wise CUDA kernel, which I can do easily with cpp_extension and a custom CUDA kernel launcher. The problem is that I don't know how to choose a good block_size.
As far as I can tell, CUDA_tensor_apply2 handles those parameters and would simplify my work, but I need CUDA_tensor_apply3, which has been removed from ATen.
It seems like I should use TensorIterator with gpu_kernel, but I cannot include it in my code. Should I include <ATen/native/CUDALoops.cuh>? If so, I can't: there is no such file in my conda installation (I'm using torch==1.5 with cudatoolkit==10.2).
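For reference, when hand-launching a point-wise kernel the usual heuristic is a fixed power-of-two block size plus a ceiling-division grid size. A minimal host-side sketch (names and the 256-thread default are my own assumptions, not anything from ATen):

```cpp
#include <cstdint>

// Hypothetical helper: fixed power-of-two block size; 256 threads
// is a safe default on most GPUs, tune per device if needed.
constexpr int64_t kBlockSize = 256;

int64_t grid_size(int64_t numel) {
  // Ceiling division so every element is covered by exactly one thread.
  return (numel + kBlockSize - 1) / kBlockSize;
}

// The kernel body would then use the usual guarded indexing:
//   int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
//   if (i < numel) out[i] = f(a[i], b[i], c[i]);
// launched as my_kernel<<<grid_size(numel), kBlockSize>>>(...).
```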
@Separius
CUDALoops.cuh is not exposed.
TensorIterator is exposed. To use it, you can do
#include <ATen/native/TensorIterator.h>
For elementwise CUDA kernel block setup, you can take a look here; we have the basic thread and block setup. And within CUDALoops.cuh, you can check gpu_kernel_impl() to see how we assign thread and block sizes for the different cases. Basically, for a 1-D tensor we set the thread count to 512 and each thread handles one item; for a multi-dimensional tensor we set the thread count to 128, but each thread handles four items.
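The policy above can be sketched as a small host-side helper (a simplification of what gpu_kernel_impl() does; the struct and function names here are mine, not ATen's):

```cpp
#include <cstdint>

// Launch configuration for a point-wise kernel, following the policy
// described above: 1-D contiguous -> 512 threads, 1 element per thread;
// multi-dimensional -> 128 threads, 4 elements per thread.
struct LaunchConfig {
  int64_t threads;
  int64_t items_per_thread;
  int64_t blocks;
};

LaunchConfig pointwise_config(int64_t numel, bool is_1d) {
  const int64_t threads = is_1d ? 512 : 128;
  const int64_t items   = is_1d ? 1 : 4;
  const int64_t per_block = threads * items;
  // Ceiling division so the final (partial) block covers the tail.
  const int64_t blocks = (numel + per_block - 1) / per_block;
  return {threads, items, blocks};
}
```

In both cases each block covers 512 elements; the difference is that the multi-dimensional path trades threads for per-thread work, which amortizes the more expensive strided index computation.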