How to write a custom point-wise CUDA kernel?

I want to write a custom point-wise CUDA kernel, and I can do that easily with cpp_extension and a custom CUDA kernel launcher. The problem is that I don’t know how to choose a good block_size.
As far as I can tell, CUDA_tensor_apply2 handles those parameters and simplifies my work, but I need CUDA_tensor_apply3, which has been removed from ATen.
It seems like I should use TensorIterator with gpu_kernel, but I cannot include it in my code. Should I include <ATen/native/CUDALoops.cuh>? If so, I can’t: there is no such file in my conda installation (I’m using torch==1.5 with cudatoolkit==10.2).

CUDALoops.cuh is not exposed.
TensorIterator is exposed. To use it, you can do
#include <ATen/native/TensorIterator.h>

For the elementwise CUDA kernel block setup, you can take a look here; we have the basic thread and block setup there.

And within CUDALoops.cuh, you can check gpu_kernel_impl() to see how we assign thread and block sizes for different cases. Basically, for a 1-d tensor we set the thread count to 512 and each thread handles 1 item; for a multi-dimensional tensor we set the thread count to 128, but each thread handles 4 items.