How to write a custom point-wise CUDA kernel?

I want to write a custom point-wise CUDA kernel, and I can do that easily with cpp_extension and a custom CUDA kernel launcher. The problem is that I don't know how to set a good block_size.
As far as I can tell, CUDA_tensor_apply2 handles those parameters and simplifies my work, but I need CUDA_tensor_apply3, which has been removed from ATen.
It seems like I should use TensorIterator with gpu_kernel, but I cannot include it in my code. Should I include <ATen/native/CUDALoops.cuh>? If so, I can't: there is no such file in my conda installation (I'm using torch==1.5 with cudatoolkit==10.2).
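For reference, here is a rough sketch of the kind of launcher I mean. The op itself, the names (pointwise_forward, pointwise_kernel), and the block size of 256 are all placeholders; the block size is exactly the parameter I don't know how to pick well:

```cpp
// pointwise_op.cu -- illustrative only; built with torch.utils.cpp_extension
#include <torch/extension.h>

// Simple grid-stride point-wise kernel over two inputs and one output.
template <typename scalar_t>
__global__ void pointwise_kernel(const scalar_t* a, const scalar_t* b,
                                 scalar_t* out, int64_t n) {
  for (int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += (int64_t)gridDim.x * blockDim.x) {
    out[i] = a[i] * a[i] + b[i];  // placeholder point-wise op
  }
}

torch::Tensor pointwise_forward(torch::Tensor a, torch::Tensor b) {
  // Assumes contiguous CUDA tensors of the same shape and dtype.
  auto out = torch::empty_like(a);
  const int64_t n = a.numel();
  const int block_size = 256;  // <-- the parameter I don't know how to choose
  const int grid_size = static_cast<int>((n + block_size - 1) / block_size);
  AT_DISPATCH_FLOATING_TYPES(a.scalar_type(), "pointwise_forward", [&] {
    pointwise_kernel<scalar_t><<<grid_size, block_size>>>(
        a.data_ptr<scalar_t>(), b.data_ptr<scalar_t>(),
        out.data_ptr<scalar_t>(), n);
  });
  return out;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("forward", &pointwise_forward, "point-wise forward (CUDA)");
}
```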

@Separius
CUDALoops.cuh is not exposed.
TensorIterator is exposed. To use it, you can do
#include <ATen/native/TensorIterator.h>
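To make that concrete, here is a rough sketch of how a three-tensor point-wise op is typically written inside ATen's own .cu files. The function name and the op are made up, the iterator-building API has changed across releases (newer versions use TensorIteratorConfig), and gpu_kernel itself comes from the non-exposed Loops.cuh/CUDALoops.cuh, so treat this as an illustration of the pattern rather than something you can compile against a binary install:

```cpp
// Rough sketch of the in-tree pattern; gpu_kernel comes from
// ATen/native/cuda/Loops.cuh, which is not shipped with the binary packages.
#include <ATen/ATen.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cuda/Loops.cuh>  // only available when building inside the source tree

void my_pointwise3_cuda(at::Tensor& out, const at::Tensor& a, const at::Tensor& b) {
  at::TensorIterator iter;
  iter.add_output(out);   // one output...
  iter.add_input(a);      // ...and two inputs, i.e. the apply3-style case
  iter.add_input(b);
  iter.build();           // handles broadcasting, type checks, memory layout
  AT_DISPATCH_FLOATING_TYPES(iter.dtype(), "my_pointwise3_cuda", [&] {
    // gpu_kernel picks the thread/block configuration for you.
    at::native::gpu_kernel(iter, [] GPU_LAMBDA(scalar_t x, scalar_t y) -> scalar_t {
      return x * x + y;  // placeholder point-wise op
    });
  });
}
```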

For the elementwise CUDA kernel block setup, you can take a look here; we have the basic thread and block setup there.

And within CUDALoops.cuh, you can check gpu_kernel_impl() to see how we assign thread and block sizes for the different cases. Basically, for a 1D tensor we set the thread count to 512 and each thread handles 1 item; for a multi-dimensional tensor we set the thread count to 128, but each thread handles 4 items.
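If you want to mirror that sizing in your own launcher, a sketch could look like the following. The kernel names and the placeholder op are mine, and this only copies the 512-threads/1-item vs. 128-threads/4-items choice; gpu_kernel_impl additionally handles strided multi-dimensional indexing through offset calculators, which is not shown here:

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Constants mirroring the values used by gpu_kernel_impl (names are mine).
constexpr int kBlock1d = 512;       // contiguous/1D case: 1 element per thread
constexpr int kBlockNd = 128;       // multi-dim case: 128 threads ...
constexpr int kWorkPerThread = 4;   // ... each handling 4 elements

template <typename scalar_t>
__global__ void pointwise_1d_kernel(const scalar_t* a, const scalar_t* b,
                                    scalar_t* out, int64_t n) {
  int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = a[i] * a[i] + b[i];  // placeholder op
}

template <typename scalar_t>
__global__ void pointwise_nd_kernel(const scalar_t* a, const scalar_t* b,
                                    scalar_t* out, int64_t n) {
  // Each block covers blockDim.x * kWorkPerThread elements, coalesced per pass.
  int64_t base = (int64_t)blockIdx.x * blockDim.x * kWorkPerThread + threadIdx.x;
  #pragma unroll
  for (int j = 0; j < kWorkPerThread; ++j) {
    int64_t i = base + (int64_t)j * blockDim.x;
    if (i < n) out[i] = a[i] * a[i] + b[i];  // placeholder op
  }
}

template <typename scalar_t>
void launch_pointwise(const scalar_t* a, const scalar_t* b, scalar_t* out,
                      int64_t n, bool contiguous_1d, cudaStream_t stream) {
  if (contiguous_1d) {
    // 512 threads, one element per thread
    const unsigned int grid =
        static_cast<unsigned int>((n + kBlock1d - 1) / kBlock1d);
    pointwise_1d_kernel<scalar_t><<<grid, kBlock1d, 0, stream>>>(a, b, out, n);
  } else {
    // 128 threads, four elements per thread
    const int64_t per_block = (int64_t)kBlockNd * kWorkPerThread;
    const unsigned int grid =
        static_cast<unsigned int>((n + per_block - 1) / per_block);
    pointwise_nd_kernel<scalar_t><<<grid, kBlockNd, 0, stream>>>(a, b, out, n);
  }
}
```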
