Recently I’ve been trying to wrap my CUDA kernel with a PyTorch C extension. Most examples I can find online use THCudaTensor* in the C wrapper function, but as far as I can tell that tensor type holds float data. My CUDA kernel takes uint8 inputs and produces int16 output. Should I still use THCudaTensor* as the interface and cast the data pointers to uint8/int16 internally?
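I did notice the THC headers also generate typed variants such as THCudaByteTensor and THCudaShortTensor. Is a typed signature like the sketch below the intended approach instead? (This is just my guess at what it would look like, assuming the typed _data accessors behave like the float one.)

#include <THC/THC.h>
#include <stdint.h>

// Sketch only (my assumption, not code I have verified):
// THCudaByteTensor stores uint8, THCudaShortTensor stores int16.
void c_wrapper_typed(THCState* state,
                     THCudaByteTensor* input1,
                     THCudaByteTensor* input2,
                     THCudaShortTensor* output)
{
    // THCudaByteTensor_data returns unsigned char*, which matches uint8_t
    uint8_t* input1_ = THCudaByteTensor_data(state, input1);
    uint8_t* input2_ = THCudaByteTensor_data(state, input2);
    // THCudaShortTensor_data returns short*, which matches int16_t
    int16_t* output_ = THCudaShortTensor_data(state, output);
    // ... launch the kernels with these typed pointers, no casting needed ...
}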
Also, inside the wrapper function I want to call the CUDA kernels several times, and each call produces an intermediate tensor as output. Should I pass these temporary buffers in from outside, or can I allocate them inside the wrapper? (I've put a sketch of what I mean below the code.)
My CUDA kernels:
__global__ void my_cuda_kernel(uint8_t* input1, uint8_t* input2, int16_t* output)
{
    // do something
}

__global__ void my_cuda_kernel_2(uint8_t* input1, uint8_t* input2, int16_t* input3, int16_t* output)
{
    // do something
}
void c_wrapper(THCState* state, THCudaTensor* input1, THCudaTensor* input2, THCudaTensor* output)
{
    uint8_t* input1_ = (uint8_t*)THCudaTensor_data(state, input1); // THCudaTensor_data returns float* -- is a raw cast like this correct?
    uint8_t* input2_ = (uint8_t*)THCudaTensor_data(state, input2);
    int16_t* output_ = (int16_t*)THCudaTensor_data(state, output); // same question here
    int16_t* temporary_out = xxxx; // how can I allocate memory from within?
    my_cuda_kernel<<<32, 32, 0, THCState_getCurrentStream(state)>>>(input1_, input2_, temporary_out);
    my_cuda_kernel_2<<<32, 32, 0, THCState_getCurrentStream(state)>>>(input1_, input2_, temporary_out, output_);
}
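For the temporary, this is roughly what I had in mind, assuming THCudaMalloc/THCudaFree are the right THC helpers for scratch memory and that the temporary has the same element count as input1 (both are guesses on my part):

// Sketch of allocating the temporary inside the wrapper:
int16_t* temporary_out = NULL;
ptrdiff_t n = THCudaTensor_nElement(state, input1); // guessed size: same as input1
THCudaMalloc(state, (void**)&temporary_out, n * sizeof(int16_t));

my_cuda_kernel<<<32, 32, 0, THCState_getCurrentStream(state)>>>(input1_, input2_, temporary_out);
my_cuda_kernel_2<<<32, 32, 0, THCState_getCurrentStream(state)>>>(input1_, input2_, temporary_out, output_);

THCudaFree(state, temporary_out); // both kernels run on the same stream, so I assume ordering is safe?

Or would it be better to create an actual THCudaShortTensor for the temporary, so PyTorch's caching allocator manages the memory instead?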