How to pass non-float tensor types to a PyTorch CUDA extension

Recently I’ve been trying to wrap my CUDA kernel in a PyTorch C extension. Most examples I can find online use THCudaTensor* in the C wrapper function, but as far as I can tell that tensor type is float. My CUDA kernel takes uint8 inputs and produces int16 output. Should I still use THCudaTensor* as the interface and cast the data to uint8/int16 internally?

Also, in the wrapper function I want to call the CUDA kernel several times, and each call produces intermediate tensors as output. Should I pass the temporary output in from outside, or can I allocate it inside the wrapper?

My CUDA kernels:

#include <stdint.h>

__global__ void my_cuda_kernel(uint8_t* input1, uint8_t* input2, int16_t* output)
{
    // do something
}

__global__ void my_cuda_kernel_2(uint8_t* input1, uint8_t* input2, int16_t* input3, int16_t* output)
{
    // do something
}

void c_wrapper(THCState* state, THCudaTensor* input1, THCudaTensor* input2, THCudaTensor* output)
{
    uint8_t* input1_ = THCudaTensor_data(state, input1); // how to cast?
    uint8_t* input2_ = THCudaTensor_data(state, input2);
    int16_t* output_ = THCudaTensor_data(state, output); // how to cast?
    int16_t* temporary_out = xxxx; // how can I allocate memory from within?

    my_cuda_kernel<<<32, 32, 0, THCState_getCurrentStream(state)>>>(input1_, input2_, temporary_out);
    my_cuda_kernel_2<<<32, 32, 0, THCState_getCurrentStream(state)>>>(input1_, input2_, temporary_out, output_);
}
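
For reference, here is a minimal sketch of what the typed interface might look like, assuming the THC API of that era: THC generates one tensor struct per dtype, so THCudaByteTensor should carry uint8 data and THCudaShortTensor int16 data, each with its own _data accessor, and THCudaMalloc/THCudaFree should cover the scratch allocation inside the wrapper. The grid/block sizes are copied from the question, and sizing the temporary off the output tensor is a placeholder assumption:

#include <THC/THC.h>
#include <stdint.h>

// Sketch only: typed THC tensors instead of the float THCudaTensor.
void c_wrapper(THCState* state, THCudaByteTensor* input1,
               THCudaByteTensor* input2, THCudaShortTensor* output)
{
    // The typed accessors already return pointers of the right element type.
    uint8_t* input1_ = (uint8_t*)THCudaByteTensor_data(state, input1);
    uint8_t* input2_ = (uint8_t*)THCudaByteTensor_data(state, input2);
    int16_t* output_ = (int16_t*)THCudaShortTensor_data(state, output);

    // Allocate the intermediate buffer inside the wrapper. The element
    // count is a placeholder -- use however many int16 values the first
    // kernel actually writes.
    ptrdiff_t n = THCudaShortTensor_nElement(state, output);
    int16_t* temporary_out = NULL;
    THCudaMalloc(state, (void**)&temporary_out, n * sizeof(int16_t));

    my_cuda_kernel<<<32, 32, 0, THCState_getCurrentStream(state)>>>(
        input1_, input2_, temporary_out);
    my_cuda_kernel_2<<<32, 32, 0, THCState_getCurrentStream(state)>>>(
        input1_, input2_, temporary_out, output_);

    THCudaFree(state, temporary_out);
}

An alternative, if those names exist in your THC version, is to create a temporary THCudaShortTensor with THCudaShortTensor_newWithSize1d and release it with THCudaShortTensor_free, which keeps the scratch buffer inside a tensor object instead of a raw pointer.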

What if you keep the data as float*? Is there any issue with that?

My kernel implements a complicated image-processing algorithm, so it naturally takes RGB values in as uint8. Sure, I could convert them to float, but that takes four times the memory, and I suspect the GPU would be slower at handling float.

I have not used the integer tensor (THCudaIntTensor*) myself so far, but I can see that it is used in Torch layers.
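
If it helps, the integer variants should follow the same struct-per-dtype pattern as the float one; a tiny sketch assuming the THC headers:

#include <THC/THC.h>

// Sketch: accessing an int32 tensor's device data and element count.
void int_example(THCState* state, THCudaIntTensor* t)
{
    int* data = THCudaIntTensor_data(state, t);       // int32 device pointer
    ptrdiff_t n = THCudaIntTensor_nElement(state, t); // number of elements
    // launch a kernel over data[0..n) here
}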