I’m implementing a custom CUDA operation that involves typecasting.
What I want to achieve, written in Python, is similar to:
>>> a = torch.ones(10).to('cuda')
>>> a
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:0')
>>> a.dtype
torch.float32
>>> MyModule.myCustomOp(a, scaling_factor=100)
>>> a
tensor([100, 100, 100, 100, 100, 100, 100, 100, 100, 100], device='cuda:0', dtype=torch.int32)
A simplified version of my C++/CUDA code looks like the following:
__global__ void my_kernel(float* input, float scaling_factor, int64_t numel) {
    auto index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= numel) return;  // guard threads past the end of the buffer
    auto integer = int32_t(roundf(input[index] * scaling_factor));
    // some other logic, not important here
    // reinterpret the same storage as int32 and write the result in place
    int32_t* int_arr = reinterpret_cast<int32_t*>(input);
    int_arr[index] = integer;
}
void myCustomOp(torch::Tensor input, float scaling_factor) {
    my_kernel<<<GRID_DIM, BLOCK_DIM>>>(input.data_ptr<float>(), scaling_factor,
                                       input.numel());
    CUDA_CHECK(cudaDeviceSynchronize());
    // how do I cast the dtype of input to int32 here?
}
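For context, I expose the function to Python with a standard pybind11 binding, roughly like this (sketch; the module name is whatever gets imported as MyModule, and the named arguments are just so the scaling_factor=100 keyword call works):

#include <torch/extension.h>

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    // bind myCustomOp so Python can call MyModule.myCustomOp(a, scaling_factor=100)
    m.def("myCustomOp", &myCustomOp,
          py::arg("input"), py::arg("scaling_factor"));
}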
My implementation does not change the dtype of the input, so the result is still reported as torch.float32 and the int32 bit patterns are decoded as meaningless float values.
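To illustrate what those garbage values look like: after the kernel runs, each element holds an int32 bit pattern that still gets decoded as a float32, so the integer 100 shows up as a tiny denormal. A minimal plain-C++ sketch of that reinterpretation:

#include <cstdint>
#include <cstring>
#include <cstdio>

int main() {
    int32_t i = 100;
    float f;
    std::memcpy(&f, &i, sizeof(f));  // reinterpret the int32 bits as float32
    std::printf("%g\n", f);          // prints ~1.4013e-43, a denormal
}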
An alternative is to allocate a new int32 tensor and return it (sketched below), but I’d like to do this in-place if possible.
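For reference, the out-of-place variant I have in mind would look roughly like this (untested sketch; my_kernel_out is a hypothetical second kernel, and GRID_DIM/BLOCK_DIM are the same placeholders as above):

__global__ void my_kernel_out(const float* input, int32_t* output,
                              float scaling_factor, int64_t numel) {
    auto index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= numel) return;
    output[index] = int32_t(roundf(input[index] * scaling_factor));
}

torch::Tensor myCustomOpOut(torch::Tensor input, float scaling_factor) {
    // fresh int32 tensor with the same shape and device as the float input
    auto output = torch::empty_like(input, input.options().dtype(torch::kInt32));
    my_kernel_out<<<GRID_DIM, BLOCK_DIM>>>(input.data_ptr<float>(),
                                           output.data_ptr<int32_t>(),
                                           scaling_factor, input.numel());
    CUDA_CHECK(cudaDeviceSynchronize());
    return output;
}

This works, but it allocates a second buffer the same size as the input, which is exactly what I’m trying to avoid.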
Any suggestions, please?