I am writing a CUDA kernel for my project. I have been following the PyTorch CUDA extension tutorials. However, as I understand it, such an approach only supports operations on contiguous tensors. How can I improve my extension to support non-contiguous tensors?
Pointers to the code where PyTorch itself handles non-contiguous tensors would also be very helpful.
Thank you in advance!
First of all, calling
.contiguous() on the input will make sure you have a contiguous tensor, and the cost won’t be noticeable for most workloads. I would recommend this solution as it is much simpler and may actually be faster than the non-contiguous counterpart.
To support non-contiguous tensors, you would need to access each element by properly taking into account the
stride of each dimension. So the element at index [ind0, ind1] is stored at
data_ptr + storage_offset + ind0*stride0 + ind1*stride1. The catch is that this can turn reads that were contiguous in your CUDA kernel into non-contiguous ones and destroy your kernel’s performance.