Hi,
First of all, calling .contiguous()
on the input will make sure you have a contiguous tensor, and its cost won’t be noticeable for most workloads. I would recommend this solution as it is much simpler and may actually be faster than the non-contiguous counterpart.
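For reference, a minimal sketch of what that could look like on the C++ side of a CUDA extension (the function name here is just a placeholder, not part of your code):

```cpp
#include <torch/extension.h>

// Hypothetical launcher: force the input to be contiguous before handing
// a raw pointer to the kernel, so the kernel can keep assuming a dense,
// row-major layout.
void my_op_launcher(torch::Tensor input) {
  // .contiguous() returns the same tensor if it is already contiguous;
  // otherwise it copies into a dense buffer.
  torch::Tensor x = input.contiguous();
  const float* data = x.data_ptr<float>();
  // ... launch your kernel with `data` as before ...
}
```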
To support non-contiguous tensors, you would need to access each element by properly taking the stride
of each dimension into account. For a 2D tensor, the element val[ind0, ind1] lives at data_ptr + storage_offset + ind0*stride0 + ind1*stride1
. The thing is that this can turn reads that were coalesced (contiguous) in CUDA into non-contiguous ones and destroy your kernel’s performance.
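To make the indexing concrete, here is a rough sketch of a 2D elementwise kernel that applies the stride formula above (the kernel name and the scaling operation are just for illustration):

```cuda
#include <cstdint>

// Hypothetical kernel over a possibly non-contiguous 2D tensor: each
// element is located via data + ind0*stride0 + ind1*stride1. Strides are
// in elements (as ATen reports them), and the storage offset is already
// folded into the pointer returned by data_ptr().
__global__ void scale_strided_2d(float* data,
                                 int64_t size0, int64_t size1,
                                 int64_t stride0, int64_t stride1,
                                 float alpha) {
  int64_t idx = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (idx >= size0 * size1) return;

  // Recover the logical 2D index from the flat thread index.
  int64_t ind0 = idx / size1;
  int64_t ind1 = idx % size1;

  // Strided access: for a non-contiguous input, neighbouring threads may
  // no longer touch neighbouring memory, so accesses are not guaranteed
  // to be coalesced anymore.
  data[ind0 * stride0 + ind1 * stride1] *= alpha;
}
```

On the host side you would pass x.size(0), x.size(1), x.stride(0) and x.stride(1); since ATen strides are expressed in number of elements (not bytes), no extra multiplication by the element size is needed in the kernel.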