Correct way to implement CUDA spatial operations

Hello,
I have been working on PyTorch internals to figure out how to contribute some modules I need, and I have a few questions.

If there is a more appropriate forum for development-related questions, let me know!

What is the general rule for spatial operations?

I have been looking at some of the CUDA kernels implemented in THCUNN, trying to figure out how threads behave.

From my understanding, a thread is identified by threadIdx.x and blockIdx.x, and from these you can deduce a position in the tensor you are working on, so that adjacent threads work on values that are close together in memory. That makes sense from an optimization point of view, and it also seems to be stated here

This is, for example, what I can see for bilinear upsampling, where index is converted to spatial coordinates here

Every thread handles all the values stored at its (w, h) position in the tensor.
If index is incremented (that is, threadIdx.x is one higher, so this is the adjacent thread), w is incremented, which is logical since the smallest stride of our tensor is that of the last dimension when it is contiguous.
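To make the mapping concrete, here is a minimal Python sketch of it (not the THCUNN code itself). It assumes a contiguous tensor in (N, C, H, W) order; the sizes and the helper names `contiguous_strides`, `index_to_hw` and `offset` are made up for illustration.

```python
def contiguous_strides(shape):
    """Element strides of a contiguous (row-major) tensor of the given shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def index_to_hw(index, width):
    """Split a flat thread index into (h, w), with w varying fastest."""
    return index // width, index % width


# Hypothetical sizes, chosen only for illustration.
N, C, H, W = 2, 3, 4, 5
strides = contiguous_strides((N, C, H, W))


def offset(n, c, h, w):
    """Flat element offset of (n, c, h, w) under the strides above."""
    return n * strides[0] + c * strides[1] + h * strides[2] + w * strides[3]


# Two adjacent thread indices that land on the same image row.
h0, w0 = index_to_hw(7, W)
h1, w1 = index_to_hw(8, W)

print(strides)                                       # last dimension has stride 1
print((h0, w0), (h1, w1))                            # only w changes
print(offset(0, 0, h1, w1) - offset(0, 0, h0, w0))   # distance in elements
```

With these assumptions, incrementing index by one moves w by one, and the two threads touch neighbouring elements for a fixed batch and channel.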

Actually, are we sure about that? Is there some preprocessing that makes the strides the same regardless of the input variable given to the high-level pytorch module?

What I can also see is that a thread deals with all the values located at its (w, h) position, regardless of batch and channel (see here).
Aren’t the batch and channel strides the largest? That would make the thread gather data from memory locations far apart, which would invalidate the memory coalescing argument.
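As a rough sketch of the access pattern in question (again in Python, not the actual kernel), the snippet below lists the element offsets one thread visits when it loops over batch and channel. It assumes a contiguous (N, C, H, W) tensor with made-up sizes, and `offsets_for_thread` is a hypothetical helper.

```python
def offsets_for_thread(index, shape):
    """Offsets one thread visits when it loops over batch and channel."""
    n_size, c_size, h_size, w_size = shape
    h, w = index // w_size, index % w_size
    return [
        ((n * c_size + c) * h_size + h) * w_size + w
        for n in range(n_size)
        for c in range(c_size)
    ]


shape = (2, 3, 4, 5)  # hypothetical N, C, H, W
thread_a = offsets_for_thread(7, shape)
thread_b = offsets_for_thread(8, shape)

# Distance between successive reads made by a single thread.
per_thread_gaps = [b - a for a, b in zip(thread_a, thread_a[1:])]

# Distance between two adjacent threads at the same loop iteration.
cross_thread_gaps = [b - a for a, b in zip(thread_a, thread_b)]

print(per_thread_gaps)
print(cross_thread_gaps)
```

Under these assumptions, a single thread's successive reads are H * W elements apart, while adjacent threads at the same iteration read adjacent elements. As far as I understand, coalescing is about the second kind of distance (what neighbouring threads read at the same instruction), which may be why the kernel is written this way.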

Funnily enough, if I look up another spatial module, such as SpatialGridSampler, batch and channel are now the first to be derived from the thread index. Is it strided differently then?
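For comparison, here is a hypothetical Python sketch of a decomposition where the batch index varies fastest. It is only meant to illustrate the question, not to reproduce the SpatialGridSampler kernel; the function names and sizes are assumptions, and the tensor is again taken to be contiguous in (N, C, H, W) order.

```python
def batch_first_decompose(index, n_size, h_size, w_size):
    """Hypothetical decomposition where the batch index varies fastest."""
    n = index % n_size
    h = (index // n_size) % h_size
    w = (index // (n_size * h_size)) % w_size
    return n, h, w


def nchw_offset(n, c, h, w, shape):
    """Flat element offset in a contiguous (N, C, H, W) tensor."""
    _, c_size, h_size, w_size = shape
    return ((n * c_size + c) * h_size + h) * w_size + w


shape = (2, 3, 4, 5)  # hypothetical N, C, H, W
n0, h0, w0 = batch_first_decompose(0, shape[0], shape[2], shape[3])
n1, h1, w1 = batch_first_decompose(1, shape[0], shape[2], shape[3])

# Distance in elements between what two adjacent threads touch (channel 0).
gap = nchw_offset(n1, 0, h1, w1, shape) - nchw_offset(n0, 0, h0, w0, shape)
print((n0, h0, w0), (n1, h1, w1), gap)
```

With this ordering and a contiguous layout, adjacent threads end up C * H * W elements apart, which is what makes me wonder whether the tensor is laid out differently in that module.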

Thanks in advance for your answer!