How to read memory like PyTorch? Even after transpose?

I think what you mean is that when you do a.t() and then read that memory in your CUDA kernel, a does not appear to be transposed? I believe this is because a.t() does not actually transpose the data in memory; it just returns a view with the strides swapped. So in your CUDA kernel, when you read b, you are still reading the original layout of a, not the transposed version of a.
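A quick way to see this from Python is to compare the strides and data pointer of a and a.t() (a minimal sketch, not code from either thread):

```python
import torch

a = torch.arange(6, dtype=torch.float32).reshape(2, 3)
b = a.t()  # view: same storage, only the strides are swapped

print(a.stride())                     # (3, 1) -- row-major layout of a
print(b.stride())                     # (1, 3) -- no data was moved
print(b.is_contiguous())              # False
print(a.data_ptr() == b.data_ptr())   # True: both point at the same memory

# A raw kernel that walks this pointer linearly still sees
# 0 1 2 3 4 5, i.e. the elements of a, not of a.t().
```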

I ran into the same issue when I was implementing a fully connected layer in CUDA. I provided a link to my implementation as a reply to your other thread here. What I did was flatten the array before calling the CUDA kernel, so that b would be contiguous and correctly transposed by the time the kernel reads it.
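In current PyTorch you can get the same effect with .contiguous(), which copies the transposed view into a fresh buffer; this should be equivalent to flattening the array by hand (a minimal sketch of the idea, the linked implementation may differ in detail):

```python
import torch

a = torch.randn(4, 8, device="cuda")

# Materialize the transpose into new, contiguous memory before the kernel
# sees it. .contiguous() copies the data so the element order in memory
# matches what the kernel expects for the transposed matrix.
b = a.t().contiguous()

print(b.is_contiguous())              # True
print(a.data_ptr() == b.data_ptr())   # False: b has its own buffer

# b.data_ptr() can now be handed to a custom CUDA kernel that reads the
# matrix as a plain row-major (8, 4) array.
```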