I think what you mean is: when you call `a.t()` and then read that memory in your CUDA kernel, `a` does not appear transposed? I believe this is because `a.t()` does not actually transpose the data in memory; it just returns a view of the same data with swapped strides. So when your CUDA kernel reads `b`, it is still reading `a`'s original memory layout, not the transposed version of `a`.
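To make the view-vs-memory distinction concrete, here is a small pure-Python sketch (not PyTorch itself, just an illustration of the same idea): a "transposed view" only swaps the strides used for indexing, so anything that walks the raw buffer linearly, the way a naive CUDA kernel would, still sees the original row-major order until you materialize a contiguous copy.

```python
rows, cols = 2, 3
# Row-major buffer for a = [[0, 1, 2], [3, 4, 5]]
buf = [0, 1, 2, 3, 4, 5]

# a.t() as a "view": element (i, j) of the transpose is read from the
# SAME buffer with swapped strides -- no data is moved.
def t_view(i, j):
    return buf[j * cols + i]

# A kernel reading the buffer linearly still sees the original layout:
linear_read = list(buf)
print(linear_read)  # [0, 1, 2, 3, 4, 5] -- original, untransposed order

# Materializing the transpose (what .contiguous() or flattening before
# the kernel launch does) actually rearranges the data in memory:
contig = [t_view(i, j) for i in range(cols) for j in range(rows)]
print(contig)       # [0, 3, 1, 4, 2, 5] -- truly transposed layout
```

This is why the fix below (flattening to force a contiguous copy before the kernel launch) works: the kernel then reads the rearranged buffer, not the original one.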
I ran into the same thing when I was implementing a fully connected layer in CUDA. I provided a link to my implementation as a reply to your other thread here. What I did was flatten the array (which forces a contiguous copy) before launching the CUDA kernel, so that `b` is contiguous and correctly transposed by the time the kernel reads it.