Tensor dimension ordering for efficient memory access

Hello, I have a question about memory layout and access efficiency.
Let me explain the situation first and then state the question.

I have a certain number of nodes (points, vertices). Each point has its own embedding of a specific length (somewhere between 3 and 1024).
So the tensor x has shape [batch_size, embedding_length, num_points].
(By the way, my batch size is always 1.)

I also have an array of random indices named idx_p0 (e.g., [0, 71, 234, 22, 0, …, 2]).

Using this idx_p0, I do random access into x:
y = x[:, :, idx_p0]
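
For reference, here is a minimal sketch of my setup; the concrete sizes and the device selection are just placeholders for illustration:

```python
import torch

# Placeholder sizes: my real embedding length is somewhere in 3~1024.
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size, emb_len, num_points = 1, 256, 1024

x = torch.randn(batch_size, emb_len, num_points, device=device)      # [B, C, N]
idx_p0 = torch.randint(0, num_points, (num_points,), device=device)  # random indices

y = x[:, :, idx_p0]   # random access along the points dimension
print(y.shape)        # torch.Size([1, 256, 1024])
```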

Here comes the question.
Since I do random access into x, this triggers memory accesses (reads from GPU memory).
I want to do this in an efficient way.

Is it wise to change the layout of x with x_bar = x.transpose(2, 1).contiguous()?
Then x_bar has shape [batch_size, num_points, embedding_length], and the random access becomes y = x_bar[:, idx_p0, :].
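
To make the comparison concrete, this is roughly what I mean by the two variants (sizes are placeholders again); both pick the same points, just in a different layout:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size, emb_len, num_points = 1, 256, 100_000

x = torch.randn(batch_size, emb_len, num_points, device=device)      # [B, C, N]
idx_p0 = torch.randint(0, num_points, (num_points,), device=device)

# Variant A: gather along the last dimension of the [B, C, N] layout.
y_a = x[:, :, idx_p0]

# Variant B: transpose to [B, N, C] first, then gather along dimension 1.
x_bar = x.transpose(2, 1).contiguous()
y_b = x_bar[:, idx_p0, :]

# Same values, different layout.
assert torch.equal(y_a.transpose(2, 1), y_b)
```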

I also want to know how PyTorch stores a 3-dimensional tensor in memory.
Oh, and again: my batch_size is usually only 1.
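
As a rough check I printed the strides; my guess is that a contiguous tensor is stored row-major (the last dimension varies fastest), which seems consistent with what .stride() reports:

```python
import torch

x = torch.randn(1, 256, 1024)    # [B, C, N], freshly created -> contiguous
print(x.stride())                # (262144, 1024, 1): last dim is densest

x_bar = x.transpose(2, 1)        # [B, N, C], just a view, not contiguous
print(x_bar.stride())            # (262144, 1, 1024)

x_bar_c = x_bar.contiguous()     # copies the data into the new layout
print(x_bar_c.stride())          # (262144, 256, 1)
```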

Thank you.