Surprising convention for grid sample coordinates


The coordinate convention of the grid passed to grid_sample is a bit surprising to me: it seems to follow the standard tensor indexing order, but reversed.

Here is a code sample that shows what I mean:

import torch

# 4D input with shape (N, C, H, W)
a = torch.rand(1, 1, 4, 3)
# One sampling location; grid shape is (N, H_out, W_out, 2)
grid_2d = torch.tensor([[[[1.0, -1.0]]]])
b = torch.nn.functional.grid_sample(a, grid_2d, align_corners=True)
# b corresponds to the top-right corner, i.e. the grid is in (j, i) order
print(a[0, 0, 0, 2], b)

# 5D input with shape (N, C, D, H, W)
c = torch.rand(1, 1, 4, 3, 2)
# One sampling location; grid shape is (N, D_out, H_out, W_out, 3)
grid_3d = torch.tensor([[[[[1.0, 1.0, -1.0]]]]])
d = torch.nn.functional.grid_sample(c, grid_3d, align_corners=True)
# Again, the convention here is (k, j, i)
print(c[0, 0, 0, 2, 1], d)

I was wondering whether there is any reasoning behind this convention, as it seems counter-intuitive to me.

Thank you for your help.



Sorry to insist, but I think this should at least be stated in the documentation: the coordinate system used to specify the sampling positions in the input tensor (i.e. the index tensor passed as grid) is not intuitive.

I’m not sure if I misunderstand the question, but I would expect this result based on the description from the docs:

grid specifies the sampling pixel locations normalized by the input spatial dimensions. Therefore, it should have most values in the range of [-1, 1]. For example, values x = -1, y = -1 is the left-top pixel of input, and values x = 1, y = 1 is the right-bottom pixel of input.

import torch

a = torch.rand(1, 1, 4, 3)
top_left = torch.nn.functional.grid_sample(a, torch.tensor([[[[-1, -1]]]]).type(torch.FloatTensor), align_corners=True)
top_right = torch.nn.functional.grid_sample(a, torch.tensor([[[[1, -1]]]]).type(torch.FloatTensor), align_corners=True)
bottom_left = torch.nn.functional.grid_sample(a, torch.tensor([[[[-1, 1]]]]).type(torch.FloatTensor), align_corners=True)
bottom_right = torch.nn.functional.grid_sample(a, torch.tensor([[[[1, 1]]]]).type(torch.FloatTensor), align_corners=True)
# Each sampled value should match the corresponding corner of the input
print(a[0, 0, 0, 0], top_left)
print(a[0, 0, 0, -1], top_right)
print(a[0, 0, -1, 0], bottom_left)
print(a[0, 0, -1, -1], bottom_right)

Is the confusion created because of the usage of an image coordinate system with x and y axes?
Note that the x-axis is along the width of an image and the y-axis is along its height.
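To make the axis mapping concrete, here is a minimal sketch of how one normalized coordinate maps to an index along one axis. The helper name grid_coord_to_index is hypothetical (it is not part of PyTorch); the two formulas are my understanding of how grid_sample unnormalizes coordinates for align_corners=True and align_corners=False.

```python
def grid_coord_to_index(coord, size, align_corners=True):
    """Map a normalized coordinate in [-1, 1] to a (fractional) index along one axis.

    Hypothetical helper for illustration only; not a PyTorch function.
    """
    if align_corners:
        # -1 and 1 refer to the centers of the first and last elements.
        return (coord + 1) / 2 * (size - 1)
    # -1 and 1 refer to the outer edges of the first and last elements.
    return ((coord + 1) * size - 1) / 2


# Input of shape (N, C, H, W) = (1, 1, 4, 3), as in the snippet above.
H, W = 4, 3
x, y = 1.0, -1.0  # one grid entry, ordered (x, y)

col = grid_coord_to_index(x, W)  # x runs along the width  -> last tensor dimension
row = grid_coord_to_index(y, H)  # y runs along the height -> second-to-last dimension
print(row, col)  # 0.0 2.0, i.e. the element a[0, 0, 0, 2]
```

So the grid entry (x, y) = (1, -1) lands on row 0, column 2, which is the top-right element.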


Thank you for your answer. My confusion arises when the input is a 5D tensor (i.e. a batch of 3D inputs with channels). For an element c[n, ch, i, j, k], the grid entry is ordered (k, j, i), i.e. z, y, x if the spatial dimensions are labeled x, y, z in indexing order. I find this surprising, and it is not specified in the documentation.
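For anyone hitting the same thing, here is a small sketch that checks the reversed order on a 5D input. The helper index_to_grid is hypothetical (not a PyTorch function); it assumes align_corners=True and spatial sizes greater than 1. It names the axes the image way, with x along the width W, y along the height H and z along the depth D, which is an assumption about naming on my side.

```python
import torch
import torch.nn.functional as F


def index_to_grid(index, spatial_shape):
    """Convert an integer index (d, h, w) into a grid entry (x, y, z).

    Hypothetical helper; assumes align_corners=True and every size > 1.
    """
    # Normalize each index to [-1, 1] so that 0 -> -1 and size - 1 -> 1.
    normalized = [2.0 * i / (s - 1) - 1.0 for i, s in zip(index, spatial_shape)]
    # The grid lists the last tensor dimension first, so reverse the order.
    return normalized[::-1]


c = torch.rand(1, 1, 4, 3, 2)  # (N, C, D, H, W)
d_idx, h_idx, w_idx = 0, 2, 1

xyz = index_to_grid((d_idx, h_idx, w_idx), c.shape[2:])
grid = torch.tensor(xyz, dtype=c.dtype).view(1, 1, 1, 1, 3)  # (N, D_out, H_out, W_out, 3)

sampled = F.grid_sample(c, grid, align_corners=True)
print(xyz)  # [1.0, 1.0, -1.0]
# Compare the sampled value with direct indexing in (d, h, w) order.
print(torch.allclose(sampled.squeeze(), c[0, 0, d_idx, h_idx, w_idx]))
```

The grid entry it builds is the same [1, 1, -1] used in the first post, so direct indexing runs (d, h, w) while the grid runs in the opposite order.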

That is what I meant.


Ah OK, thanks for the follow up.
You could create an issue and suggest improving the docs.
Would you be interested in creating a PR with this improvement?


I will create an issue suggesting that this be specified in the docs. Sorry, but I don’t want to handle the PR for this. Thank you for your help.