Surprising convention for grid sample coordinates

Hi,

The coordinate convention that grid_sample uses for its sampling grid is a bit surprising to me: it seems to follow standard tensor indexing, but in reverse order.

Here is a code sample that shows what I mean:

import torch

a = torch.rand(1, 1, 4, 3)
b = torch.nn.functional.grid_sample(a, torch.tensor([[[[1.0, -1.0]]]]), align_corners=True)
# b corresponds to the top-right corner, i.e. the grid is given in (j, i) order
print(a[0, 0, 0, 2], b)

c = torch.rand(1, 1, 4, 3, 2)
d = torch.nn.functional.grid_sample(c, torch.tensor([[[[[1.0, 1.0, -1.0]]]]]), align_corners=True)
# Again, the grid is given in (k, j, i) order
print(c[0, 0, 0, 2, 1], d)

I was wondering whether there is any reasoning behind this convention, as it seems counter-intuitive to me.

Thank you for your help,

Samuel

Sorry to insist, but I think this should at least be specified in the documentation: the coordinate system used to specify where to sample in the input tensor (i.e. the index tensor passed as grid) is not intuitive.

I’m not sure whether I misunderstand the question, but I would expect this result based on the description from the docs:

grid specifies the sampling pixel locations normalized by the input spatial dimensions. Therefore, it should have most values in the range of [-1, 1]. For example, values x = -1, y = -1 is the left-top pixel of input, and values x = 1, y = 1 is the right-bottom pixel of input.

a = torch.rand(1, 1, 4, 3)
top_left = torch.nn.functional.grid_sample(a, torch.tensor([[[[-1, -1]]]]).type(torch.FloatTensor), align_corners=True)
top_right = torch.nn.functional.grid_sample(a, torch.tensor([[[[1, -1]]]]).type(torch.FloatTensor), align_corners=True)
bottom_left = torch.nn.functional.grid_sample(a, torch.tensor([[[[-1, 1]]]]).type(torch.FloatTensor), align_corners=True)
bottom_right = torch.nn.functional.grid_sample(a, torch.tensor([[[[1, 1]]]]).type(torch.FloatTensor), align_corners=True)
# Each (x, y) pair below selects one corner of a: x runs along the width, y along the height
print(a[0, 0, 0, 0], top_left)
print(a[0, 0, 0, -1], top_right)
print(a[0, 0, -1, 0], bottom_left)
print(a[0, 0, -1, -1], bottom_right)

Is the confusion created because of the usage of an image coordinate system with x and y axes?
Note that the x-axis is along the width of an image and the y-axis is along its height.
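
To make the ordering concrete, here is a minimal sketch of how an ordinary tensor index could be converted into the coordinate that grid_sample expects. It is not part of PyTorch: index_to_grid_coords is a hypothetical helper name, and the formula assumes align_corners=True, where -1 and 1 refer to the centers of the first and last elements along each axis.

```python
def index_to_grid_coords(index, spatial_shape):
    """Map a tensor index such as (row, col) to grid_sample coordinates.

    Hypothetical helper, assuming align_corners=True: each index is
    normalized to [-1, 1] and the order is reversed, so (row, col)
    becomes (x, y) and (depth, row, col) becomes (x, y, z).
    """
    coords = []
    for i, size in zip(index, spatial_shape):
        # A size-1 axis has a single sample that every coordinate reaches; use 0.
        coords.append(2.0 * i / (size - 1) - 1.0 if size > 1 else 0.0)
    # The grid lists the last tensor axis (width) first.
    return tuple(reversed(coords))


# Top-right pixel of an (H=4, W=3) image: row 0, column 2.
print(index_to_grid_coords((0, 2), (4, 3)))  # (1.0, -1.0)
```

The same reversal covers the 3D case: index (0, 2, 1) in a (4, 3, 2) volume maps to (1.0, 1.0, -1.0), which is the grid value used in the 5D example at the top of this thread.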

Hi,

Thank you for your answer. My confusion comes from the case where the input is a 5D tensor (a batch of 3D inputs with some channels). There the convention is k, j, i, meaning z, y, x, i.e. the reverse of the tensor's index order, which I find surprising (and which is not specified in the documentation).

That is what I meant.
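
For reference, here is a minimal sketch of the 5D case. The helper name sample_voxel is made up for illustration, and I use mode="nearest" only so that the comparison is exact (that choice is my assumption, not something grid_sample requires). It checks that the voxel at tensor index (d, h, w) is reached by a grid entry ordered (x, y, z), i.e. (w, h, d) after normalization.

```python
import torch
import torch.nn.functional as F

# 5D input laid out as (N, C, D, H, W); arange makes every voxel unique.
vol = torch.arange(4 * 3 * 2, dtype=torch.float32).reshape(1, 1, 4, 3, 2)


def sample_voxel(vol, d, h, w):
    """Fetch voxel (d, h, w) through grid_sample (hypothetical helper)."""
    D, H, W = vol.shape[-3:]
    # Normalize each index to [-1, 1], assuming align_corners=True.
    z = 2 * d / (D - 1) - 1
    y = 2 * h / (H - 1) - 1
    x = 2 * w / (W - 1) - 1
    # The last grid dimension is ordered (x, y, z), the reverse of (d, h, w).
    grid = torch.tensor([[[[[x, y, z]]]]], dtype=torch.float32)
    out = F.grid_sample(vol, grid, mode="nearest", align_corners=True)
    return out.item()


# Same position as in my first post: c[0, 0, 0, 2, 1] with grid [1, 1, -1].
print(sample_voxel(vol, 0, 2, 1), vol[0, 0, 0, 2, 1].item())  # 5.0 5.0
```

Looping over every (d, h, w) gives the same agreement, so the grid order really is the reverse of the tensor index order.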

Samuel

Ah OK, thanks for the follow up.
You could create an issue and suggest improving the docs.
Would you be interested in creating a PR with this improvement?

Hi,

I will create an issue to suggest adding this specification to the docs. Sorry, but I don’t want to handle the PR for this. Thank you for your help.

@ptrblck, I noticed the same thing. I think that, from a code perspective, the opposite is the more intuitive and expected outcome.

Exactly. I was expecting the image coordinate system to be the same as in a tensor, a 2D matrix, or any other indexed container. In the terminology you used, that would put the x-axis along the height (rows) and the y-axis along the width.

Is it worth a PR if I look at it and change the code that way?
Instead of changing the docs, I think it would be more appropriate to change the behavior.

Or maybe there is some logic in that choice that I don’t get at first glance. I would gladly hear what stands behind it.

Changes that break backwards compatibility are not easily accepted, but you should discuss it with the code owners by creating a GitHub issue and describing your idea.

The camera coordinate system is defined by having the origin in the “top left” corner and using positive values for the x- and y-axis and is a standard in image processing.
I know that it can be confusing, especially if you are more used to working with general “matrices”.

This also confused me for days. One possible explanation for the choice of “x-axis along the width, y-axis along the height” is that OpenCV uses the same coordinate system definition.

Maybe calling it the ‘left-top’ corner is better here, since x corresponds to left and y corresponds to top.