[SOLVED] torch.grid_sample?

Hi, thank you always for your support.
I cannot understand the output of torch.grid_sampler or torch.nn.functional.grid_sample.
Before reading the code, I expected the grid_sample function to calculate an output tensor sampled from the input tensor.
What does grid_sample do?

Thank you in advance:)


The method samples the output from the input using the specified grid.
Have a look at this example:

import torch

input = torch.arange(4*4).view(1, 1, 4, 4).float()
print(input)
> tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]]]])

# Create grid to upsample input
d = torch.linspace(-1, 1, 8)
meshx, meshy = torch.meshgrid((d, d), indexing='ij')
grid = torch.stack((meshy, meshx), 2)
grid = grid.unsqueeze(0) # add batch dim

# align_corners=True reproduces the output below (the default before PyTorch 1.3)
output = torch.nn.functional.grid_sample(input, grid, align_corners=True)
print(output)
> tensor([[[[ 0.0000,  0.4286,  0.8571,  1.2857,  1.7143,  2.1429,  2.5714,
            3.0000],
          [ 1.7143,  2.1429,  2.5714,  3.0000,  3.4286,  3.8571,  4.2857,
            4.7143],
          [ 3.4286,  3.8571,  4.2857,  4.7143,  5.1429,  5.5714,  6.0000,
            6.4286],
          [ 5.1429,  5.5714,  6.0000,  6.4286,  6.8571,  7.2857,  7.7143,
            8.1429],
          [ 6.8571,  7.2857,  7.7143,  8.1429,  8.5714,  9.0000,  9.4286,
            9.8571],
          [ 8.5714,  9.0000,  9.4286,  9.8571, 10.2857, 10.7143, 11.1429,
           11.5714],
          [10.2857, 10.7143, 11.1429, 11.5714, 12.0000, 12.4286, 12.8571,
           13.2857],
          [12.0000, 12.4286, 12.8571, 13.2857, 13.7143, 14.1429, 14.5714,
           15.0000]]]])

Thank you always for your kind reply, ptrblck.
Your provided sample code is quite understandable and helpful for me.
I now see what torch.grid_sample does.
I am grateful for your continuous support!


Hi @ptrblck, that was a nice explanation. However, I would like to know the intuition behind selecting the linspace range and interval.
Also, if I wanted to downsample, or apply an affine transform using this, how can I proceed? How would I create the grid if, say, I want a piecewise affine transform and I have some source and destination points? For the sake of example, let’s assume src points [[1,1],[2,2],[2,3],[1,2]] and dst points [[1,1],[1,2],[2,2],[1,2]] (transforming a trapezium to a square). How do I start with this?

That’s an interesting use case and I’m not sure at the moment how you could easily create the grid for these transformations. So far I’ve stuck to matrix operations, as seen e.g. here for a rotation.

@Hmrishav_Bandyopadhy If you want to do an affine transformation, then you can use code similar to what ptrblck shows above, but instead of using linspace/meshgrid to produce the grid, use F.affine_grid().
You just have to pass it a matrix theta which contains the affine parameters for your transformation.
Then pass the resulting grid to grid_sample.
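To make that concrete, here is a minimal sketch (not from the thread) of the affine_grid → grid_sample pipeline; the 45-degree rotation is an arbitrary choice of theta for illustration:

```python
import math
import torch
import torch.nn.functional as F

img = torch.arange(16.).view(1, 1, 4, 4)  # (N, C, H, W)

angle = math.pi / 4
# theta is a (N, 2, 3) batch of 2x3 affine matrices in normalized coordinates
theta = torch.tensor([[[math.cos(angle), -math.sin(angle), 0.],
                       [math.sin(angle),  math.cos(angle), 0.]]])

# affine_grid turns theta into the (N, H, W, 2) sampling grid ...
grid = F.affine_grid(theta, size=(1, 1, 4, 4), align_corners=True)
# ... which grid_sample then uses to interpolate the input
out = F.grid_sample(img, grid, align_corners=True)
print(out.shape)  # torch.Size([1, 1, 4, 4])
```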

For upsampling/downsampling specifically, I wouldn’t recommend using grid_sample (although, of course, you could if you really wanted to). For this, you should probably use F.interpolate(), which is specifically for upsampling and downsampling.
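For example, a small sketch of F.interpolate in both directions (the sizes and modes here are arbitrary illustrations):

```python
import torch
import torch.nn.functional as F

img = torch.arange(16.).view(1, 1, 4, 4)  # (N, C, H, W)

# Upsample 4x4 -> 8x8 and downsample 4x4 -> 2x2 with bilinear interpolation
up = F.interpolate(img, size=(8, 8), mode='bilinear', align_corners=True)
down = F.interpolate(img, scale_factor=0.5, mode='bilinear', align_corners=False)
print(up.shape, down.shape)  # torch.Size([1, 1, 8, 8]) torch.Size([1, 1, 2, 2])
```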

If you’re wondering about the range of coordinates in the grid passed to grid sample, they range from -1 to 1, where -1 refers to either the top or left edges of the sampled image (in the x and y axes, respectively), while +1 refers to the bottom or right edges of the sampled image. So a grid point containing (0, 0), for example, would sample from the center of the image.
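A quick sketch to confirm that convention: with align_corners=True, a single grid point at (0, 0) should return the interpolated center of the image (for a 4x4 ramp of values 0..15, the average of the four middle pixels 5, 6, 9, 10):

```python
import torch
import torch.nn.functional as F

img = torch.arange(16.).view(1, 1, 4, 4)

# Grid of shape (N, H_out, W_out, 2) holding one point at the image center
grid = torch.zeros(1, 1, 1, 2)
center = F.grid_sample(img, grid, align_corners=True)
print(center)  # tensor([[[[7.5000]]]])
```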


@bnehoran Thanks for the detailed answer. Affine transform (the PyTorch function) works nicely for operations like rotation and zooming in and out. How is a skew operation, an image deformation, or a piecewise affine transform possible with F.affine_grid()?

It seems that theta has too few parameters to fully specify the grid for anything other than a rotation or a zoom (which is probably its intended use), which is why I resorted to directly computing the grid and applying the F.grid_sample() operation instead of going through F.affine_grid().

Yeah, you are right. If you want to use more complex transformations that are not affine, then you need to generate your own custom grid to pass to grid_sample.

If you want, you can always start by calling affine_grid and then modifying the grid that you get to your liking before passing it to grid_sample, but that’s entirely up to you.
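A hedged sketch of that workflow: start from an identity affine grid, then perturb it by hand before sampling (the sinusoidal displacement below is an arbitrary illustration, not a specific transform from the thread):

```python
import math
import torch
import torch.nn.functional as F

img = torch.arange(16.).view(1, 1, 4, 4)

# Identity theta gives an evenly spaced grid over [-1, 1]
theta = torch.tensor([[[1., 0., 0.],
                       [0., 1., 0.]]])
grid = F.affine_grid(theta, size=(1, 1, 4, 4), align_corners=True)

# Warp the grid by hand, e.g. add a small sinusoidal displacement in x
grid[..., 0] = grid[..., 0] + 0.1 * torch.sin(math.pi * grid[..., 1])

out = F.grid_sample(img, grid, align_corners=True, padding_mode='border')
print(out.shape)  # torch.Size([1, 1, 4, 4])
```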

Just keep in mind that if you generate your own grid by hand, you need to worry about things like which way you are setting the align_corners parameter in order to make sure that the grid you are producing has the right values to get your intended effect.


@bnehoran Can you elaborate on how a grid can be generated by hand (i.e., the constraints involved)? I have been opening issues and looking through the docs, but have not found a single reference to it. :confused:
I have been trying to “learn” a grid by keeping it as the output of a layer. However, since I know nothing of the constraints, it’s impossible to create the model! In fact, the models I have created to get some insight into the grid have failed miserably.

Okay, so I am posting part of the solution to my question above, which I found through some calculations and by observing the results from F.grid_sample().

Let us say grid_sample is working with a grid that has, say, dimensions 1x64x64x2, and the final dimensions of the image are 64x64. Now, say we are “filling” up the final image with values coming from the initial image. The first pixel (0,0) will have the value img[grid[0,0,0], grid[0,0,1]], where img is the original image and grid is the grid being used. The rest of the pixels get their values in a similar fashion.

Now, an important thing is that the grid is a float tensor and thus can’t be used for indexing. So my question here is: do we average the pixel values? Say the grid value at (0,0,0) is 41.5; do we compute (img[41,y] + img[42,y]) / 2.0 or not? This method of pixel averaging seems like a crude approach, hence I am unsure.
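To check how grid_sample actually resolves fractional coordinates (with mode='bilinear' it performs bilinear interpolation, weighting the four neighboring pixels by the fractional offsets rather than taking a plain average), here is a small sketch that reproduces one sampled value by hand; the point (0.3, -0.2) is an arbitrary in-bounds choice:

```python
import torch
import torch.nn.functional as F

img = torch.arange(16.).view(1, 1, 4, 4)

# One normalized sampling point, grid[..., 0] = x (width), grid[..., 1] = y (height)
x, y = 0.3, -0.2
grid = torch.tensor([[[[x, y]]]])
ref = F.grid_sample(img, grid, align_corners=True)

# Manual bilinear interpolation with the align_corners=True convention:
# unnormalize from [-1, 1] to pixel indices [0, size - 1]
H, W = 4, 4
px = (x + 1) / 2 * (W - 1)
py = (y + 1) / 2 * (H - 1)
x0, y0 = int(px), int(py)                      # fine for in-bounds, non-negative points
x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
wx, wy = px - x0, py - y0                      # fractional offsets act as weights
v = (img[0, 0, y0, x0] * (1 - wx) * (1 - wy)
     + img[0, 0, y0, x1] * wx * (1 - wy)
     + img[0, 0, y1, x0] * (1 - wx) * wy
     + img[0, 0, y1, x1] * wx * wy)
print(torch.isclose(v, ref[0, 0, 0, 0]))  # tensor(True)
```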

I still fail to understand, can you explain in detail?
Thanks in advance!

Sorry to dig up an old question with an accepted answer. I want to get an intuitive/geometric feel for this function. The docs mention a flow field grid, which I googled and understand in the sense of motion, but not generally.

I’d like to restate your example in terms of a single-channel image.

Let’s say I’ve got a single-channel image with resolution (4, 4). Then I want to find the color at some spatial point (2.5, 2.5), which is not captured by the resolution of the image. This seems like a case for upsampling to me.

Now in your example, your “flow field grid” is (1, 8, 8, 2), where it seems that grid[:, i, j, :] gives me something like a spatial coordinate (not index) of some unit box.

So somehow, I pass my input (4, 4) image in with a grid that has the resolution that I want and I get out an upsampled image.

Here’s what I don’t understand:

What is this grid, and what do the contained values signify?

I see that its dimensions make it a “flow field” but how does that connect to upsampling?

The grid contains the normalized coordinates which should be used to interpolate the image.
They are normalized in [-1, 1] and these values are mapped to the “corners” of the input (the “corner” definition depends on the align_corners argument as well).
The docs explain it as:

For each output location output[n, :, h, w], the size-2 vector grid[n, h, w] specifies input pixel locations x and y, which are used to interpolate the output value output[n, :, h, w]. In the case of 5D inputs, grid[n, d, h, w] specifies the x, y, z pixel locations for interpolating output[n, :, d, h, w]. mode argument specifies nearest or bilinear interpolation method to sample the input pixels.

grid specifies the sampling pixel locations normalized by the input spatial dimensions. Therefore, it should have most values in the range of [-1, 1]. For example, values x = -1, y = -1 is the left-top pixel of input, and values x = 1, y = 1 is the right-bottom pixel of input.

If grid has values outside the range of [-1, 1], the corresponding outputs are handled as defined by padding_mode.

I’m sorry, that doesn’t answer the question I’m trying to ask. Let me clarify my current understanding and give context for the question:

The input is a single-channel (for simplicity) image.

Then we have the “grid”, which is a stack of two maps such that: destination(x, y) = stack(xmap(x, y), ymap(x, y)).

This grid tells a pixel of the “input” where to go from its original location (x, y) to some (x_hat, y_hat) in the transformed (upsampled or downsampled) image. For upsampling, the mapping of any point/vector would have to be one-to-many, and there’d have to be some (linear?) combination of pixels in the source image.

My problem is framed in a slightly different context, and the intent of my question is to connect those two contexts. Here’s my use-case:

I sample points on a 3d mesh with a geodetic coordinate system. Via a special transform, I am able to locate the (row/column) associated with that 3d point in several different views of the same object.

The returned row/column may not be an integer value but something in between pixels.

In this case, the input is the image and the grid is a set of (row, column) coordinates that originate from the sampled 3d mesh.

colors = F.grid_sample(image, sample_coordinates)

sample_coordinates is not a grid. It is a set of coordinates with shape:

(1 [batch], num_samples, 1 [dummy], 2 [row, column])

such that the output colors is the interpolated colors at those sample points of those non-integer row/columns (which is the grid argument).

This is already done in our code base and works fine. I am inheriting this code and trying to make a connection in my mind between:

  1. input image pixels and "flow grid".
  2. input image pixels and sample coordinates (occupying the grid variable).

The mental model difference between these two contexts is that in

  1. grid is a map and input is the object being mapped to a higher or lower resolution.
  2. grid is a set of continuous coordinates associated with a discrete scalar field of colors (the input image).

So rephrasing my original question:

What is the mathematical interaction of the input argument and the grid argument?

This will not only help me understand the shape of inputs, but also the unclear (to me) normalization requirements specified here:

I think that writing out that novel helped me get an understanding:

In both cases input is the reference and grid is the set of coordinates that say (I want the value of ‘input’ at these coordinates)

In context 1, we want an entirely resampled image. So we just create an evenly spaced grid of normalized coordinates. In my context, I only want a few points at a time but the concept is the same.
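A small sketch of that “few points at a time” usage (the shape follows the (1, num_samples, 1, 2) layout described above; the three points are arbitrary illustrations): with align_corners=True, (-1, -1) hits the top-left pixel, (0, 0) the interpolated center, and (1, 1) the bottom-right pixel.

```python
import torch
import torch.nn.functional as F

img = torch.arange(16.).view(1, 1, 4, 4)  # values 0..15

# Three normalized (x, y) sample points, shape (1, num_samples, 1, 2)
pts = torch.tensor([[[[-1., -1.]],
                     [[ 0.,  0.]],
                     [[ 1.,  1.]]]])
colors = F.grid_sample(img, pts, align_corners=True)
print(colors.view(-1))  # tensor([ 0.0000,  7.5000, 15.0000])
```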

Note:

Documenting this function in terms of a “flow grid” sounds like someone trying to be smart instead of helpful. I’ve spent hours trying to figure this out because of PyTorch’s convoluted docs.

My entire misunderstanding was centered on the relationship between a flow grid and an input image. Kornia’s docs didn’t help much by using “map” instead of “flow grid” because a flow grid is a map and the word map adds nothing.

The only thing we are mapping is coordinates to pixel values. Those coordinates are framed as the map by the documentation, rather than the thing that is being mapped. The relationship is exactly opposite of what is described in the documents.

Just for completeness, grid_sample() and affine_grid() are inspired by the paper:
