Loss function on cropped images, "requires_grad" issue: leaf variable has been moved into the graph interior

I have some images “X” which I pass through a network that outputs bounding boxes. Based on these bounding boxes I crop “X” and get “cropped X”. The “cropped X” is reduced to grayscale and its mean is computed (per image).

In case the mean is < 0.8, I change it to 0; otherwise it keeps its value. The result is then fed into a SmoothL1Loss, in the hope that I can train the network to output bounding boxes that contain some content and are not blank, i.e. whose grayscale mean is lower than 0.8.
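For reference, the thresholding and loss step described above can be sketched without any in-place writes. This is only an illustration: the name `means` and its values are made up, and `torch.where` is one possible way to express the rule, not necessarily how the original code does it.

```python
import torch

# Hypothetical per-image grayscale means (made-up values for illustration).
means = torch.tensor([0.35, 0.92, 0.85, 0.10], requires_grad=True)

# torch.where returns a new tensor, so `means` itself is never modified in place.
# Means below 0.8 are replaced by 0; the others keep their value.
penalised = torch.where(means < 0.8, torch.zeros_like(means), means)

# SmoothL1Loss against a target of 0, as described in the post.
criterion = torch.nn.SmoothL1Loss()
loss = criterion(penalised, torch.zeros_like(penalised))
loss.backward()

print(means.grad)
```

Entries below the threshold receive zero gradient, so only boxes whose mean is 0.8 or higher contribute to the update.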

Up until the computation of the bounding boxes, I understand the gradient flow. From there on, I would like some advice/insights.

I slice the images “X” based on the bounding boxes, i.e.:
temp_x_2 = temp_x[:, int(bbxes[i, j, 2]):int(bbxes[i, j, 2] + bbxes[i, j, 0]),
int(bbxes[i, j, 3]):int(bbxes[i, j, 3] + bbxes[i, j, 1])]

(A) Do I need to set the “requires_grad” to True for temp_x? Note that bbxes have requires_grad set to True.

Following that, I manually convert them to grayscale and compute the mean, all using torch operations, i.e.:

temp_x_2[0, :, :] = temp_x_2[0, :, :] * 299/1000
temp_x_2[1, :, :] = temp_x_2[1, :, :] * 587/1000
temp_x_2[2, :, :] = temp_x_2[2, :, :] * 114/1000
temp_x_2 = torch.sum(temp_x_2, dim=0)
… = torch.mean(temp_x_2)

(B) Does this look (1) right and (2) efficient? I avoid pillow operations to keep the gradient flow going.

Finally, when I pass the means into SmoothL1Loss with the target set to 0, I get “leaf variable has been moved into the graph interior”. (C) Why is this the case?

Any insights & advice would be greatly appreciated!


  • The indexing operation is not differentiable with respect to the index, so no gradient will flow back to bbxes. Is that expected?
  • The indexing operation is differentiable with respect to the indexed tensor. So if temp_x requires grad, the output will too; otherwise it won’t. You should not set requires_grad by hand.
  • This error occurs because of in-place modifications. I would replace your grayscale conversion with tmp_x_2 = temp_x_2[0, :, :] * 299/1000 + temp_x_2[1, :, :] * 587/1000 + temp_x_2[2, :, :] * 114/1000.
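The three points above can be checked with a small sketch. The tensor names, sizes and box values here are hypothetical stand-ins; only the index layout of the box follows the slicing code in the question.

```python
import torch

# Hypothetical stand-ins: a 3 x H x W image that is part of the graph,
# and one box using the same index layout as the slicing in the question.
image = torch.rand(3, 32, 32, requires_grad=True)
box = torch.tensor([10.0, 12.0, 4.0, 6.0], requires_grad=True)

# int(...) turns the tensor entries into plain Python numbers,
# which cuts the box out of the autograd graph.
start0, start1 = int(box[2]), int(box[3])
len0, len1 = int(box[0]), int(box[1])
crop = image[:, start0:start0 + len0, start1:start1 + len1]

# Out-of-place grayscale conversion: a weighted sum over the channel axis,
# with no writes into `crop`.
weights = torch.tensor([0.299, 0.587, 0.114]).view(3, 1, 1)
gray = (crop * weights).sum(dim=0)
mean = gray.mean()
mean.backward()

print(image.grad is not None)  # gradient reaches the indexed tensor
print(box.grad)                # no gradient reaches the box
```

The image receives a gradient inside the cropped region and zeros elsewhere, while `box.grad` stays `None` because the box only entered the computation as integer indices.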

Using affine_grid and grid_sample will give you behaviour like a spatial transformer network, which works well for some use cases.
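A minimal sketch of that idea, assuming the network outputs each box as a centre and half-size in normalised [-1, 1] coordinates. That parameterisation, the tensor names and the sizes are all assumptions made for illustration, not something taken from the original code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
images = torch.rand(2, 3, 64, 64)  # N x C x H x W batch (made-up data)

# Hypothetical box parameters per image: centre (cx, cy) and half-size (sx, sy),
# all in normalised [-1, 1] coordinates. In practice these would come from the network.
params = torch.tensor([[0.0, 0.0, 0.5, 0.5],
                       [0.2, -0.1, 0.3, 0.4]], requires_grad=True)
cx, cy, sx, sy = params.unbind(dim=1)
zeros = torch.zeros_like(cx)

# theta maps output grid coordinates to input coordinates:
# x_in = sx * x_out + cx,  y_in = sy * y_out + cy
theta = torch.stack([
    torch.stack([sx, zeros, cx], dim=1),
    torch.stack([zeros, sy, cy], dim=1),
], dim=1)  # N x 2 x 3

# Sample a fixed-size 32 x 32 crop per image with bilinear interpolation.
grid = F.affine_grid(theta, size=(2, 3, 32, 32), align_corners=False)
crops = F.grid_sample(images, grid, mode="bilinear", align_corners=False)

# Grayscale mean per crop, computed out of place.
weights = torch.tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)
means = (crops * weights).sum(dim=1).mean(dim=(1, 2))
means.sum().backward()

print(params.grad)  # gradients now reach the box parameters
```

Because the crop is produced by interpolating at real-valued sampling locations rather than by integer slicing, the box parameters stay in the graph and receive gradients.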

Learning bounding boxes is a hard problem, especially if you don’t have ground-truth bounding boxes to learn from.
You may want to look at the different attention mechanisms that exist for ideas.

Of course. Thanks a lot, you’ve just saved me a couple of days with that remark!