# Autograd problem when designing a custom loss

Hello everybody,

I am having a hard time trying to design a loss function that applies a Sobel filter to the batches before computing the MSE. I am fairly sure the problem is related to an “autograd computational graph detachment”, but I just cannot solve it.

Here is my code. Can anyone see what I am missing?

```python
import torch
import torch.nn.functional as F

def sobel_MSE(output, target):
    # Sobel kernels for the horizontal and vertical gradients
    dx = torch.tensor([[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]], requires_grad=True).float()
    dy = torch.tensor([[1.0, 2.0, 1.0], [0.0, 0.0, 0.0], [-1.0, -2.0, -1.0]], requires_grad=True).float()

    dx = dx.cuda()
    dy = dy.cuda()

    dx = dx.view((1, 1, 3, 3))
    dy = dy.view((1, 1, 3, 3))

    # Sobel responses of the prediction and the target
    # (assumes single-channel inputs; padding=1 keeps the spatial size)
    doutdx = F.conv2d(output, dx, padding=1)
    doutdy = F.conv2d(output, dy, padding=1)
    dtardx = F.conv2d(target, dx, padding=1)
    dtardy = F.conv2d(target, dy, padding=1)

    # gradient magnitudes of both images
    dout = torch.sqrt(torch.pow(doutdx, 2) + torch.pow(doutdy, 2))
    dtar = torch.sqrt(torch.pow(dtardx, 2) + torch.pow(dtardy, 2))

    out = torch.mean(torch.pow(dout - dtar, 2))

    return out
```

Would you like to train `dx` and `dy` (since you’ve set their `requires_grad` attribute to `True`)?
If so, you shouldn’t overwrite them in these lines of code, but would need to use a new variable name and pass the original `dx` and `dy` to the optimizer:

```python
dx = dx.cuda()
dy = dy.cuda()

dx = dx.view((1, 1, 3, 3))
dy = dy.view((1, 1, 3, 3))
```
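A rough sketch of what that could look like (the `model`, the optimizer choice, and the `dx_kernel`/`dy_kernel` names are placeholders here, not from your code):

```python
# create the kernels once as leaf tensors that autograd can update
dx = torch.tensor([[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]], requires_grad=True)
dy = torch.tensor([[1.0, 2.0, 1.0], [0.0, 0.0, 0.0], [-1.0, -2.0, -1.0]], requires_grad=True)

# pass the original leaves to the optimizer together with the model parameters
optimizer = torch.optim.Adam(list(model.parameters()) + [dx, dy], lr=1e-3)

# inside the loss, use new names so the leaves are not overwritten;
# gradients will still flow back to the CPU leaves through .cuda() and .view()
dx_kernel = dx.cuda().view(1, 1, 3, 3)
dy_kernel = dy.cuda().view(1, 1, 3, 3)
```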

On the other hand, if you don’t want to train these tensors, you could remove the `requires_grad` setting; as long as `output` is attached to the graph, the code should work.
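A minimal sketch of this static variant, creating the kernels directly on the same device as `output` and without `requires_grad`:

```python
# constant Sobel kernels; gradients still flow through `output` in the conv
dx = torch.tensor([[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]],
                  device=output.device).view(1, 1, 3, 3)
dy = torch.tensor([[1.0, 2.0, 1.0], [0.0, 0.0, 0.0], [-1.0, -2.0, -1.0]],
                  device=output.device).view(1, 1, 3, 3)
```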
Are you seeing any errors or unexpected behavior?


Thank you @ptrblck for your reply.

I removed the `requires_grad` setting to make the Sobel filter static. The problem is that when I run the code, the network output becomes NaN after the first `loss.backward()`. I believe there is some problem when computing the gradients.

If I change the loss to a plain MSE:

```python
def my_mse(output, target):
    out = torch.mean(torch.pow(output - target, 2))
    return out
```

everything works fine, so I suppose there is no problem in the rest of the code.

Do you have any idea about what I am missing?

Thanks a lot

Could you rerun the code with `torch.autograd.set_detect_anomaly(True)` at the beginning of the script and post the stack trace here?
Based on the description I assume the loss does not contain any invalid values?
If that’s the case, could you check all gradients after the first `backward()` pass for NaN values?
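For reference, one way to run that check after the backward pass (assuming the network is called `model`; the name is a placeholder):

```python
loss.backward()

# inspect every parameter gradient for NaN values
for name, param in model.named_parameters():
    if param.grad is not None and torch.isnan(param.grad).any():
        print(f"NaN gradient in {name}")
```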

Thanks again @ptrblck!

By enabling anomaly detection, I found the following stack trace:

```
..\torch\csrc\autograd\python_anomaly_mode.cpp:57: UserWarning: Traceback of forward call that caused the error:
File "C:/Users/Luis/PycharmProjects/unet-googlenet/train.py", line 291, in <module>
train()
File "C:/Users/Luis/PycharmProjects/unet-googlenet/train.py", line 214, in train
loss = sobel_MSE(outputs, labels)
File "C:/Users/Luis/PycharmProjects/unet-googlenet/train.py", line 142, in sobel_MSE
dout = torch.sqrt(torch.pow(doutdx, 2) + torch.pow(doutdy, 2))

Traceback (most recent call last):
File "C:/Users/Luis/PycharmProjects/unet-googlenet/train.py", line 291, in <module>
train()
File "C:/Users/Luis/PycharmProjects/unet-googlenet/train.py", line 216, in train
loss.backward()
File "C:\Users\Luis\Anaconda3\envs\unet-goolenet\lib\site-packages\torch\tensor.py", line 166, in backward
File "C:\Users\Luis\Anaconda3\envs\unet-goolenet\lib\site-packages\torch\autograd\__init__.py", line 99, in backward
allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'SqrtBackward' returned nan values in its 0th output.
```

Does it mean that the gradient calculated by `torch.sqrt` was NaN at the following line?

`dout = torch.sqrt(torch.pow(doutdx, 2) + torch.pow(doutdy, 2))`

Yes!

I replaced

`dout = torch.sqrt(torch.pow(doutdx, 2) + torch.pow(doutdy, 2))`

by

`dout = (torch.pow(doutdx, 2) + torch.pow(doutdy, 2))`

and now the backward pass seems to compute fine! By removing the sqrt we do not lose the edge-detection semantics, but now I am curious to understand why this is happening.

Do you have any idea?

`torch.sqrt()` would give you `NaN` gradients for negative inputs, which shouldn’t be possible given your code snippet, and an `Inf` gradient for a zero input, which might well be happening in your use case:

```python
x = torch.tensor([0.], requires_grad=True)
y = torch.sqrt(x)
y.backward()
print(x.grad)  # tensor([inf])
```

You could add a small `eps` value to the `sqrt` op to avoid this. However, note that this might blow up your gradients due to the behavior of the derivative of the `sqrt`.
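A minimal sketch of that `eps` workaround inside the loss (the `1e-8` value is an arbitrary choice, not from the original code):

```python
eps = 1e-8  # small constant to keep the sqrt input strictly positive

# the derivative of sqrt(x) is 1 / (2 * sqrt(x)), so eps bounds it at 1 / (2 * sqrt(eps))
dout = torch.sqrt(torch.pow(doutdx, 2) + torch.pow(doutdy, 2) + eps)
dtar = torch.sqrt(torch.pow(dtardx, 2) + torch.pow(dtardy, 2) + eps)
```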