Backward pass of Sobel filter loss gives NaN

Hi! Thanks in advance to anyone who takes a look at my problem.

Here is my custom loss function:

```python
import torch
import torch.nn.functional as F

def mean_gradient_error(outputs, targets, weight=1e-4):
    # Sobel kernels, created on the same dtype/device as the input
    filter_x = torch.tensor([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=outputs.dtype, device=outputs.device)
    filter_y = torch.tensor([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=outputs.dtype, device=outputs.device)

    # shape (1, 3, 3, 3): one output channel over the three input channels
    filter_x = filter_x.unsqueeze(0).expand(3, -1, -1).unsqueeze(0)
    filter_y = filter_y.unsqueeze(0).expand(3, -1, -1).unsqueeze(0)

    output_gradient_x = F.conv2d(outputs, filter_x)
    output_gradient_y = F.conv2d(outputs, filter_y)
    target_gradient_x = F.conv2d(targets, filter_x)
    target_gradient_y = F.conv2d(targets, filter_y)

    sq_output_gradient_x = torch.square(output_gradient_x)
    sq_output_gradient_y = torch.square(output_gradient_y)
    print('sq_y', sq_output_gradient_y.min().item(), sq_output_gradient_y.max().item())
    sq_output_gradient_y.register_hook(lambda y: print('sq_y back', y.min().item(), y.max().item()))

    output_magnitude = torch.sqrt(sq_output_gradient_x + sq_output_gradient_y)
    target_magnitude = torch.sqrt(torch.square(target_gradient_x) + torch.square(target_gradient_y))

    # difference of gradient magnitudes, averaged over the spatial dimensions
    shape = outputs.shape[-2:]
    diff = torch.abs(output_magnitude - target_magnitude)
    mge = torch.sum(diff / (shape[0] * shape[1]))

    return mge * weight
```

Following this conversation, I added prints of the values for every calculation and narrowed it down: the error occurs between the backward pass of `torch.sqrt(sq_output_gradient_x + sq_output_gradient_y)` and that of `torch.square(output_gradient_y)`; I left the prints for both in the code above.

Here are the prints from the last epochs:

```
[...]
sq_y 5.115907697472721e-13 1121.9591064453125
sq_y back -1.497764174018812e-06 1.525878978725359e-09

sq_y 5.1159076974727213e-11 1134.2796630859375
sq_y back -3.1196209420158993e-06 1.525878978725359e-09

sq_y 0.0 919.6836547851562
sq_y back -inf 1.525878978725359e-09
```

As you can see, the backward pass of `torch.square` produces `-inf`, and I don’t know why.

Here are the autograd error and the Python traceback:

```
[...]
if await self.run_code(code, result, async_=asy):
File "/home/adrian/.miniforge3/envs/torch/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "/tmp/ipykernel_34168/3323993857.py", line 15, in <module>
loss, info = train_step(model, mse_loss, scaler, small_img, img)
File "/tmp/ipykernel_34168/187903903.py", line 3, in train_step
_, loss, info = autocast_forward(model, criterion, small_img, img)
File "/tmp/ipykernel_34168/1175262128.py", line 8, in autocast_forward
File "/tmp/ipykernel_34168/3758660440.py", line 14, in mean_gradient_error
return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
```
```
[...]
File ~/.miniforge3/envs/torch/lib/python3.12/site-packages/torch/autograd/graph.py:744, in _engine_run_backward(t_outputs, *args, **kwargs)
742     unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
743 try:
--> 744     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
745         t_outputs, *args, **kwargs
746     )  # Calls into the C++ engine to run the backward pass
747 finally:
748     if attach_logging_hooks:

RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.
```

I already tried gradient clipping to norm `1.0`, but it didn’t change anything. I assume that my loss function is somehow numerically unstable, but I don’t know how to stabilize it.

It seems that your function crashes after raising a tensor to a given power. One thing you can do is run your code within a `torch.autograd.set_detect_anomaly` context manager (docs here: Automatic differentiation package - torch.autograd — PyTorch 2.4 documentation); that will point you to the line that is producing the `NaN` value.
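For reference, the same error can be reproduced on a minimal `sqrt(square(x))` chain (a sketch, not the original training code): at `x = 0`, `sqrt`’s backward emits `inf`, and multiplying it by `square`’s local gradient (`2x = 0`) yields `NaN`, which anomaly mode reports:

```python
import torch

x = torch.zeros(3, requires_grad=True)
with torch.autograd.set_detect_anomaly(True):
    loss = torch.sqrt(torch.square(x)).sum()
    try:
        loss.backward()
    except RuntimeError as e:
        # e.g. "Function 'PowBackward0' returned nan values in its 0th output."
        print(e)
```

Anomaly mode also prints the traceback of the forward call that created the offending node, which is how it maps the error back to a source line.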

I forgot to mention that these error logs come from enabling anomaly detection. In the autograd logs it points to `sq_output_gradient_y = torch.square(output_gradient_y)`, which matches the print statements: this is where the `-inf` was produced.

I don’t know why, but adding `1e-16` inside `torch.square`

`sq_output_gradient_y = torch.square(output_gradient_y + 1e-16)`

seems to make it stable. I thought that maybe this would be enough to push the value away from `-inf`, and it worked. (It even works without any gradient clipping.)

I’m not marking the question as solved, as I don’t understand why this is needed for numerical stability: squaring a number should never produce negative infinity.
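For what it’s worth, the `inf` does not come from the squaring itself: it is the gradient that `torch.sqrt` sends back *into* the square node during the backward pass, since `d sqrt(u)/du = 1/(2*sqrt(u))` blows up as `u → 0`. A small sketch with the same style of hook as in the loss above:

```python
import torch

x = torch.zeros(2, requires_grad=True)
s = torch.square(x)
# shows what flows INTO square's backward from sqrt's backward
s.register_hook(lambda g: print('grad into square:', g))  # tensor([inf, inf])
torch.sqrt(s).sum().backward()
print(x.grad)  # inf * 2x = inf * 0 = nan
```

So `square`’s backward is only where the `inf` turns into `NaN`; the singularity lives in `sqrt` at zero.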

Unfortunately, this “solution” doesn’t work with mixed precision. No matter the `weight` value or the value added inside `torch.square`, it instantly gives `NaN` in the first forward pass.
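A likely reason the `+1e-16` trick breaks under autocast: `1e-16` underflows to `0` in `float16` (whose smallest subnormal is about `6e-8`), so the `sqrt(0)` singularity comes back; on top of that, squaring Sobel responses around `1000` (as in the logs above) overflows `float16`’s maximum of `65504` in the forward pass. A common workaround, sketched here with an illustrative `eps` value and helper name, is to put the epsilon inside the `sqrt` and compute this part of the loss in `float32`:

```python
import torch

def safe_magnitude(gx, gy, eps=1e-6):
    # compute in float32 even under autocast: avoids fp16 overflow of the squares
    gx, gy = gx.float(), gy.float()
    # eps inside the sqrt keeps d sqrt(u)/du = 1/(2*sqrt(u)) finite at u = 0
    return torch.sqrt(torch.square(gx) + torch.square(gy) + eps)

x = torch.zeros(4, requires_grad=True)
safe_magnitude(x, x).sum().backward()
print(x.grad)  # zeros, no nan
```

Unlike `square(x + 1e-16)`, the epsilon here sits under the `sqrt` itself, so the derivative stays bounded regardless of the input values.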