Receiving 'nan' parameters after first optimization step

muhammadirfanzafar · January 28, 2021, 2:35pm

I am using a 5-layer fully connected neural network with tanh() activations. I am using it in a PINN model, which has worked fine several times before, but not this time.

When I use torch.autograd.set_detect_anomaly(True), the following error message appears.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-21-268a4603ec9c> in <module>
----> 1 loss.backward()
      2 optimizer.step()

~/.conda/envs/nr_powerai36/lib/python3.7/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    193                 products. Defaults to ``False``.
    194         """
--> 195         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    196 
    197     def register_hook(self, hook):

~/.conda/envs/nr_powerai36/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     97     Variable._execution_engine.run_backward(
     98         tensors, grad_tensors, retain_graph, create_graph,
---> 99         allow_unreachable=True)  # allow_unreachable flag
    100 
    101 

RuntimeError: Function 'ReciprocalBackward' returned nan values in its 0th output.

I am not sure how to tackle this error. What exactly is the ‘ReciprocalBackward’ function?

ptrblck · January 30, 2021, 8:29am

ReciprocalBackward points to a division by a tensor (1/x is recorded in the graph as a reciprocal):

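import torch

x = torch.randn(1, requires_grad=True)
y1 = 1 / x                 # division by a tensor
y2 = torch.reciprocal(x)   # same operation, written explicitly

print(y1 == y2)
# tensor([True])
print(y2)                  # the value varies per run, since x is random
# tensor([-1.2178], grad_fn=<ReciprocalBackward>)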

I assume the tensor used as the divisor might be zero (or close to zero), which would yield Inf as the output and thus also an invalid gradient.
Could you check your model for these operations and make sure the values involved are in a reasonable range?
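
To make the failure mode concrete, here is a minimal sketch of how a zero divisor can turn into a NaN gradient, plus one possible guard. The model from the question is not shown, so everything here is an assumption: the tanh only mirrors the activation mentioned above, and `safe_reciprocal` and `eps` are hypothetical names, not part of PyTorch.

```python
import torch


def safe_reciprocal(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Hypothetical helper: 1/x with |x| kept at least eps away from zero."""
    # Preserve the sign of x; exact zeros are treated as positive.
    sign = torch.where(x < 0, -torch.ones_like(x), torch.ones_like(x))
    return 1.0 / (sign * x.abs().clamp(min=eps))


# Unguarded division: the forward pass looks harmless, the backward pass does not.
x = torch.zeros(1, requires_grad=True)
y = torch.tanh(1.0 / x)        # 1/0 = inf, tanh(inf) = 1, so y is finite
y.sum().backward()             # tanh passes back 0, reciprocal multiplies it by inf
unsafe_grad = x.grad.clone()   # 0 * inf = nan

# Guarded division: the same computation with the divisor clamped away from zero.
x2 = torch.zeros(1, requires_grad=True)
y2 = torch.tanh(safe_reciprocal(x2))
y2.sum().backward()
safe_grad = x2.grad.clone()    # finite

print(torch.isnan(unsafe_grad).any(), torch.isfinite(safe_grad).all())
```

Note that the clamp changes the result for divisors smaller than `eps` in magnitude, so whether it is acceptable depends on what the division represents in the loss. Printing the minimum absolute value of each divisor before the backward pass is a cheaper first check.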