Loss becomes NaN after a few iterations

# layers
self.hidden_layer_1 = torch.nn.Linear(self.input_neurons, self.hidden_neurons_1)
self.hidden_layer_2 = torch.nn.Linear(self.hidden_neurons_1, self.hidden_neurons_2)
self.output = torch.nn.Linear(self.hidden_neurons_2, self.output_neurons)
self.ReLU = torch.nn.ReLU()  # activation used in forward()
def forward(self, input):
        print(input.isnan().any()) # False
        H1 = self.hidden_layer_1(input)
        H1 = self.ReLU(H1)
        H2 = self.hidden_layer_2(H1)
        H2 = self.ReLU(H2)
        final_inputs = self.output(H2)
        # not applying activation on final_inputs because CrossEntropy does that
        return final_inputs

loss = self.CrossEntropyLoss(output, target)

I am using the SGD optimizer with LR = 1e-2. I don't understand why the loss becomes NaN after 4-5 iterations of the epoch. Previously, when I was using just one hidden layer, the loss was always finite.

When I use sigmoid instead of ReLU, the loss stays finite.

Basically your model is diverging. Once the loss becomes NaN, backpropagation "infects" the weights with NaNs.

I would bet your loss is increasing rather than decreasing.
If that's the case, try reducing your learning rate.

The loss is actually decreasing. So what could be the reason I'm getting NaN after a few iterations when using ReLU instead of sigmoid for the hidden layers?

@jpj There is an awesome PyTorch feature that lets you know where the NaN is coming from!

Documentation: Anomaly Detection
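For reference, a hypothetical minimal example of how anomaly detection surfaces the offending op (not the poster's model; `sqrt` of a negative value stands in for whatever produces the NaN):

```python
import torch

# With anomaly detection enabled, the backward pass raises a RuntimeError
# naming the backward function that produced NaN values.
torch.autograd.set_detect_anomaly(True)

x = torch.tensor([-1.0], requires_grad=True)
y = torch.sqrt(x)          # sqrt of a negative number -> NaN in the forward pass
try:
    y.sum().backward()
    err_msg = ""
except RuntimeError as err:
    err_msg = str(err)     # e.g. "Function 'SqrtBackward0' returned nan values in its 0th output."
print(err_msg)
```

The error message points at the forward operation to inspect, which is exactly how the `LogSoftmaxBackward` message below was obtained.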

Hi, I did do that. I was getting "LogSoftmaxBackward returned nan in 0th output", but I don't understand the reason: as you can see from my code, the input had no NaN values.

Can you print a few of the final_inputs and post here?

print(final_inputs, final_inputs.isnan().any())

tensor([[ 4.5512e+03,  2.2965e+03,  3.7338e+03, -1.1410e+03],
        [-1.2685e+04,  2.1673e+05,  2.7743e+05,  2.2019e+05],
        [ 1.3482e+03,  4.0191e+04,  1.5236e+04,  8.1835e+03],
        ...,
        [ 4.0465e+04,  2.3288e+04,  2.5682e+04,  6.2736e+03],
        [ 7.3675e+05, -9.9548e+05, -1.4421e+06,  1.7742e+04],
        [ 1.3802e+03,  9.2324e+02,  9.4439e+02,  3.3211e+01]],
       grad_fn=<AddmmBackward>) tensor(False)
Epoch 0, Iteration: 0.000, Loss:72828.328
tensor([[ 4.8688e+17, -5.5222e+16,  2.8847e+17, -7.2008e+17],
        [ 4.8875e+17, -5.5577e+16,  2.8895e+17, -7.2208e+17],
        [ 2.8952e+19, -3.2968e+18,  1.7122e+19, -4.2774e+19],
        ...,
        [ 2.7599e+15, -3.1438e+14,  1.6323e+15, -4.0776e+15],
        [ 6.9775e+18, -7.9428e+17,  4.1225e+18, -1.0305e+19],
        [ 2.7847e+19, -3.1713e+18,  1.6469e+19, -4.1142e+19]],
       grad_fn=<AddmmBackward>) tensor(False)
Epoch 0, Iteration: 1.000, Loss:479207024681287680.000
tensor([[-7.4644e+29,  9.6666e+25,  7.4367e+29,  2.6739e+27],
        [-2.0749e+33,  2.6870e+29,  2.0672e+33,  7.4327e+30],
        [-1.8951e+30,  2.4542e+26,  1.8880e+30,  6.7886e+27],
        ...,
        [-1.2698e+30,  1.6444e+26,  1.2650e+30,  4.5486e+27],
        [-1.7603e+33,  2.2796e+29,  1.7537e+33,  6.3057e+30],
        [-2.1945e+30,  2.8420e+26,  2.1864e+30,  7.8614e+27]],
       grad_fn=<AddmmBackward>) tensor(False)
Epoch 0, Iteration: 2.000, Loss:301857268183970173654388944404480.000
tensor([[ 1.2171e+14,  2.3199e+10, -1.2212e+14,  3.9526e+11],
        [ 1.2171e+14,  2.3199e+10, -1.2212e+14,  3.9526e+11],
        [ 1.2171e+14,  2.3199e+10, -1.2212e+14,  3.9526e+11],
        ...,
        [ 1.2171e+14,  2.3199e+10, -1.2212e+14,  3.9526e+11],
        [ 1.2171e+14,  2.3199e+10, -1.2212e+14,  3.9526e+11],
        [ 1.2171e+14,  2.3199e+10, -1.2212e+14,  3.9526e+11]],
       grad_fn=<AddmmBackward>) tensor(False)
Epoch 0, Iteration: 3.000, Loss:67027314147328.000
tensor([[-8.5834e+19,  1.3941e+16,  8.5509e+19,  3.1127e+17],
        [-8.5834e+19,  1.3941e+16,  8.5509e+19,  3.1127e+17],
        [-8.5834e+19,  1.3941e+16,  8.5509e+19,  3.1127e+17],
        ...,
        [-8.5834e+19,  1.3941e+16,  8.5509e+19,  3.1127e+17],
        [-8.5834e+19,  1.3941e+16,  8.5509e+19,  3.1127e+17],
        [-8.5834e+19,  1.3941e+16,  8.5509e+19,  3.1127e+17]],
       grad_fn=<AddmmBackward>) tensor(False)

After this point, I get this: RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

Can you also print final_inputs.shape?

@jpj Your inputs don't have to be NaN for you to get NaN errors. LogSoftmax involves an exponential term, a sum in the denominator, and a log term.

For example, the exponential term overflows for very large inputs and underflows for very negative inputs. Once an inf enters the computation (e.g. inf/inf in the normalization), the result shows up as NaN.
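To illustrate the mechanism with plain floats (a sketch; PyTorch's `log_softmax` applies the same max-subtraction internally rather than raising an error): a logit of ~1000, like the e+19 values printed above, blows up a naive softmax denominator, while the log-sum-exp trick keeps it finite.

```python
import math

logits = [1000.0, 0.0]  # hypothetical very large logit, as in the printed outputs

# Naive softmax denominator: exp(1000) overflows a double
try:
    naive_denom = sum(math.exp(z) for z in logits)
except OverflowError:
    naive_denom = None
print(naive_denom)  # None: the naive computation overflowed

# Log-sum-exp trick: subtract the max logit before exponentiating
m = max(logits)
log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
log_softmax_0 = logits[0] - log_denom
print(log_softmax_0)  # finite: the dominant logit gets log-probability ~0
```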

Is that the case here @ptrblck ?

It could be the case.
However, in this particular use case the model is clearly diverging, as the loss explodes, so I would try to stabilize it first.


Looks like you have an exploding gradients problem. If you are using the LogSoftmax function, you may want to clamp its input via softmaxoutput.clamp(min=1e-8) before any further processing.


I am using CrossEntropyLoss, which applies LogSoftmax itself. Should I still clamp the outputs beforehand with softmaxoutput.clamp(min=1e-8)?

How can I stabilize it?

You could try training with smaller learning rates e.g. 1e-3, 1e-4, 1e-5, … (as @JuanFMontesinos had already mentioned)

You could try to put it into your model before it returns the outputs (note that clamp is not in-place, so assign the result):

outputs = outputs.clamp(min=1e-8)
return outputs

Let me know if that works.

As mentioned, smaller learning rates might help, and/or you could try to decrease the regularization strength, e.g. the weight decay if you are using it.