NaN loss coming after some time

Hi @tom
Thanks for your reply.
Can you also suggest how I can check whether the gradients are becoming NaN in the first place, and how I can apply gradient clipping with an SGD or AdaGrad optimizer?

Thanks

I tried doing it, but how should I choose the mean and standard deviation for the normalization?

Thanks

You could use a normalization layer. Alternatively, you can try dividing by some constant first (perhaps the maximum value of your data?). The idea is to get the values small enough that they don’t cause really large gradients.
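For example, a minimal sketch (the tensor and the statistics are placeholders; in practice the mean/std would be computed from your training set):

import torch

x = torch.rand(16, 3, 224, 224) * 255.0       # placeholder input with large raw values

# option 1: scale by a constant, e.g. the maximum of the data
x_scaled = x / x.max()

# option 2: standardize with mean and std of the training data
mean, std = x.mean(), x.std()
x_norm = (x - mean) / (std + 1e-8)             # small eps guards against division by zero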

Hi, I am facing the same issue. I am wondering if you have solved it?

Have you tried the suggestions from this thread and is nothing working?
If so, when do you see the first NaN value?
Could you additionally check your input for NaN values?

Thanks for the reply; none of the suggestions in this thread worked for me.
I finally solved my issue with your suggestion in another thread: Getting Nan after first iteration with custom loss.

Here is a way of debugging the NaN problem.
First, print your model’s gradients, because NaNs are likely to show up there first.
Then check the loss, and then the input of your loss… Just follow the clues and you will find the bug causing the NaN problem.
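A minimal sketch of such a gradient check (the model here is just a placeholder to show the mechanics):

import torch
import torch.nn as nn

def report_bad_grads(model):
    # print the parameters whose gradients contain NaN or Inf after backward()
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite gradient in {name}")

model = nn.Linear(10, 1)                      # placeholder model
loss = model(torch.randn(4, 10)).mean()
loss.backward()
report_bad_grads(model)                       # prints nothing if all gradients are finite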

Some useful pointers on why the NaN problem can happen:
1. a too-high learning rate
2. sqrt(0): the value is 0, but its gradient blows up (see the snippet below)
3. ReLU -> LeakyReLU (replacing ReLU with LeakyReLU can help)
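To illustrate point 2 with a standalone snippet (not from the original poster): torch.sqrt(0) itself is 0, but its derivative 1/(2*sqrt(x)) is infinite at x = 0, so the backward pass produces Inf, which can later turn into NaN:

import torch

x = torch.tensor(0., requires_grad=True)
y = torch.sqrt(x)                 # forward value is 0.0, not NaN
y.backward()
print(y.item(), x.grad.item())    # 0.0 inf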

Why does sqrt(0) give a NaN? Shouldn’t it be equal to 0?

I have 100 folders of different class images and I am getting a NaN loss value for some folders. I already checked for grayscale images, truncated files, missing labels, etc., and everything is fine, but I am still getting a NaN loss. What could be the possible reason?

Do you get a NaN output from your model when you use samples from certain folders, or what exactly do you mean?

If your model is returning NaNs, you could set torch.autograd.set_detect_anomaly(True) at the beginning of your script to get a stack trace, which would hopefully point to the operation that is creating the NaNs.
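A standalone toy example of what that looks like (not the poster's model; sqrt of a negative value is just used here to force a NaN):

import torch

torch.autograd.set_detect_anomaly(True)   # enable once at the beginning of the script

x = torch.tensor(-1.0, requires_grad=True)
y = torch.sqrt(x)     # NaN in the forward pass
y.backward()          # raises a RuntimeError whose traceback points at the sqrt call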

I am getting a NaN loss after the 1st epoch on a large dataset. Please tell me all possible reasons for a NaN loss value.
Check this:

dict_values([tensor(5.5172, device='cuda:0', grad_fn=<NllLossBackward>), tensor(nan, device='cuda:0', grad_fn=<DivBackward0>), tensor(3.7665, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), tensor(inf, device='cuda:0', grad_fn=<DivBackward0>)])

NaN values can be created by invalid operations, such as torch.log(torch.tensor(-1.)), by operations executed on Infs (created through over-/underflows) etc.
To isolate it, use torch.autograd.set_detect_anomaly(True).
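A few standalone examples of operations that create NaN or Inf values:

import torch

print(torch.log(torch.tensor(-1.)))    # tensor(nan): invalid operation
print(torch.exp(torch.tensor(100.)))   # tensor(inf): overflow in float32
print(torch.tensor(float('inf')) - torch.tensor(float('inf')))  # tensor(nan): arithmetic on Infs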

Thanks for the reply.
Can I just use gradient clipping? If yes, how can I choose the clip value?

If larger gradient magnitudes are expected and would thus create invalid values, you might clip the gradients. You could start with a max norm value of 1 or refer to a paper that uses a similar approach.
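A minimal sketch of clipping with SGD (the model, data, and max_norm value are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(4, 10), torch.randn(4, 1)               # placeholder batch

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before the step
optimizer.step()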

Note, however, that FloatTensors (float32) have a maximum value of:

print(torch.finfo().max)
> 3.4028234663852886e+38

so you should make sure that the NaNs are not created by an invalid operation.

By using torch.autograd.set_detect_anomaly(True) I found the error below. My dataset seems OK, so to resolve this issue should I use gradient clipping, or just ignore NaN values using torch.isnan(x)?

RuntimeError: Function 'SmoothL1LossBackward' returned nan values in its 0th output

I would recommend trying to figure out what is causing the NaNs instead of ignoring them.
Based on the raised error, the loss function either created the NaNs itself or received them through its input.

To isolate it, you could try to make the script deterministic by following the reproducibility docs. Once it’s deterministic and you can trigger the NaNs in a single step, you can check the parameters, inputs, gradients, etc. for the iteration that causes the NaNs.
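A minimal sketch of the usual settings from the reproducibility docs (the seed is a placeholder; the exact flags depend on your PyTorch version, and on CUDA the docs also mention the CUBLAS_WORKSPACE_CONFIG environment variable):

import random
import numpy as np
import torch

seed = 0
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)                   # seeds the CPU and CUDA RNGs
torch.cuda.manual_seed_all(seed)

torch.backends.cudnn.benchmark = False    # disable non-deterministic autotuning
torch.use_deterministic_algorithms(True)  # error out on known non-deterministic ops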

I don’t know if this applies to this case, and I made sure nothing is wrong with my data, but I see NaNs after some time when I use RMSprop and not with Adam. Maybe try changing your optimizer? A similar experience has been shared for Keras as well.

I’m also encountering a similar problem with my model. After a few iterations of training on graph data, the loss, which is the MSELoss between the returned output and a fixed label, becomes NaN.

Model:

from torch.nn.parameter import Parameter
from torch.nn.modules.module import Module
import torch.optim as optim
import torch.nn.functional as F
import torch.nn as nn
import networkx as nx
from torch_geometric.nn import GCNConv
from torch_geometric.data import Data

class Model(nn.Module):
    def __init__(self, nin, nhid1, nout, inp_l, hid_l, out_l=1):
        super(Model, self).__init__()

        self.g1 = GCNConv(in_channels=nin, out_channels=nhid1)
        self.g2 = GCNConv(in_channels=nhid1, out_channels=nout)
        self.dropout = 0.5
        self.lay1 = nn.Linear(inp_l, hid_l)
        self.lay2 = nn.Linear(hid_l, out_l)

    def forward(self, x, adj):
        x = F.relu(self.g1(x, adj))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.g2(x, adj)
        
        x = self.lay1(x)
        x = F.relu(x)
        x = self.lay2(x)
        x = F.relu(x)
        
        return x

The inputs to the model:

x (Tensor, optional) – Node feature matrix with shape [num_nodes, num_node_features].
edge_index (LongTensor, optional) – Graph connectivity in COO format with shape [2, num_edges].

Here num_nodes = 1000, num_node_features = 1, num_edges = 5000.

[GCNConv](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv) is a graph embedding layer that returns a [num_nodes, out_channels] matrix.
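For reference, a minimal sketch of how those shapes fit together with the Model class above (the hidden sizes are placeholders, not the poster's actual values; note that inp_l has to match nout):

import torch

model = Model(nin=1, nhid1=16, nout=8, inp_l=8, hid_l=4)

x = torch.rand(1000, 1)                         # [num_nodes, num_node_features]
edge_index = torch.randint(0, 1000, (2, 5000))  # [2, num_edges] in COO format

out = model(x, edge_index)                      # [num_nodes, out_l] = [1000, 1]
print(out.shape)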

I would recommend checking the input and targets for invalid values first.
If there are no invalid values, then observe the loss and check if it’s blowing up (if so, reduce the learning rate). If that’s also not the case, then you should check all intermediate activations to track down the operation that creates the NaNs, e.g. via forward hooks (see the sketch below).
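A minimal sketch of such a forward hook (the model here is a placeholder to show the mechanics; you would register the hook on your own modules):

import torch
import torch.nn as nn

def nan_hook(module, inputs, output):
    # flag any module whose output contains NaN or Inf
    if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
        print(f"non-finite output in {module.__class__.__name__}")

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))  # placeholder model
for module in model.modules():
    module.register_forward_hook(nan_hook)

out = model(torch.randn(4, 10))   # hooks run on every forward pass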

The inputs don’t have any invalid values and the loss is also close to 0. Can you please elaborate on checking the intermediate activations and how exactly to detect the issue there? Thanks.