NaN loss coming after some time

I have 100 folders of images from different classes, and I am getting a NaN loss value for some folders. I have already checked for grayscale images, truncated files, missing labels, etc., and everything looks fine, but I am still getting a NaN loss. What could be the possible reasons?

Do you get a NaN output from your model when using samples from certain folders, or what do you mean by getting a NaN loss "in some folders"?

If your model is returning NaNs, you could set torch.autograd.set_detect_anomaly(True) at the beginning of your script to get a stack trace, which would hopefully point to the operation that is creating the NaNs.

I am getting a NaN loss after the 1st epoch on a large dataset. Please tell me all the possible reasons for a NaN loss value.
Check this:

dict_values([tensor(5.5172, device='cuda:0', grad_fn=<NllLossBackward>), tensor(nan, device='cuda:0', grad_fn=<DivBackward0>), tensor(3.7665, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), tensor(inf, device='cuda:0', grad_fn=<DivBackward0>)])

NaN values can be created by invalid operations, such as torch.log(torch.tensor(-1.)), by operations executed on Infs (created through over- or underflows), etc.
To isolate the issue, use torch.autograd.set_detect_anomaly(True).
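
For reference, anomaly detection only needs to be enabled once before the training loop; note that it slows down training, so it is meant for debugging only:

import torch

# enable anomaly detection at the start of the script; the backward pass will then
# raise an error pointing at the operation that produced the NaN/Inf
torch.autograd.set_detect_anomaly(True)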

6 Likes

Thanks for the reply.
Can I just use gradient clipping? If yes, how should I choose the clip value?

If larger gradient magnitudes are expected and would thus create invalid values, you might clip the gradients. You could start with a max norm value of 1 or refer to a paper that uses a similar approach.
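
As a minimal sketch, clipping by norm goes between backward() and the optimizer step (the model, data, and max_norm value below are just dummies to show the placement of the call):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # dummy model and data, only to illustrate where clipping belongs
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
x, y = torch.randn(8, 10), torch.randn(8, 1)

for _ in range(5):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # clip the total gradient norm before the update; max_norm=1.0 is just a starting point
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()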

Note, however, that FloatTensors have a maximum value of:

print(torch.finfo().max)
> 3.4028234663852886e+38

so you should make sure that the NaNs are not created by an invalid operation.

4 Likes

By using torch.autograd.set_detect_anomaly(True) I found the error below. My dataset seems fine, so to resolve this issue should I use gradient clipping, or just ignore the NaN values using torch.isnan(x)?

RuntimeError: Function 'SmoothL1LossBackward' returned nan values in its 0th output

I would recommend trying to figure out what is causing the NaNs instead of ignoring them.
Based on the raised error, the loss function might either have created the NaNs itself or might have received them through its input.

To isolate it, you could try to make the script deterministic by following the reproducibility docs. Once it's deterministic and you can trigger the NaNs in a single step, you can check the parameters, inputs, gradients, etc. for the iteration that causes the NaNs.
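
As a sketch, such a check for the failing iteration could look like this (check_finite is just a hypothetical helper name; data and target stand for your batch):

import torch

def check_finite(model, data, target):
    # report NaN/Inf values in the inputs, targets, parameters, and gradients
    if not torch.isfinite(data).all():
        print("invalid values in the input")
    if not torch.isfinite(target).all():
        print("invalid values in the target")
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            print(f"invalid values in parameter {name}")
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"invalid values in the gradient of {name}")

# usage: call check_finite(model, data, target) right after loss.backward()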

2 Likes

I don't know if this applies to your case, and I made sure nothing was wrong with my data, but I see NaNs after some time when I use RMSprop but not with Adam. Maybe try changing your optimizer? A similar experience has been reported for Keras as well.

I'm also encountering a similar problem with my model. After a few iterations of training on graph data, the loss, which is the MSELoss between the returned output and a fixed label, becomes NaN.

Model:

import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv


class Model(nn.Module):
    def __init__(self, nin, nhid1, nout, inp_l, hid_l, out_l=1):
        super(Model, self).__init__()

        # two graph-convolution layers followed by a small linear head
        self.g1 = GCNConv(in_channels=nin, out_channels=nhid1)
        self.g2 = GCNConv(in_channels=nhid1, out_channels=nout)
        self.dropout = 0.5
        self.lay1 = nn.Linear(inp_l, hid_l)
        self.lay2 = nn.Linear(hid_l, out_l)

    def forward(self, x, adj):
        # adj is the edge_index in COO format expected by GCNConv
        x = F.relu(self.g1(x, adj))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.g2(x, adj)

        x = self.lay1(x)
        x = F.relu(x)
        x = self.lay2(x)
        # note: the final ReLU clamps negative values, so the output is always >= 0
        x = F.relu(x)

        return x

The inputs to the model:

x (Tensor, optional) – Node feature matrix with shape [num_nodes, num_node_features].
edge_index (LongTensor, optional) – Graph connectivity in COO format with shape [2, num_edges].

Here num_nodes=1000, num_node_features=1, and num_edges=5000.

[GCNConv](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv) is a graph embedding layer that returns a [num_nodes, dim] matrix.

I would recommend checking the input and targets for invalid values first.
If there are no invalid values, you could observe the loss and check if it's blowing up (if so, reduce the learning rate). If that's also not the case, you should check all intermediate activations to track down the operation that creates the NaNs, e.g. via forward hooks.

2 Likes

The inputs don't have any invalid values, and the loss is also close to 0. Can you please elaborate on how exactly to check the intermediate activations to detect the issue there? Thanks.

You can use forward hooks as described here to check all intermediate outputs for NaN values.
Since the inputs are valid and the loss doesn’t seem to explode, I guess a particular layer might create these invalid outputs, which are then propagated to the loss calculation.
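
A minimal sketch of such a NaN-detecting forward hook (register_nan_hooks is just a hypothetical helper name):

import torch
import torch.nn as nn

def register_nan_hooks(model: nn.Module):
    # print the name of the first submodule whose output contains NaN/Inf values
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"invalid output in module '{name}' ({module.__class__.__name__})")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# usage: register_nan_hooks(model) once before running the forward pass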

I checked again. My loss seems to be stuck and isn't changing with the number of epochs:

Epoch: 0001 loss_train: -0.0155 time: 0.0748s
Time Net =  0.07779312133789062
Epoch: 0051 loss_train: -0.0154 time: 0.0160s
Time Net =  1.1269891262054443
Epoch: 0101 loss_train: -0.0153 time: 0.0170s
Time Net =  1.9906792640686035
Epoch: 0151 loss_train: -0.0153 time: 0.0170s
Time Net =  2.8633458614349365

So I checked the gradient flow.

Apparently, increasing the number of linear layers still gives the same loss_train value, and the gradient flow remains the same.

What could be the reason here?

I would focus on the NaN issue first before diving into the model and checking the gradient flow, as the former issue is more concerning.
Were you able to isolate the first NaN output?

I figured out that the loss is becoming NaN due to an objective function which I'm using for unsupervised learning. One of the steps in this objective function is to normalize the output of the model, i.e. Y / ||Y||.

Whenever Y is a zero tensor, this normalization creates a NaN output. So there is some issue with the model, as the output sometimes comes out as a zero tensor. Also, as mentioned in the previous post, the loss doesn't change with an increasing number of epochs.
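
For illustration, this is how the zero tensor creates the NaN and how such a normalization is commonly guarded with a small eps (the guard only hides the symptom and does not explain the zero output):

import torch

y = torch.zeros(4, 8)                    # stands for an all-zero model output

naive = y / y.norm()                     # 0 / 0 -> NaN
safe = y / y.norm().clamp_min(1e-12)     # stays a zero tensor instead of becoming NaN

print(torch.isnan(naive).any(), torch.isnan(safe).any())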

So, what should I do to check why the output is coming out as a zero tensor?

I’m not sure what to check first. Is your objective function forcing the model to output a zero tensor in some way, which might then create the NaN outputs?

The architecture is as follows: the model output is fed into the objective function, and the output of the objective function together with the model output is fed into the loss function.

But the model output itself comes out as a zero tensor, which is then fed into the objective function. So, what should be done here? Any ideas why the model output comes out as a zero-valued tensor?
Thanks.

Also, the loss doesn't seem to be changing, as mentioned in the previous post. Is there a way to check why this might be happening?

The loss seems to decrease, but really slowly.
Checking the gradients, as you already did, is a valid way to see if you have accidentally broken the computation graph. Since that doesn't seem to be the case, you would have to verify whether your approach using the custom objective function etc. works at all.
To do so, I would recommend trying to overfit a small dataset, e.g. just 10 samples, and making sure your current training routine and model are able to overfit it.
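
As a rough sketch (assuming a map-style dataset returning (data, target) pairs and an already created model, criterion, and optimizer; for graph data you would adapt the batching accordingly):

from torch.utils.data import Subset, DataLoader

small_dataset = Subset(dataset, list(range(10)))    # keep just 10 fixed samples
loader = DataLoader(small_dataset, batch_size=10, shuffle=False)

for epoch in range(500):
    for data, target in loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
    if epoch % 50 == 0:
        # the loss should approach zero if the model and training routine work
        print(f"epoch {epoch}, loss {loss.item():.6f}")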

1 Like