I am trying to train an NN on my own custom dataset. I can get the train function to run, but the loss returns NaN and stays NaN for all 25 epochs. Here are the lines that produce the NaN loss:
loss = criterion(outputs, targets)
# backpropagation
loss.backward()
Anomaly detection reveals that the issue happens during backpropagation; here are the details of the error:
RuntimeError Traceback (most recent call last)
<ipython-input-185-c7fbdc8527d9> in <module>
1 # Train the NN
2 model = train_model(model, criterion, optimizer, exp_lr_scheduler,
----> 3 num_epochs=25)
2 frames
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
173 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
174 tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 175 allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
176
177 def grad(
RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
Based on threads about similar errors online, I tried adding a small epsilon = 1e-6, but that did not solve the issue either.
Check if your model output already contains invalid values, since the error message points to it:
torch.autograd.set_detect_anomaly(True)
criterion = nn.CrossEntropyLoss()
x = torch.tensor([[-10., float('Inf')]], requires_grad=True)
target = torch.tensor([0])
loss = criterion(x, target)
loss.backward()
# RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
If so, add debug print statements to your forward pass and check which operation is creating these Infs or NaNs.
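As a minimal sketch of what "debug print statements in the forward" can look like (the model and its layer names here are hypothetical stand-ins, not your actual network), you can test each intermediate activation with torch.isfinite and print as soon as one goes bad:

```python
import torch
import torch.nn as nn

class DebugNet(nn.Module):
    # Hypothetical two-layer model, used only to illustrate the debugging idea
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        x = self.fc1(x)
        # Flag the first operation whose output contains Inf/NaN
        if not torch.isfinite(x).all():
            print("NaN/Inf detected after fc1")
        x = torch.relu(x)
        x = self.fc2(x)
        if not torch.isfinite(x).all():
            print("NaN/Inf detected after fc2")
        return x

model = DebugNet()
out = model(torch.randn(4, 10))
print(torch.isfinite(out).all().item())
```

Once you know which layer first produces the invalid values, you can inspect its inputs and weights at that step.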
I got the error message above when running anomaly detection, i.e.
torch.autograd.set_detect_anomaly(True)
Sorry, I am unfamiliar with this error. Can you please be more specific about your suggestion “then add debug print statements to your forward and check which operation is creating these Infs or NaNs.”?
On a related note, the network does seem to train on the input data, but I am currently using only a tiny fraction of my actual training dataset as a sanity check. The NaN loss occurs once the accuracy stops improving; in essence:
Epoch 0/24
----------
training Loss: 21295768.1193 Acc: 0.2687
Epoch 1/24
----------
training Loss: nan Acc: 0.5250
Epoch 2/24
----------
training Loss: nan Acc: 0.5250
Epoch 3/24
----------
training Loss: nan Acc: 0.5250
Epoch 4/24
----------
training Loss: nan Acc: 0.5250
Your loss seems to explode, so I would recommend fixing that first, since it will create these issues once you run into an overflow. Make sure the initial loss starts at a reasonable value and decreases. Right now, the first printed loss value is already far too high (21295768.1193) and overflows to NaN afterwards.
No, I don’t think the small dataset is responsible for the huge loss value, as your model should be able to overfit it easily. I would instead check for other issues in your training code (e.g. whether you forgot to zero out the gradients, whether the target contains valid values, etc.).
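To make those checks concrete, here is a sketch of one training step with the usual ordering and sanity checks (all names are illustrative placeholders, since your train_model function is not shown):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)               # stand-in for your network
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(8, 10)
targets = torch.randint(0, 2, (8,))    # valid class indices for 2 classes: 0 or 1

# Check the targets are valid class indices in [0, num_classes)
assert targets.min() >= 0 and targets.max() < 2

optimizer.zero_grad()                  # easy to forget; gradients accumulate otherwise
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
print(torch.isfinite(loss).item())
```

If zero_grad is missing, gradients from previous iterations accumulate and can blow the loss up in exactly the way your log shows.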