I am trying to train an NN on my own custom dataset. I can get the train function to run, but the loss returns NaN and stays NaN for all 25 epochs. Here are the lines that produce the NaN loss:
loss = criterion(outputs, targets)
# backpropagation
loss.backward()
Anomaly detection reveals that the issue happens during backpropagation; here are the details of the error:
RuntimeError Traceback (most recent call last)
<ipython-input-185-c7fbdc8527d9> in <module>
1 # Train the NN
2 model = train_model(model, criterion, optimizer, exp_lr_scheduler,
----> 3 num_epochs=25)
2 frames
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
173 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
174 tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 175 allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
176
177 def grad(
RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
Based on threads about similar errors online, I tried adding a small epsilon = 1e-6, but that did not solve the issue either.
Check if your model output already contains invalid values, since the error message points to it:
torch.autograd.set_detect_anomaly(True)
criterion = nn.CrossEntropyLoss()
x = torch.tensor([[-10., float('Inf')]], requires_grad=True)
target = torch.tensor([0])
loss = criterion(x, target)
loss.backward()
# RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
If so, add debug print statements to your forward pass and check which operation is creating these Infs or NaNs.
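As a minimal sketch of what "debug print statements in the forward" can look like (the model and its layer names here are hypothetical stand-ins, not your actual network), you can test each intermediate activation with torch.isfinite and print as soon as one goes bad:

```python
import torch
import torch.nn as nn

class DebugNet(nn.Module):
    # Hypothetical two-layer model, used only to illustrate the debugging idea
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        x = self.fc1(x)
        # Flag the first operation whose output contains Inf/NaN
        if not torch.isfinite(x).all():
            print("NaN/Inf detected after fc1")
        x = torch.relu(x)
        x = self.fc2(x)
        if not torch.isfinite(x).all():
            print("NaN/Inf detected after fc2")
        return x

model = DebugNet()
out = model(torch.randn(4, 10))
print(torch.isfinite(out).all().item())
```

Once you know which layer first produces the invalid values, you can inspect its inputs and weights at that step.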
I got the error message above when running anomaly detection, i.e.
torch.autograd.set_detect_anomaly(True)
Sorry, I am unfamiliar with this error. Can you please be more specific about your suggestion “then add debug print statements to your forward and check which operation is creating these Infs or NaNs.”?
On a related note, the network does seem to train on the input data, but I am currently using only a tiny fraction of my actual training dataset as a sanity check. The NaN loss occurs once the accuracy stops improving; in essence:
Epoch 0/24
----------
training Loss: 21295768.1193 Acc: 0.2687
Epoch 1/24
----------
training Loss: nan Acc: 0.5250
Epoch 2/24
----------
training Loss: nan Acc: 0.5250
Epoch 3/24
----------
training Loss: nan Acc: 0.5250
Epoch 4/24
----------
training Loss: nan Acc: 0.5250
Your loss seems to explode, so I would recommend fixing that first, since it will create these issues once you run into an overflow. Make sure the initial loss starts at a reasonable value and decreases. Right now, the first printed loss value is already far too high (21295768.1193) and overflows to NaN afterwards.
No, I don’t think the small dataset is responsible for the huge loss value, as your model should be able to overfit it easily. I would instead check for other issues in your training code (e.g. whether you forgot to zero out the gradients, whether the target contains valid values, etc.).
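To make those checks concrete, here is a sketch of one training step with the usual ordering and sanity checks (all names are illustrative placeholders, since your train_model function is not shown):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)               # stand-in for your network
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(8, 10)
targets = torch.randint(0, 2, (8,))    # valid class indices for 2 classes: 0 or 1

# Check the targets are valid class indices in [0, num_classes)
assert targets.min() >= 0 and targets.max() < 2

optimizer.zero_grad()                  # easy to forget; gradients accumulate otherwise
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
print(torch.isfinite(loss).item())
```

If zero_grad is missing, gradients from previous iterations accumulate and can blow the loss up in exactly the way your log shows.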