I am training a classification model and getting nan loss values after some time. I used torch.autograd.set_detect_anomaly(True) to trace the error, and I get this as output:
RuntimeError: Function 'ExpBackward0' returned nan values in its 0th output.
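For reference, I enabled anomaly detection globally, before the training loop:

import torch

# makes backward() raise on the first nan it produces, at the cost of slower training
torch.autograd.set_detect_anomaly(True)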
I then added the following checks on the outputs, loss, and gradients in my training routine, to find out where the problem first arises:
batch_predictions = model(batch_inputs)
batch_loss = loss(batch_predictions, batch_targets)

# check that the forward pass produced finite outputs and a finite loss
# (batch_loss is checked as a tensor; torch.isfinite does not accept a Python float)
if torch.isfinite(batch_predictions).all() and torch.isfinite(batch_loss):
    print("outputs and loss ok")
else:
    print("outputs and loss not ok")

batch_loss.backward()

# check every gradient for nan values after the backward pass
grads = [p.grad for p in model.parameters() if p.requires_grad]
for grads_ in grads:
    if torch.isnan(grads_).any():
        print("grads nan")
    else:
        print("grads ok")

optimizer.step()
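To narrow it down, I can also run a per-parameter version of the same check after backward() (a small sketch reusing model from above, with names taken from named_parameters()), which reports which parameter's gradient turns nan first:

# sketch: report the first named parameter whose gradient contains nan
for name, p in model.named_parameters():
    if p.grad is not None and torch.isnan(p.grad).any():
        print(f"first nan gradient in: {name}")
        break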
After some time I get grads nan as output, but outputs and loss ok is printed immediately before it. This means that the outputs are ok and the loss is ok, but the gradient calculation in batch_loss.backward() produces nan gradients. I have tried changing the optimizer and reducing the learning rate, but nothing works. I am not sure why this is happening, or how to probe further and correct it. Thanks in advance for any help.