I am training a classification model and getting `nan` loss values after some time. I used `torch.autograd.set_detect_anomaly(True)` to trace the error and got this output:

`RuntimeError: Function 'ExpBackward0' returned nan values in its 0th output.`

I then added the following checks on the outputs, loss, and gradients in my training routine to find where the problem first arises:

```
batch_predictions = model(batch_inputs)
batch_loss = loss(batch_predictions, batch_targets)

# check the loss tensor directly: batch_loss.item() is a Python float,
# and torch.isfinite() expects a tensor
if torch.isfinite(batch_predictions).all() and torch.isfinite(batch_loss):
    print("outputs and loss ok")
else:
    print("outputs and loss not ok")

optimizer.zero_grad()  # clear accumulated gradients before backward()
batch_loss.backward()

grads = [p.grad for p in model.parameters() if p.requires_grad]
for grads_ in grads:
    if torch.isnan(grads_).any():
        print("grads nan")
    else:
        print("grads ok")

optimizer.step()
```

After some time I get `grads nan` as output, but the immediately preceding `outputs and loss ok` is also printed. This means the outputs are fine and the loss is fine, but the gradient computation in `batch_loss.backward()` produces `nan` gradients. I have tried changing the optimizer and reducing the learning rate, but nothing works. I am not sure why this is happening, or how to probe further and correct it. Thanks in advance for any help.
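For reference, here is the kind of per-parameter check I have been experimenting with to narrow things down: a helper that reports which named parameter is the first whose gradient contains non-finite values (`nan` or `inf`). This is a minimal, self-contained sketch on a toy `nn.Linear` layer, not my actual model; the `first_bad_grad` name and the deliberate `nan` injection are just for illustration.

```python
import torch
import torch.nn as nn

def first_bad_grad(model: nn.Module):
    """Return the name of the first parameter whose gradient
    contains non-finite values (nan or inf), or None if all are fine."""
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            return name
    return None

# Toy example: healthy gradients first, then a nan injected on purpose.
model = nn.Linear(4, 2)
out = model(torch.randn(8, 4)).sum()
out.backward()
print(first_bad_grad(model))            # all gradients finite -> None

model.weight.grad[0, 0] = float("nan")  # corrupt one gradient to demonstrate
print(first_bad_grad(model))            # -> "weight"
```

Calling this right after `batch_loss.backward()` at least tells me *which* parameter goes bad first, but not *why*.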