The NaNs appear because applying softmax and log as two separate operations is numerically unstable.
If you're using CrossEntropyLoss for training, you could instead apply F.log_softmax at the end of your model and use NLLLoss as the criterion; CrossEntropyLoss combines exactly these two operations internally, so the loss will be equivalent, but much more stable than a separate softmax + log.
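Here is a minimal sketch of the difference; the logits and target are made-up values chosen to trigger the underflow:

```python
import torch
import torch.nn.functional as F

# Hypothetical inputs: the large spread in logits makes the naive version underflow.
logits = torch.tensor([[100.0, 0.0, -100.0]])
target = torch.tensor([1])

# Unstable: softmax underflows to 0 for very negative logits,
# then log(0) = -inf, whose gradient becomes NaN in the backward pass.
naive = torch.log(F.softmax(logits, dim=1))   # ≈ [[0., -100., -inf]]

# Stable: the fused op stays in log space (log-sum-exp trick).
fused = F.log_softmax(logits, dim=1)          # ≈ [[0., -100., -200.]]

# Equivalent losses: CrossEntropyLoss == LogSoftmax + NLLLoss.
loss_ce = F.cross_entropy(logits, target)
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), target)
assert torch.allclose(loss_ce, loss_nll)
```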