My question is why is the loss 0.9048 if the prediction completely matches the target?
My guess is that it has something to do with the fact that this loss applies LogSoftmax internally, but then what is the correct interpretation of this result?
Is 0.9048 the minimum loss possible when using this criterion?

Edit: I decided to try using only NLLLoss and I’ve got a loss of

tensor([-1.])

Honestly, I was expecting it to be 0… Am I missing something about the definition of Loss? I mean aren’t we looking for the closest value to 0?

The short answer is that your prediction doesn’t completely match
your target.

Yes, this is the key to the explanation of what’s going on.

Your prediction (the input to CrossEntropyLoss) needs to be
understood as a set of logits* that get turned into probabilities
by CrossEntropyLoss's implicit softmax().

So, although your prediction for “class 2” is the largest, favored
prediction, you are, nonetheless, only predicting “class 2” with about
40% probability. So, you’re less than half right and get a non-zero loss.

Note that -log (0.4046) = 0.9048, correctly reproducing the value
you report for your loss.
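As a rough check of that arithmetic, here is a minimal sketch. It assumes the original prediction was `[0, 0, 1, 0, 0]` with target class 2 (one sample, five classes); those values are not quoted in this post, but they are consistent with the 0.4046 and 0.9048 figures above.

```python
import torch
import torch.nn.functional as F

# Assumed inputs (not quoted in the thread): one sample, five classes,
# a logit of 1 for class 2 and 0 for every other class.
logits = torch.tensor([[0.0, 0.0, 1.0, 0.0, 0.0]])
target = torch.tensor([2])

# The softmax that CrossEntropyLoss applies implicitly.
probs = F.softmax(logits, dim=1)
p_target = probs[0, 2]  # probability assigned to class 2

# Functional form of CrossEntropyLoss (mean over a batch of one).
loss = F.cross_entropy(logits, target)

print(f"p(class 2) = {p_target.item():.4f}")
print(f"-log(p)    = {-torch.log(p_target).item():.4f}")
print(f"loss       = {loss.item():.4f}")
```

The probability comes out as e / (e + 4), which is about 0.4046, and its negative log matches the reported loss.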

No, the minimum loss is zero. But to get it, you need to predict
“class 2” with a probability of 1, not a logit of 1. To get a probability
of 1, you need a logit of +inf (and a logit of -inf to get a probability
of 0). Because of the exp() in softmax(), 1000 is effectively inf.

So, if you use [-1000, -1000, 1000, -1000, -1000] as your
prediction, you will, in fact, be predicting “class 2” with (essentially)
100% probability, and you will get 0 for your loss.
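A small sketch of that extreme case, again assuming a single sample with target class 2:

```python
import torch
import torch.nn.functional as F

target = torch.tensor([2])

# Very large logit for class 2, very negative logits elsewhere.
confident = torch.tensor([[-1000.0, -1000.0, 1000.0, -1000.0, -1000.0]])

# softmax() is computed stably, so the huge values do not overflow;
# the other classes simply underflow to probability 0.
probs = F.softmax(confident, dim=1)
loss = F.cross_entropy(confident, target)

print(probs)
print(loss.item())
```

Here the probability for class 2 is 1 to floating-point precision, so the loss is zero.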

This is because NLLLoss expects log-probabilities for the predictions
you give it (rather than probabilities). Pass into NLLLoss the prediction
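The difference between the two criteria can be sketched as follows. This is only an illustration: it assumes the original prediction was `[0, 0, 1, 0, 0]` with target class 2, which is consistent with the -1 reported above but is not quoted in this post.

```python
import torch
import torch.nn.functional as F

# Assumed inputs (not quoted in the thread).
logits = torch.tensor([[0.0, 0.0, 1.0, 0.0, 0.0]])
target = torch.tensor([2])

# NLLLoss just picks out the target entry and negates it, so feeding it
# raw logits gives -logits[0, 2] = -1, which is not a meaningful loss.
raw = F.nll_loss(logits, target)

# Feeding it log-probabilities gives the same value as CrossEntropyLoss.
log_probs = F.log_softmax(logits, dim=1)
proper = F.nll_loss(log_probs, target)
ce = F.cross_entropy(logits, target)

print(raw.item(), proper.item(), ce.item())
```

In other words, CrossEntropyLoss behaves like log_softmax() followed by NLLLoss.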

Not exactly. The PyTorch optimizers are “looking for” the algebraically
smallest (least positive, most negative) loss, rather than the loss closest
to zero. (The optimizers don’t really care where zero is.) So training on
new_loss = old_loss - 10,000 will give you the same result as training
on old_loss.
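One way to see this is to compare gradients, since the gradients are all the optimizer ever uses. The sketch below uses assumed example logits and target; the names `logits_a` and `logits_b` are just for illustration.

```python
import torch
import torch.nn.functional as F

target = torch.tensor([2])

# Two independent copies of the same (assumed) logits, each tracking gradients.
logits_a = torch.tensor([[0.0, 0.0, 1.0, 0.0, 0.0]], requires_grad=True)
logits_b = logits_a.detach().clone().requires_grad_(True)

old_loss = F.cross_entropy(logits_a, target)
new_loss = F.cross_entropy(logits_b, target) - 10_000.0  # constant offset

old_loss.backward()
new_loss.backward()

# The constant has zero derivative, so both losses give the same gradient.
same_gradient = torch.allclose(logits_a.grad, logits_b.grad)
print(old_loss.item(), new_loss.item(), same_gradient)
```

The two loss values differ by 10,000, but the gradients agree, so any gradient-based update step would be the same.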

*) To be completely precise, the prediction is a set of logits ± some
arbitrary shift that gets washed away when you pass the shifted logits
through softmax().
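The shift invariance mentioned in the footnote can be checked with a short sketch; the logits and the shift of 5.0 are arbitrary values chosen for illustration.

```python
import torch
import torch.nn.functional as F

target = torch.tensor([2])
logits = torch.tensor([[0.0, 0.0, 1.0, 0.0, 0.0]])  # assumed example logits
shifted = logits + 5.0                               # same shift on every class

# Adding a constant to every logit multiplies numerator and denominator of
# softmax() by the same factor, so the probabilities do not change.
same_probs = torch.allclose(F.softmax(logits, dim=1), F.softmax(shifted, dim=1))
same_loss = torch.allclose(
    F.cross_entropy(logits, target), F.cross_entropy(shifted, target)
)

print(same_probs, same_loss)
```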