# About evaluating (and understanding?) the output of CrossEntropyLoss

So, while working on a language model, I was testing some individual values with `nn.CrossEntropyLoss`:

```python
import torch
import torch.nn as nn

toyloss = nn.CrossEntropyLoss(reduction='none')

toyinput = torch.zeros(1, 5)
toytarget = torch.zeros(1, dtype=torch.long)

toytarget[0] = 2      # target class is 2
toyinput[0, 2] = 1.0  # logit of 1 for class 2
toyoutput = toyloss(toyinput, toytarget)

print(toyoutput)
```

In short, the input and target are:

tensor([[0., 0., 1., 0., 0.]]) and tensor([2])

respectively, and the loss is:

tensor([0.9048])

My question is why is the loss 0.9048 if the prediction completely matches the target?
My guess is that it has something to do with the fact that this loss implements logSoftmax, but then, what would be the correct interpretation of this result?
Is 0.9048 the minimum loss possible when using this criterion?

Edit: I decided to try using only NLLLoss and I’ve got a loss of

tensor([-1.])

Honestly, I was expecting it to be 0… Am I missing something about the definition of Loss? I mean aren’t we looking for the closest value to 0?

Hi Daniel!

> My guess is that it has something to do with the fact that this loss implements logSoftmax

Yes, this is the key to the explanation of what’s going on.

Your prediction (the input to `CrossEntropyLoss`) needs to be
understood as a set of logits* that get turned into probabilities
by `CrossEntropyLoss`'s implicit `softmax()`:

```
softmax([0, 0, 1, 0, 0]) = [0.1488, 0.1488, 0.4046, 0.1488, 0.1488]
```

So, although your prediction for “class 2” is the largest, favored
prediction, you are, nonetheless, only predicting “class 2” with about
40% probability. So, you’re less than half right and get a non-zero loss.

Note that `-log (0.4046) = 0.9048`, correctly reproducing the value you got for the loss.
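You can check this chain by hand; here is a quick sketch using the same toy logits as above:

```python
import torch

# the same toy logits as above
logits = torch.tensor([0., 0., 1., 0., 0.])

# CrossEntropyLoss implicitly applies softmax to the logits
probs = torch.softmax(logits, dim=0)  # class 2 gets about 0.4046

# the loss for target class 2 is the negative log of its probability
loss = -torch.log(probs[2])           # about 0.9048
print(probs, loss)
```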

> Is 0.9048 the minimum loss possible when using this criterion?

No, the minimum loss is zero. But to get it, you need to predict
“class 2” with a probability of 1, not a logit of 1. To get a probability
of 1, you need a logit of `+inf` (and a logit of `-inf` to get a probability
of 0). Because of the `exp()` in `softmax()`, `1000` is effectively `inf`.

So, if you use `[-1000, -1000, 1000, -1000, -1000]` as your
prediction, you will, in fact, be predicting “class 2” with (essentially)
100% probability, and you will get 0 for your loss.
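As a sketch, you can verify this with the same `reduction='none'` criterion used in your toy example:

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss(reduction='none')

# huge logits saturate the softmax to (essentially) a one-hot distribution
logits = torch.tensor([[-1000., -1000., 1000., -1000., -1000.]])
target = torch.tensor([2])

loss = loss_fn(logits, target)
print(loss)  # tensor([0.])
```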

> Edit: I decided to try using only NLLLoss and I’ve got a loss of tensor([-1.])

This is because `NLLLoss` expects log-probabilities for the predictions
you give it (rather than probabilities). Pass into `NLLLoss` the prediction

```
log([0, 0, 1, 0, 0]) = [-inf, -inf, 0, -inf, -inf]
```

and you will get your expected loss of 0.
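A quick sketch of that check (note that `torch.log` maps 0 to `-inf`, so the one-hot probabilities become exactly the log-probabilities above):

```python
import torch
import torch.nn as nn

nll = nn.NLLLoss(reduction='none')

# log of the one-hot "probabilities": [-inf, -inf, 0, -inf, -inf]
log_probs = torch.log(torch.tensor([[0., 0., 1., 0., 0.]]))
target = torch.tensor([2])

loss = nll(log_probs, target)
print(loss)  # the loss is 0
```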

Not really.

> I mean aren’t we looking for the closest value to 0?

Not exactly. The pytorch optimizers are “looking for” the algebraically
smallest (least positive, most negative) loss, rather than the loss closest
to zero. (The optimizers don’t really care where zero is.) So training on
`new_loss = old_loss - 10000` will give you the same result
as training on `old_loss`.
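You can see this directly: subtracting a constant from the loss leaves the gradients (which are all the optimizer ever sees) unchanged. A minimal sketch:

```python
import torch
import torch.nn as nn

target = torch.tensor([2])

# gradient of the plain loss
logits_a = torch.zeros(1, 5, requires_grad=True)
nn.CrossEntropyLoss()(logits_a, target).backward()

# gradient of the loss shifted by a constant
logits_b = torch.zeros(1, 5, requires_grad=True)
(nn.CrossEntropyLoss()(logits_b, target) - 10000.0).backward()

# the constant shift contributes nothing to the gradient
print(torch.equal(logits_a.grad, logits_b.grad))  # True
```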

*) To be completely precise, the prediction is a set of logits ± some
arbitrary shift that gets washed away when you pass the shifted logits
through `softmax()`.
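That shift-invariance is easy to demonstrate:

```python
import torch

logits = torch.tensor([0., 0., 1., 0., 0.])

# shifting every logit by the same constant leaves softmax unchanged
shifted = logits + 100.0
print(torch.allclose(torch.softmax(logits, dim=0),
                     torch.softmax(shifted, dim=0)))  # True
```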

Best.

K. Frank