Validation Loss vs. Training Loss

I see in some code that they use cross entropy to calculate the training loss, but they calculate the validation loss as follows:

            p = F.softmax(prediction, dim=1).mean(dim=2)
            losses = {}
            losses['loss'] = - torch.log(p[range(p.shape[0]), target]).mean() 
            losses['xe'] = - torch.log(p[range(p.shape[0]), target]).mean()

I am wondering why they don't use cross_entropy here as well, and what this formula represents.

Hi Samster!

I can’t really think of a reason to use this not-quite-cross-entropy loss
for the validation data. (In general, I can’t think of a good reason to use
different losses for training and validation.)

This loss is very similar to cross entropy. The softmax() converts the
prediction (to be understood as logits) to probabilities, which are then
averaged over dimension-2. The actual cross-entropy piece of the
computation then takes the (negative) log() of these averaged probabilities
at the target classes and averages the result over the remaining (batch)
dimension.

Plain-vanilla cross entropy (averaged over the batch and any other dimensions)
simply averages the relevant negative log-probabilities over all of the
dimensions; that is, it takes the log first and averages afterward, whereas
the loss above averages the probabilities over dimension-2 first and only
then takes the log.
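
For concreteness, here is a small sketch of my own (not from the code you
posted) that checks this for a plain two-dimensional prediction: computing
the mean of the negative log-probabilities at the target classes by hand
gives the same value as cross_entropy():

    import torch
    import torch.nn.functional as F

    prediction = torch.randn (4, 5)       # (nBatch, nClass) logits
    target = torch.randint (5, (4,))      # (nBatch,) class labels

    # take the log of the probabilities first, then average over the batch
    log_p = F.log_softmax (prediction, dim = 1)
    by_hand = - log_p[range (prediction.shape[0]), target].mean()

    built_in = F.cross_entropy (prediction, target)

    print (torch.allclose (by_hand, built_in))   # prints True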

Note that in the special case that dimension-2 is of size 1, no actual
averaging takes place over dimension-2, so the computation reduces to
the conventional cross entropy.

Consider:

>>> print (torch.__version__)
1.13.0
>>>
>>> _ = torch.manual_seed (2022)
>>>
>>> prediction = torch.randn (3, 5, 2)
>>> target = torch.randint (5, (3,))
>>>
>>> # averages probabilities across dimension 2
>>> p = torch.nn.functional.softmax(prediction, dim=1).mean(dim=2)
>>> # then averages log-probabilities across remaining dimensions
>>> lossA = - torch.log(p[range(p.shape[0]), target]).mean()
>>>
>>> # averages log-probabilities across all dimensions
>>> lossB = torch.nn.functional.cross_entropy (prediction, target.unsqueeze (-1).expand (-1, prediction.shape[2]))
>>>
>>> lossA, lossB   # similar, but not the same
(tensor(1.6720), tensor(1.8071))
>>>
>>> prediction = torch.randn (3, 5, 1)   # dimension-2 is trivial so averaging doesn't matter
>>>
>>> # no actual average across dimension 2
>>> p = torch.nn.functional.softmax(prediction, dim=1).mean(dim=2)
>>> # conventional cross entropy -- log-probabilities averaged across all non-trivial dimensions
>>> lossA = - torch.log(p[range(p.shape[0]), target]).mean()
>>>
>>> # conventional cross entropy
>>> lossB = torch.nn.functional.cross_entropy (prediction, target.unsqueeze (-1).expand (-1, prediction.shape[2]))
>>>
>>> lossA, lossB   # the same
(tensor(1.6183), tensor(1.6183))

This does seem kind of oddball to me and I don’t see any motivation for
doing things this way, but, who knows, maybe there is some method to
the madness.
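
If the goal is just to report the same cross-entropy loss for validation as
for training, you could fill in the losses dict with cross_entropy() directly.
The following is only a sketch of mine (with dummy tensors standing in for
your real prediction and target, whose shapes I'm assuming from your snippet),
along the lines of lossB above:

    import torch
    import torch.nn.functional as F

    # dummy stand-ins for the real model output and labels; shapes assumed
    # from the snippet in the question: prediction is (nBatch, nClass, d2)
    # and target is (nBatch,)
    prediction = torch.randn (3, 5, 2)
    target = torch.randint (5, (3,))

    losses = {}
    # expand target across dimension-2 so cross_entropy averages the
    # negative log-probabilities over the batch and dimension-2
    losses['xe'] = F.cross_entropy (
        prediction,
        target.unsqueeze (-1).expand (-1, prediction.shape[2])
    )
    losses['loss'] = losses['xe']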

Best.

K. Frank
