Hi Samster!

I can’t really think of a reason to use this not-quite-cross-entropy loss for the validation data. (In general, I can’t think of a good reason to use different losses for training and validation.)

This loss is very similar to cross entropy. The `softmax()` converts the `prediction` (to be understood as *logits*) to probabilities that are then averaged over dimension-2. The actual cross-entropy piece of the computation then takes the `log()` of these (averaged) probabilities and averages the result over the remaining dimensions.
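Packaged as a function, the loss you describe would presumably look something like this sketch (the function name and the assumed shapes are mine, not from your code):

```
import torch

def not_quite_cross_entropy (prediction, target):
    # prediction: logits of shape [nBatch, nClass, d2]
    # target: integer class labels of shape [nBatch]
    probs = torch.nn.functional.softmax (prediction, dim = 1)  # convert logits to probabilities
    probs = probs.mean (dim = 2)                               # average probabilities over dimension-2
    logp = torch.log (probs[range (probs.shape[0]), target])   # log of the (averaged) target probabilities
    return  -logp.mean()                                       # average over the remaining (batch) dimension
```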

Plain-vanilla cross entropy (averaged over batch and any other dimensions) simply averages the relevant log-probabilities over *all* of the dimensions.
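Written out by hand (again a sketch, assuming the same shapes), this is:

```
import torch

def plain_cross_entropy (prediction, target):
    # prediction: logits of shape [nBatch, nClass, d2]
    # target: integer class labels of shape [nBatch]
    logp = torch.nn.functional.log_softmax (prediction, dim = 1)  # log-probabilities
    logp = logp[range (prediction.shape[0]), target]              # target log-probabilities, shape [nBatch, d2]
    return  -logp.mean()                                          # average over *all* dimensions
```

Up to floating-point round-off, this matches `cross_entropy()` with its default `reduction = 'mean'` (as `lossB` in the example below).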

Note that in the special case that dimension-2 is of size 1, no actual averaging takes place over dimension-2, so the computation reduces to the conventional cross entropy.

Consider:

```
>>> print (torch.__version__)
1.13.0
>>>
>>> _ = torch.manual_seed (2022)
>>>
>>> prediction = torch.randn (3, 5, 2)
>>> target = torch.randint (5, (3,))
>>>
>>> # averages probabilities across dimension 2
>>> p = torch.nn.functional.softmax(prediction, dim=1).mean(dim=2)
>>> # then averages log-probabilities across remaining dimensions
>>> lossA = - torch.log(p[range(p.shape[0]), target]).mean()
>>>
>>> # averages log-probabilities across all dimensions
>>> lossB = torch.nn.functional.cross_entropy (prediction, target.unsqueeze (-1).expand (-1, prediction.shape[2]))
>>>
>>> lossA, lossB # similar, but not the same
(tensor(1.6720), tensor(1.8071))
>>>
>>> prediction = torch.randn (3, 5, 1) # dimension-2 is trivial so averaging doesn't matter
>>>
>>> # no actual average across dimension 2
>>> p = torch.nn.functional.softmax(prediction, dim=1).mean(dim=2)
>>> # conventional cross entropy -- log-probabilities averaged across all non-trivial dimensions
>>> lossA = - torch.log(p[range(p.shape[0]), target]).mean()
>>>
>>> # conventional cross entropy
>>> lossB = torch.nn.functional.cross_entropy (prediction, target.unsqueeze (-1).expand (-1, prediction.shape[2]))
>>>
>>> lossA, lossB # the same
(tensor(1.6183), tensor(1.6183))
```

This does seem kind of oddball to me and I don’t see any motivation for doing things this way, but, who knows, maybe there is some method to the madness.

Best.

K. Frank