Cross entropy loss on text generation

I was reading up on some seq2seq models for translation, and I saw that in a very common model the loss used was cross entropy loss, applied to tensors with these shapes:

dimension sizes ->

trg = [(trg sent len - 1) * batch size]
output = [(trg sent len - 1) * batch size, output dim]

where output dim is the target vocabulary size. Now my question is: how does this loss work given that the shapes are not equal? What is it computing, and how is it computing it?

My one theory is that they apply output.argmax(1) first and then compute the loss.

nn.CrossEntropyLoss expects the output tensor to contain the class logits in dim1, while this particular dimension is missing in the target tensor, which contains the class indices.

A vanilla multi-class classification output would therefore be defined as [batch_size, nb_classes], while the target would only be [batch_size].
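As a minimal sketch of that vanilla case (sizes are made up for illustration), the logits carry a class dimension in dim1 while the target holds plain class indices:

```python
import torch
import torch.nn as nn

# hypothetical sizes, just for illustration
batch_size, nb_classes = 4, 10

criterion = nn.CrossEntropyLoss()

logits = torch.randn(batch_size, nb_classes)          # [batch_size, nb_classes], raw scores in dim1
target = torch.randint(0, nb_classes, (batch_size,))  # [batch_size], class indices only

loss = criterion(logits, target)  # scalar loss
```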

For a segmentation use case, you could use [batch_size, nb_classes, height, width] for the output and [batch_size, height, width] for the target.
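The same pattern extends to the segmentation shapes above; a sketch with made-up sizes, where the class dimension again sits in dim1 of the output and is absent from the target:

```python
import torch
import torch.nn as nn

# hypothetical segmentation-style sizes
batch_size, nb_classes, height, width = 2, 5, 8, 8

criterion = nn.CrossEntropyLoss()

logits = torch.randn(batch_size, nb_classes, height, width)          # [N, C, H, W]
target = torch.randint(0, nb_classes, (batch_size, height, width))   # [N, H, W], per-pixel class indices

loss = criterion(logits, target)  # averaged over all pixels by default
```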

In your example, it seems the temporal dimension was collapsed with the batch size into dim0.
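A sketch of that flattening, with made-up sizes matching the shapes from the question (output_dim standing in for the target vocab size): the time and batch dimensions are merged into dim0 before the loss is computed, so no argmax is involved and the class logits stay in dim1.

```python
import torch
import torch.nn as nn

# hypothetical seq2seq sizes
trg_len, batch_size, output_dim = 6, 3, 100  # output_dim = target vocab size

criterion = nn.CrossEntropyLoss()

# decoder outputs and targets with the first token dropped, as in the question
output = torch.randn(trg_len - 1, batch_size, output_dim)          # [trg_len-1, batch, vocab]
trg = torch.randint(0, output_dim, (trg_len - 1, batch_size))      # [trg_len-1, batch]

# collapse the temporal dimension into the batch dimension (dim0)
loss = criterion(output.view(-1, output_dim), trg.view(-1))
```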