How do I use nn.CrossEntropyLoss() for seq2seq when my prediction has size (BS, seq_len, vocab_size) and my target has size (BS, seq_len)? For example:
prediction = torch.randn(2, 3, 5, requires_grad=True) # (BS, seq_len, vocab_size)
target = torch.empty(2, 3, dtype=torch.long).random_(5) # (BS, seq_len)
prediction: # size = (2, 3, 5)
tensor([[[-1.3824, -1.4598, -0.3210, -0.2991, 0.2965],
[ 0.2591, -0.5094, -0.7029, 0.2963, -1.8912],
[ 2.0020, -1.1158, 1.1687, -0.5815, -0.4416]],
[[ 2.9818, 0.4093, 1.9568, 0.0664, -0.3604],
[-0.6369, -0.3365, -1.3922, -0.6929, -0.1229],
[ 0.6589, -1.3124, -2.0313, -1.4866, -1.8163]]], requires_grad=True)
target: # size = (2, 3)
tensor([[4, 3, 3],
[3, 2, 1]])
i.e. [-1.3824, -1.4598, -0.3210, -0.2991, 0.2965] holds the scores (raw logits, not probabilities) over my vocabulary for the first word of the first sequence in the batch; its argmax is index 4, which matches the ground-truth label 4 in target.
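For context, here is a minimal sketch of the two variants I am considering, assuming the default 'mean' reduction and no padding handling (no ignore_index); both should give the same loss value:

import torch
import torch.nn as nn

prediction = torch.randn(2, 3, 5, requires_grad=True)    # (BS, seq_len, vocab_size)
target = torch.empty(2, 3, dtype=torch.long).random_(5)  # (BS, seq_len)

loss_fn = nn.CrossEntropyLoss()

# Variant 1: move the class (vocab) dimension to position 1, since
# nn.CrossEntropyLoss expects input (N, C, d1, ...) and target (N, d1, ...).
loss_a = loss_fn(prediction.permute(0, 2, 1), target)    # input: (BS, vocab_size, seq_len)

# Variant 2: flatten batch and sequence dims into a single N dimension.
loss_b = loss_fn(prediction.reshape(-1, 5), target.reshape(-1))  # input: (BS*seq_len, vocab_size)

loss_a.backward()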