I am a bit puzzled by how to do this correctly: I want to predict a class label (nclasses possible) for each element (word) in a sequence (similar to POS tagging).
I tried a very simple architecture: an embedding layer that learns embdims-dimensional embeddings, a unidirectional LSTM with hidden size hidden, a linear layer from hidden to nclasses, and a Softmax layer to get the nclasses probabilities. For the LSTM I use batch_first=True.
So if I have a batch size of batchsize, max sequence length seqlen, the input shape for the LSTM is (batchsize, seqlen, embdims) and the output shape is (batchsize, seqlen, hidden).
The output of the linear layer is (batchsize, seqlen, nclasses) and the output of the Softmax has the same shape.
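A minimal sketch of the architecture and shapes described above, assuming PyTorch. The class name Tagger, the vocabsize parameter, and the concrete sizes are illustrative assumptions, not the actual code behind this question.

```python
import torch
import torch.nn as nn

class Tagger(nn.Module):
    """Hypothetical per-token tagger: embedding -> LSTM -> linear -> softmax."""

    def __init__(self, vocabsize, embdims, hidden, nclasses):
        super().__init__()
        self.embedding = nn.Embedding(vocabsize, embdims)
        self.lstm = nn.LSTM(embdims, hidden, batch_first=True)
        self.linear = nn.Linear(hidden, nclasses)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, tokens):
        emb = self.embedding(tokens)            # (batchsize, seqlen, embdims)
        lstmout, _ = self.lstm(emb)             # (batchsize, seqlen, hidden)
        linearout = self.linear(lstmout)        # (batchsize, seqlen, nclasses)
        predictions = self.softmax(linearout)   # same shape, rows sum to 1
        return linearout, predictions

# Placeholder sizes, chosen only for illustration
batchsize, seqlen, vocabsize, embdims, hidden, nclasses = 4, 7, 100, 16, 32, 5
model = Tagger(vocabsize, embdims, hidden, nclasses)
tokens = torch.randint(0, vocabsize, (batchsize, seqlen))
linearout, predictions = model(tokens)
print(linearout.shape, predictions.shape)
```

Returning both linearout and predictions makes it easy to pass either one to a loss function.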
Originally I wanted to use CrossEntropyLoss to calculate the loss. I calculated it as
crossentloss(predictions.view(-1, nclasses), labelindices.view(-1)), but this never worked, apparently because CrossEntropyLoss already includes the equivalent of the softmax layer. I think I understand that.
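To illustrate why feeding softmax outputs into CrossEntropyLoss is a problem, here is a small sketch with made-up logits that are confidently correct. Because the loss applies its own log-softmax, inputs restricted to [0, 1] cannot push the loss below log(1 + (nclasses - 1) / e). The variable names and values are assumptions for illustration only.

```python
import math
import torch
import torch.nn as nn

nclasses = 5
labels = torch.arange(nclasses)          # one example per class
logits = 20.0 * torch.eye(nclasses)      # confidently correct raw scores

crossentloss = nn.CrossEntropyLoss()

# Raw scores in: the loss can get arbitrarily close to zero
loss_on_logits = crossentloss(logits, labels)

# Probabilities in: softmax is effectively applied twice
loss_on_probs = crossentloss(torch.softmax(logits, dim=-1), labels)

# Lower bound reached when the input rows are exactly one-hot
floor = math.log(1 + (nclasses - 1) / math.e)

print(loss_on_logits.item(), loss_on_probs.item(), floor)
```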
I then changed things so that the input to the Softmax layer (the output of the linear layer) was used instead of its output, and calculated crossentloss(linearout.view(-1, nclasses), labelindices.view(-1)), but this did not work either.
What is the mistake? What puzzles me is that the CrossEntropyLoss function expects the shape (batchsize, classes) or (batchsize, classes, dim1, dim2 … ). So I essentially flattened the sequence and batch dimensions into the batch dimension, which I would expect to work, since the loss gets averaged over all those values anyway?
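As a sanity check on the flattening idea, the sketch below compares the flattened call with the (batchsize, classes, dim1) form on random tensors. With the default mean reduction the two should give the same value. The tensors are random stand-ins, not real model outputs, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batchsize, seqlen, nclasses = 4, 7, 5

# Random stand-ins for the linear layer output and the gold labels
linearout = torch.randn(batchsize, seqlen, nclasses)
labelindices = torch.randint(0, nclasses, (batchsize, seqlen))

crossentloss = nn.CrossEntropyLoss()  # default reduction="mean"

# Variant 1: merge batch and sequence dimensions into one
loss_flat = crossentloss(linearout.view(-1, nclasses), labelindices.view(-1))

# Variant 2: move classes to dim 1, giving (batchsize, nclasses, seqlen)
loss_perm = crossentloss(linearout.permute(0, 2, 1), labelindices)

print(loss_flat.item(), loss_perm.item())
```

If padded positions are present, they would also be averaged in unless they are masked out, for example through the ignore_index argument of CrossEntropyLoss.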
If instead of the Softmax layer I use a LogSoftmax layer and NLLLoss as the loss function, things seem to work much better.
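For comparison, a short sketch showing that LogSoftmax followed by NLLLoss gives the same value as CrossEntropyLoss applied directly to the linear layer output, and that exponentiating the log-probabilities recovers ordinary probabilities. The tensors are random placeholders with arbitrary sizes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
nclasses = 5
n = 28  # stands in for batchsize * seqlen after flattening

linearout = torch.randn(n, nclasses)
labelindices = torch.randint(0, nclasses, (n,))

# LogSoftmax + NLLLoss
logprobs = nn.LogSoftmax(dim=-1)(linearout)
loss_nll = nn.NLLLoss()(logprobs, labelindices)

# CrossEntropyLoss on the raw linear output
loss_ce = nn.CrossEntropyLoss()(linearout, labelindices)

# Probabilities can be recovered from the log-probabilities
probs = logprobs.exp()

print(loss_nll.item(), loss_ce.item())
```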
What would be the correct way to have a Softmax layer at the end (so I get the actual probabilities) and use an appropriate loss function that works with it?