The tutorial says at the beginning: “The first axis is the sequence itself, the second indexes instances in the mini-batch, and the third indexes elements of the input. We haven’t discussed mini-batching, so let’s just ignore that and assume we will always have just 1 dimension on the second axis.” This suggests that we will be dealing with batches of size 1.
But then the tutorial instantiates the nn.NLLLoss function and, in the first iteration, passes tag_scores of shape (5, 3) and targets of shape (5) to it. According to the docs of the loss function, the first dimension is the size of the mini-batch, which suggests that in the tutorial the batches are now of size 5.
Which is correct? Isn’t 5 the sequence length (the length of the “the dog ate the apple” sentence), not the mini-batch size?
If instead of 1 sequence of length 5 I had, say, a mini-batch of 32 sequences of length 5, what would be the shape of the nn.NLLLoss input then? (160, 3), (160)?
Yes, the sequence length is 5. The 3 stems from the 3 probabilities that you get for each token, since you have 3 possible tags in this simplified example. For example, if you replace
tag_scores = F.log_softmax(tag_space, dim=1)
with
tag_scores = F.softmax(tag_space, dim=1)
the 3 probabilities in each of the 5 rows should sum up to 1. The values in targets are 0, 1, or 2, reflecting which class/probability is the correct one for each token.
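For the tutorial's single sentence, the shapes passed to nn.NLLLoss therefore look roughly like this (a minimal sketch, not the tutorial's exact code):

import torch
import torch.nn as nn

tagset_size = 3   # DET, NN, V in the tutorial
seq_len = 5       # "the dog ate the apple"

# what the model produces: log-probabilities per token, shape (seq_len, tagset_size)
tag_scores = torch.randn(seq_len, tagset_size).log_softmax(dim=1)

# one correct tag index (0, 1, or 2) per token, shape (seq_len,)
targets = torch.tensor([0, 1, 2, 0, 1])

# for 2D input, NLLLoss treats dim 0 as the "mini-batch" dim,
# even though here it is really the sequence dimension
loss = nn.NLLLoss()(tag_scores, targets)
print(loss.item())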
Side note: I’m not sure, but extending this tutorial to mini-batches might be tricky. This tutorial uses a lot of view() commands that might cause problems.
So, if I understand correctly, if the output of my nn.LSTM layer has a shape of (BATCH_SIZE, SEQ_LENGTH, HIDDEN_DIM), then the arguments of the nn.NLLLoss instance (its input and target) should have the following shapes: (BATCH_SIZE * SEQ_LENGTH, HIDDEN_DIM) and (BATCH_SIZE * SEQ_LENGTH).
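In code, I picture something like this (a rough sketch with made-up sizes; the second dimension of the loss input would be the tagset size after hidden2tag):

import torch
import torch.nn as nn

BATCH_SIZE, SEQ_LENGTH, HIDDEN_DIM, TAGSET_SIZE = 32, 5, 6, 3

# stand-in for the LSTM output in batch-first layout
lstm_out = torch.randn(BATCH_SIZE, SEQ_LENGTH, HIDDEN_DIM)
hidden2tag = nn.Linear(HIDDEN_DIM, TAGSET_SIZE)

tag_scores = hidden2tag(lstm_out).log_softmax(dim=-1)           # (BATCH_SIZE, SEQ_LENGTH, TAGSET_SIZE)
targets = torch.randint(TAGSET_SIZE, (BATCH_SIZE, SEQ_LENGTH))  # (BATCH_SIZE, SEQ_LENGTH)

loss = nn.NLLLoss()(
    tag_scores.reshape(-1, TAGSET_SIZE),   # (BATCH_SIZE * SEQ_LENGTH, TAGSET_SIZE), e.g. (160, 3)
    targets.reshape(-1),                   # (BATCH_SIZE * SEQ_LENGTH,), e.g. (160,)
)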
You need to be careful! Have a look how the nn.LSTM layer has been defined:
self.lstm = nn.LSTM(embedding_dim, hidden_dim)
This means it uses the default input parameter batch_first=False. If you check the docs for nn.LSTM, the shape of lstm_out is (seq_len, batch_size, hidden_size) – note that this also assumes bidirectional=False.
Before you push this through self.hidden2tag, you need to get it into the shape of (batch_size, seq_len, hidden_size). The tutorial uses .view() for that, which I would highly discourage. I would recommend
lstm_out = lstm_out.transpose(1,0)
or
lstm_out = lstm_out.permute(1,0,2)
Then, after the following call on that “corrected” lstm_out
tag_space = self.hidden2tag(lstm_out)
tag_space will have a shape of (batch_size, seq_len, 3), which is what you want: for each sequence in your batch and each word in a sequence, you want the 3 probabilities.
In the general case – i.e., when using batches with more than 1 sequence – the shape of targets then is (batch_size, seq_len), where the values at each position of the tensor are 0, 1, or 2, indicating the correct class / label.
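To make the shapes concrete, here is a minimal sketch with made-up sizes (not the tutorial's code):

import torch
import torch.nn as nn

batch_size, seq_len, embedding_dim, hidden_dim, tagset_size = 4, 5, 10, 6, 3

lstm = nn.LSTM(embedding_dim, hidden_dim)       # batch_first=False (the default)
hidden2tag = nn.Linear(hidden_dim, tagset_size)

embeds = torch.randn(seq_len, batch_size, embedding_dim)     # (seq_len, batch, embedding_dim)
lstm_out, _ = lstm(embeds)                                   # (seq_len, batch, hidden_dim)

lstm_out = lstm_out.permute(1, 0, 2)                         # -> (batch, seq_len, hidden_dim)
tag_space = hidden2tag(lstm_out)                             # (batch, seq_len, tagset_size)
tag_scores = torch.log_softmax(tag_space, dim=-1)

targets = torch.randint(tagset_size, (batch_size, seq_len))  # one class index (0, 1, or 2) per token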
IMPORTANT: Since this tutorial uses only batches of size 1, things are simpler and the .view() command probably works. For batches of size > 1, I doubt this code will still work properly.
I have a complete example for an RNN-based NER tagger, which is the same setup but predicting NER tags for each token/word instead of POS tags. Here’s the notebook and the corresponding script. Note that it defines the nn.LSTM layer with bidirectional=True and batch_first=True. But conceptually, it’s the same setup/architecture.
Also important is the line
loss = criterion(outputs.permute(0,2,1), targets)
more specifically the .permute(0,2,1). Again, you have to check the docs for the loss function to see that the expected input shape is (batch_size, num_classes, seq_len).
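In other words, for the K-dimensional case the loss expects the class dimension right after the batch dimension, which is exactly what the permute achieves. A minimal sketch (using nn.NLLLoss here; the same shape rule applies to nn.CrossEntropyLoss):

import torch
import torch.nn as nn

batch_size, seq_len, num_classes = 4, 5, 3

# model outputs in (batch, seq_len, num_classes) layout
outputs = torch.randn(batch_size, seq_len, num_classes).log_softmax(dim=-1)
targets = torch.randint(num_classes, (batch_size, seq_len))   # (batch, seq_len)

criterion = nn.NLLLoss()
loss = criterion(outputs.permute(0, 2, 1), targets)           # input: (batch, num_classes, seq_len)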
I’m fairly certain that this only runs because batch_size=1 and wouldn’t work otherwise. Of course, having larger batches requires handling sequences of different lengths – hence this tutorial ignores it, I would think. In the linked notebook, I use a custom sampler that generates batches that always contain sequences of the same length.
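The basic idea of such a sampler can be sketched like this (a simplified illustration, not the actual code from the notebook):

from collections import defaultdict
import random

def batches_of_equal_length(sequences, batch_size):
    """Group sequence indices by length so every batch shares one common length."""
    by_length = defaultdict(list)
    for idx, seq in enumerate(sequences):
        by_length[len(seq)].append(idx)

    batches = []
    for indices in by_length.values():
        random.shuffle(indices)
        for i in range(0, len(indices), batch_size):
            batches.append(indices[i:i + batch_size])
    random.shuffle(batches)
    return batches

# usage: each batch can be stacked into a (batch, seq_len) tensor without padding
sentences = [["the", "dog", "barks"], ["a", "cat"], ["the", "cat", "sleeps"], ["dogs", "run"]]
print(batches_of_equal_length(sentences, batch_size=2))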
After calling my loss function similarly to how you do it in your notebook (criterion(outputs.permute(0,2,1), targets)), and setting batch_first to its default value and transposing the LSTM input / output correspondingly, everything worked as expected.
To be honest, I did not find a good reason for batch_first to be False by default (including on this forum): it seems like it only leads to two extra transpose(1, 0) calls before and after the LSTM layer, without readability / performance improvements, but that’s probably a separate topic.