The tutorial says at the beginning: “The first axis is the sequence itself, the second indexes instances in the mini-batch, and the third indexes elements of the input. We haven’t discussed mini-batching, so let’s just ignore that and assume we will always have just 1 dimension on the second axis.” This suggests that we will be dealing with batches of size 1.
But then the tutorial instantiates the nn.NLLLoss function and, in the first iteration, passes tag_scores of shape (5, 3) and targets of shape (5) to it. According to the docs of the loss function, the first dimension is the size of the mini-batch, which suggests that in the tutorial the batches are now of size 5.
Which is correct? Isn’t 5 the sequence length (the length of the “the dog ate the apple” sentence), not the mini-batch size?
If instead of 1 sequence of length 5 I had, say, a mini-batch of 32 sequences of length 5, what would be the shape of the nn.NLLLoss input then? (160, 3), (160)?
Yes, the sequence length is 5. The 3 stems from the 3 probabilities that you get for each token, since you have 3 possible tags in this simplified example. For example, if you replace
tag_scores = F.log_softmax(tag_space, dim=1)
with
tag_scores = F.softmax(tag_space, dim=1)
the 3 probabilities in each of the 5 rows should sum up to 1. The values in targets are 0, 1, or 2, reflecting which class/probability is the correct one for each token.
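For the tutorial's single sentence, the shapes passed to nn.NLLLoss therefore look roughly like this (a minimal sketch, not the tutorial's exact code):

import torch
import torch.nn as nn

tagset_size = 3   # DET, NN, V in the tutorial
seq_len = 5       # "the dog ate the apple"

# what the model produces: log-probabilities per token, shape (seq_len, tagset_size)
tag_scores = torch.randn(seq_len, tagset_size).log_softmax(dim=1)

# one correct tag index (0, 1, or 2) per token, shape (seq_len,)
targets = torch.tensor([0, 1, 2, 0, 1])

# for 2D input, NLLLoss treats dim 0 as the "mini-batch" dim,
# even though here it is really the sequence dimension
loss = nn.NLLLoss()(tag_scores, targets)
print(loss.item())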
Side note: I’m not sure, but extending this tutorial to mini-batches might be tricky. This tutorial uses a lot of view() commands that might cause problems.
So, if I understand correctly, if the output of my nn.LSTM layer has a shape of (BATCH_SIZE, SEQ_LENGTH, HIDDEN_DIM), then the arguments of the nn.NLLLoss instance (its input and target) should have the following shapes: (BATCH_SIZE * SEQ_LENGTH, HIDDEN_DIM) and (BATCH_SIZE * SEQ_LENGTH).
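In code, I picture something like this (a rough sketch with made-up sizes; the second dimension of the loss input would be the tagset size after hidden2tag):

import torch
import torch.nn as nn

BATCH_SIZE, SEQ_LENGTH, HIDDEN_DIM, TAGSET_SIZE = 32, 5, 6, 3

# stand-in for the LSTM output in batch-first layout
lstm_out = torch.randn(BATCH_SIZE, SEQ_LENGTH, HIDDEN_DIM)
hidden2tag = nn.Linear(HIDDEN_DIM, TAGSET_SIZE)

tag_scores = hidden2tag(lstm_out).log_softmax(dim=-1)           # (BATCH_SIZE, SEQ_LENGTH, TAGSET_SIZE)
targets = torch.randint(TAGSET_SIZE, (BATCH_SIZE, SEQ_LENGTH))  # (BATCH_SIZE, SEQ_LENGTH)

loss = nn.NLLLoss()(
    tag_scores.reshape(-1, TAGSET_SIZE),   # (BATCH_SIZE * SEQ_LENGTH, TAGSET_SIZE), e.g. (160, 3)
    targets.reshape(-1),                   # (BATCH_SIZE * SEQ_LENGTH,), e.g. (160,)
)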
You need to be careful! Have a look how the nn.LSTM layer has been defined:
self.lstm = nn.LSTM(embedding_dim, hidden_dim)
This means it uses the default input parameter batch_first=False. If you check the docs for nn.LSTM, the shape of lstm_out is (seq_len, batch_size, hidden_size) – note that this also assumes bidirectional=False.
Before you push this through self.hidden2tag, you need to get it into the shape of (batch_size, seq_len, hidden_size). The tutorial uses .view() for that, which I would highly discourage. I would recommend
lstm_out = lstm_out.transpose(1,0)
or
lstm_out = lstm_out.permute(1,0,2)
Then, after the following call on that “corrected” lstm_out
tag_space = self.hidden2tag(lstm_out)
tag_space will have a shape of (batch_size, seq_len, 3), which is what you want: for each sequence in your batch and each word in a sequence, you want the 3 probabilities.
In the general case – i.e., when using batches with more than 1 sequence – the shape of targets then is (batch_size, seq_len), where the values at each position of the tensor are 0, 1, or 2, indicating the correct class / label.
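To make the shapes concrete, here is a minimal sketch with made-up sizes (not the tutorial's code):

import torch
import torch.nn as nn

batch_size, seq_len, embedding_dim, hidden_dim, tagset_size = 4, 5, 10, 6, 3

lstm = nn.LSTM(embedding_dim, hidden_dim)       # batch_first=False (the default)
hidden2tag = nn.Linear(hidden_dim, tagset_size)

embeds = torch.randn(seq_len, batch_size, embedding_dim)     # (seq_len, batch, embedding_dim)
lstm_out, _ = lstm(embeds)                                   # (seq_len, batch, hidden_dim)

lstm_out = lstm_out.permute(1, 0, 2)                         # -> (batch, seq_len, hidden_dim)
tag_space = hidden2tag(lstm_out)                             # (batch, seq_len, tagset_size)
tag_scores = torch.log_softmax(tag_space, dim=-1)

targets = torch.randint(tagset_size, (batch_size, seq_len))  # one class index (0, 1, or 2) per token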
IMPORTANT: Since this tutorial uses only batches of size 1, things are simpler and the .view() command probably works. For batches of size > 1, I doubt this code will still work properly.
I have a complete example for an RNN-based NER tagger, which is the same setup but predicting NER tags for each token/word instead of POS tags. Here’s the notebook and the corresponding script. Note that it defines the nn.LSTM layer with bidirectional=True and batch_first=True. But conceptually, it’s the same setup/architecture.
Also important is the line
loss = criterion(outputs.permute(0,2,1), targets)
more specifically the .permute(0,2,1). Again, you have to check the docs for the loss function to see that the expected input shape is (batch_size, num_classes, seq_len).
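In other words, for the K-dimensional case the loss expects the class dimension right after the batch dimension, which is exactly what the permute achieves. A minimal sketch (using nn.NLLLoss here; the same shape rule applies to nn.CrossEntropyLoss):

import torch
import torch.nn as nn

batch_size, seq_len, num_classes = 4, 5, 3

# model outputs in (batch, seq_len, num_classes) layout
outputs = torch.randn(batch_size, seq_len, num_classes).log_softmax(dim=-1)
targets = torch.randint(num_classes, (batch_size, seq_len))   # (batch, seq_len)

criterion = nn.NLLLoss()
loss = criterion(outputs.permute(0, 2, 1), targets)           # input: (batch, num_classes, seq_len)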
I’m fairly certain that this only runs because batch_size=1 and wouldn’t work otherwise. Of course, having larger batches requires handling sequences of different lengths – hence this tutorial ignores it, I would think. In the linked notebook, I use a custom sampler that generates batches that always contain sequences of the same length.
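The basic idea of such a sampler can be sketched like this (a simplified illustration, not the actual code from the notebook):

from collections import defaultdict
import random

def batches_of_equal_length(sequences, batch_size):
    """Group sequence indices by length so every batch shares one common length."""
    by_length = defaultdict(list)
    for idx, seq in enumerate(sequences):
        by_length[len(seq)].append(idx)

    batches = []
    for indices in by_length.values():
        random.shuffle(indices)
        for i in range(0, len(indices), batch_size):
            batches.append(indices[i:i + batch_size])
    random.shuffle(batches)
    return batches

# usage: each batch can be stacked into a (batch, seq_len) tensor without padding
sentences = [["the", "dog", "barks"], ["a", "cat"], ["the", "cat", "sleeps"], ["dogs", "run"]]
print(batches_of_equal_length(sentences, batch_size=2))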
After calling my loss function similarly to how you do it in your notebook (criterion(outputs.permute(0,2,1), targets)), and setting batch_first to its default value and transposing the LSTM input / output correspondingly, everything worked as expected.
To be honest, I did not find a good reason for batch_first to be False by default (including on this forum): it seems like it only leads to two extra transpose(1, 0) calls before and after the LSTM layer, without readability / performance improvements, but that’s probably a separate topic.