# nn.LSTM with nn.NLLLoss: confusion between batch size and sequence length

Good evening!

I am trying to follow the Sequence Models and Long Short-Term Memory Networks tutorial. I’m confused about how to use batches with `nn.LSTM`.

The tutorial says at the beginning: “The first axis is the sequence itself, the second indexes instances in the mini-batch, and the third indexes elements of the input. We haven’t discussed mini-batching, so let’s just ignore that and assume we will always have just 1 dimension on the second axis.” This suggests that we will be dealing with batches of size 1.

But then the tutorial instantiates the `nn.NLLLoss` function and in the first iteration passes `tag_scores` of shape `(5, 3)` and `targets` of shape `(5)` to it. According to the docs of the loss function, the first dimension is the size of the mini-batch. This suggests that the batches in the tutorial are now of size 5.

Which is correct? Isn’t 5 the sequence length (the length of the “the dog ate the apple” sentence), not the mini-batch size?
If instead of 1 sequence of length 5 I had, say, a mini-batch of 32 sequences of length 5, what would be the shapes of the `nn.NLLLoss` arguments then? `(160, 3)` and `(160)`?

Thank you.

Yes, the sequence length is 5. The 3 stems from the 3 probabilities that you get for each token, since you have 3 possible tags in this simplified example. For example, if you replace

```
tag_scores = F.log_softmax(tag_space, dim=1)
```

with

```
tag_scores = F.softmax(tag_space, dim=1)
```

the 3 probabilities in each of the 5 rows should sum up to 1. The values in `targets` are 0, 1, or 2, reflecting which class/probability is the correct one for each token.
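To make the shapes concrete, here is a minimal sketch (the scores are random and the target values made up, but the shapes match the tutorial's 5-word sentence and 3 tags):

```
import torch
import torch.nn.functional as F

# Made-up raw scores: one row per token of "the dog ate the apple",
# one column per possible tag -- the same (5, 3) shape as in the tutorial.
tag_space = torch.randn(5, 3)

probs = F.softmax(tag_space, dim=1)
print(probs.sum(dim=1))  # each of the 5 rows sums to 1

tag_scores = F.log_softmax(tag_space, dim=1)  # what nn.NLLLoss expects
targets = torch.tensor([0, 1, 2, 0, 1])       # made-up correct tag per token

loss = torch.nn.NLLLoss()(tag_scores, targets)
print(loss)  # scalar loss averaged over the 5 tokens
```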

Side note: I’m not sure, but extending this tutorial to mini-batches might be tricky. This tutorial uses a lot of `view()` commands that might cause problems.


Thank you for the reply, @vdw!

So, if I understand correctly, if the output of my `nn.LSTM` layer has a shape of `(BATCH_SIZE, SEQ_LENGTH, HIDDEN_DIM)`, then (after `hidden2tag` and the log-softmax) the arguments of the `nn.NLLLoss` instance (its input and target) should have the following shapes: `(BATCH_SIZE * SEQ_LENGTH, NUM_TAGS)` and `(BATCH_SIZE * SEQ_LENGTH)`.

At least when I try calling the `nn.NLLLoss` instance this way, it works as expected.
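For reference, here is a minimal sketch of that flattening (all dimension values are made up; `hidden2tag` is a stand-in linear layer like the one in the tutorial):

```
import torch
import torch.nn as nn

BATCH_SIZE, SEQ_LENGTH, HIDDEN_DIM, NUM_TAGS = 32, 5, 6, 3

# Stand-in batch-first LSTM output and made-up per-token targets.
lstm_out = torch.randn(BATCH_SIZE, SEQ_LENGTH, HIDDEN_DIM)
targets = torch.randint(0, NUM_TAGS, (BATCH_SIZE, SEQ_LENGTH))

hidden2tag = nn.Linear(HIDDEN_DIM, NUM_TAGS)
tag_scores = torch.log_softmax(hidden2tag(lstm_out), dim=2)  # (32, 5, 3)

# Flatten the batch and sequence dims so NLLLoss sees (N, C) and (N,).
loss = nn.NLLLoss()(
    tag_scores.reshape(-1, NUM_TAGS),  # (160, 3)
    targets.reshape(-1),               # (160,)
)
print(loss)
```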

You need to be careful! Have a look how the `nn.LSTM` layer has been defined:

```
self.lstm = nn.LSTM(embedding_dim, hidden_dim)
```

This means it uses the default input parameter `batch_first=False`. If you check the docs for `nn.LSTM`, the shape of `lstm_out` is `(seq_len, batch_size, hidden_size)` – note that this also assumes `bidirectional=False`.

Before you push this through `self.hidden2tag`, you need to get it into the shape of `(batch_size, seq_len, hidden_size)`. The tutorial uses `.view()` for that, which I would strongly discourage. I would recommend

```
lstm_out = lstm_out.transpose(1,0)
```

or

```
lstm_out = lstm_out.permute(1,0,2)
```
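Both calls do the same thing on a 3-D tensor; a quick sketch (with made-up dimensions) to confirm:

```
import torch

# Made-up dimensions; the default batch_first=False layout.
seq_len, batch_size, hidden_size = 5, 32, 6
lstm_out = torch.randn(seq_len, batch_size, hidden_size)

a = lstm_out.transpose(1, 0)   # (32, 5, 6)
b = lstm_out.permute(1, 0, 2)  # (32, 5, 6)
print(torch.equal(a, b))       # True -- both just swap the first two axes
```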

Then, after the call on that “corrected” `lstm_out`

```
tag_space = self.hidden2tag(lstm_out)
```

`tag_space` will have a shape of `(batch_size, seq_len, 3)`, which is what you want: for each sequence in your batch and each word in a sequence, you want the 3 tag scores (one per class).

In the general case – i.e., when using batches with more than 1 sequence – the shape of `targets` is `(batch_size, seq_len)`, where the values at each position of the tensor are 0, 1, or 2, indicating the correct class/label.

IMPORTANT: Since this tutorial only uses batches of size 1, things are simpler and the `.view()` command happens to work. For batches of size > 1, I doubt this code will still work properly.
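To see why, here is a minimal sketch (made-up dimensions) showing that `.view()` happens to agree with `.permute()` only when the batch dimension is 1; with larger batches it silently mixes tokens across sequences:

```
import torch

seq_len, hidden = 2, 3

# batch_size = 1: view() happens to agree with permute(), so the tutorial runs
x1 = torch.randn(seq_len, 1, hidden)
print(torch.equal(x1.view(1, seq_len, hidden), x1.permute(1, 0, 2)))  # True

# batch_size = 2: view() reinterprets memory and mixes the two sequences
x2 = torch.randn(seq_len, 2, hidden)
print(torch.equal(x2.view(2, seq_len, hidden), x2.permute(1, 0, 2)))  # False
```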

I have a complete example for an RNN-based NER tagger, which is the same setup but predicting NER tags for each token/word instead of POS tags. Here’s the notebook and the corresponding script. Note that it defines the `nn.LSTM` layer with `bidirectional=True` and `batch_first=True`. But conceptually, it’s the same setup/architecture.

Also important is the line

```
loss = criterion(outputs.permute(0,2,1), targets)
```

more specifically the `.permute(0,2,1)`. Again, you have to check the docs for the loss function to see that the expected input shape is `(batch_size, num_classes, seq_len)`.
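A minimal sketch of that call (dimensions made up; `outputs` stands in for the model's log-probabilities in `(batch_size, seq_len, num_classes)` layout):

```
import torch
import torch.nn as nn

# Made-up dimensions and random stand-ins for model outputs and gold tags.
batch_size, seq_len, num_classes = 4, 5, 3
outputs = torch.log_softmax(torch.randn(batch_size, seq_len, num_classes), dim=2)
targets = torch.randint(0, num_classes, (batch_size, seq_len))

criterion = nn.NLLLoss()
# The class dimension must come second: (batch_size, num_classes, seq_len).
loss = criterion(outputs.permute(0, 2, 1), targets)
print(loss)
```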

I’m fairly certain that the tutorial's code only runs because `batch_size=1` and wouldn't otherwise. Of course, having larger batches requires handling sequences of different lengths – hence, I would think, the tutorial ignores this. In the linked notebook, I use a custom sampler that generates batches that always contain sequences of the same length.


Thank you for the detailed explanation, @vdw!

I tried:

1. Using `batch_size` bigger than 1,
2. Calling my loss function similarly to how you do it in your notebook (`criterion(outputs.permute(0,2,1), targets)`),
3. Setting `batch_first` to its default value and transposing the LSTM input / output accordingly.

Everything worked as expected.

To be honest, I did not find a good reason (including on this forum) for `batch_first` to be `False` by default: it seems to only lead to two extra `transpose(1, 0)` calls before and after the LSTM layer, with no readability or performance benefit, but that's probably a separate topic.
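For completeness, a small sketch (made-up dimensions) of what those two extra calls look like compared to `batch_first=True`:

```
import torch
import torch.nn as nn

# Made-up dimensions; batch is a batch-first input of shape (batch, seq, features).
embedding_dim, hidden_dim = 8, 6
batch = torch.randn(32, 5, embedding_dim)

# Default batch_first=False: transpose before and after the layer.
lstm = nn.LSTM(embedding_dim, hidden_dim)
out, _ = lstm(batch.transpose(1, 0))
out = out.transpose(1, 0)  # back to (batch, seq, hidden)

# batch_first=True: no transposes needed.
lstm_bf = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
out_bf, _ = lstm_bf(batch)
print(out.shape, out_bf.shape)  # both torch.Size([32, 5, 6])
```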