Stacking LSTM hidden states vs using only the last one for language models

In several tutorials on n-gram language models (e.g. this) the output of the LSTM is stacked:

out, _ = lstm(input)
out = out.contiguous().view(-1, self.n_hidden)

which has shape (batch_size * seq_length, hidden_features) and becomes the input to the next layer. The target is also reshaped to batch_size * seq_length.
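For reference, here's roughly how I understand that setup (the linear head fc, vocab_size, etc. are my own placeholder names, not from the tutorial):

import torch
import torch.nn as nn

batch_size, seq_length, emb_dim, n_hidden, vocab_size = 4, 5, 8, 16, 10

lstm = nn.LSTM(emb_dim, n_hidden, batch_first=True)
fc = nn.Linear(n_hidden, vocab_size)        # placeholder output head
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(batch_size, seq_length, emb_dim)             # already-embedded tokens
targets = torch.randint(0, vocab_size, (batch_size, seq_length))  # one target per position

out, _ = lstm(inputs)                       # (batch_size, seq_length, n_hidden)
out = out.contiguous().view(-1, n_hidden)   # (batch_size*seq_length, n_hidden)
logits = fc(out)                            # (batch_size*seq_length, vocab_size)
loss = criterion(logits, targets.view(-1))  # targets flattened to batch_size*seq_length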

I don’t understand this approach, as it’s the job of the LSTM to learn the relationships among the inputs in the sequence. The correct output should be

out, _ = lstm(input)
out = out[:, -1, :]

i.e., the last state for each sequence in the batch, which becomes the input to the next layer, with a target of size batch_size.
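For comparison, the last-step variant with the same placeholder modules, where targets_last would hold one next-word index per sequence:

targets_last = torch.randint(0, vocab_size, (batch_size,))  # one target per sequence

out, _ = lstm(inputs)          # (batch_size, seq_length, n_hidden)
out = out[:, -1, :]            # (batch_size, n_hidden): last step only
logits = fc(out)               # (batch_size, vocab_size)
loss = criterion(logits, targets_last)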

So why use the full sequence then?

The problem is that the article is not very well documented, so it’s not easy to see the important details without knowing what to look for. Let me try to compare your approach with the one in the article.

Let’s assume you have a long text based on which you want to build your language model, e.g.:
A B C D E F G H I J K L M N O …

Your approach is not wrong

out, _ = lstm(input)
out = out[:, -1, :]

but it assumes training data of the form (input sequence, next word). So your training data might look like:
A, B
A B, C
A B C, D
B C D, E
C D E, F

(assuming a max. length of 3 for the input sequences)
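Something like this (a toy helper I'm making up to illustrate, not code from the article) would generate those pairs:

tokens = "A B C D E F G".split()
max_len = 3

pairs = []
for i in range(1, len(tokens)):
    context = tokens[max(0, i - max_len):i]   # up to 3 preceding words
    pairs.append((context, tokens[i]))        # (input sequence, next word)

# pairs[:5]:
# (['A'], 'B'), (['A', 'B'], 'C'), (['A', 'B', 'C'], 'D'),
# (['B', 'C', 'D'], 'E'), (['C', 'D', 'E'], 'F')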

The method in the article does something different. The important detail is the comment # The targets, shifted by one. So here the dataset looks like:
A B C, B C D
B C D, C D E
C D E, D E F
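Again as a made-up toy helper, the shifted-target windows could be built like this:

tokens = "A B C D E F G".split()
window = 3

items = []
for i in range(len(tokens) - window):
    items.append((tokens[i:i + window],           # input window
                  tokens[i + 1:i + 1 + window]))  # the targets, shifted by one

# items[:3]:
# (['A', 'B', 'C'], ['B', 'C', 'D']),
# (['B', 'C', 'D'], ['C', 'D', 'E']),
# (['C', 'D', 'E'], ['D', 'E', 'F'])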

If you take all outputs of the LSTM – i.e., not just after the last step – you implicitly get (A, B) after the 1st step, (A B, C) after the 2nd step and (A B C, D) after the 3rd/last step for the first data item (A B C, B C D).

There’s no fundamental difference between the two approaches, but the second one is more efficient: a single forward pass over a window of length seq_length produces seq_length training signals instead of just one.
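A rough sketch of what "more efficient" means in practice (the module and tensor names are placeholders, not from the article):

import torch
import torch.nn as nn

vocab_size, emb_dim, n_hidden = 10, 8, 16
emb = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, n_hidden, batch_first=True)
fc = nn.Linear(n_hidden, vocab_size)
criterion = nn.CrossEntropyLoss()

window  = torch.tensor([[0, 1, 2]])   # A B C
shifted = torch.tensor([[1, 2, 3]])   # B C D

# One forward pass gives a prediction after every step (for B, C and D):
out, _ = lstm(emb(window))                    # (1, 3, n_hidden)
logits = fc(out.reshape(-1, n_hidden))        # (3, vocab_size)
loss = criterion(logits, shifted.reshape(-1)) # 3 loss terms from a single pass

# The last-step variant would need 3 separate forward passes
# (A -> B, A B -> C, A B C -> D) to collect the same 3 training signals.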


Thanks, it makes sense. So at each step the LSTM implicitly encodes the sequence so far, e.g. A, AB, ABC, etc. So if I had a batch of 2, (D | A, B, C) and (E | B, C, D), I would have these sequences for a four-gram:

Batch 1:
A -> A
B -> AB
C -> ABC

Batch 2:
B -> B
C -> BC
D -> BCD

So if I were to build a language model this way, I’d need a total of 2 × 3 × vocabulary_size output scores, with targets (B, C, D) and (C, D, E), rather than just (D) and (E).
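If I sketch that batch of 2 in (made-up) tensor form, I think the shapes would be:

import torch
import torch.nn as nn

vocab_size, emb_dim, n_hidden = 10, 8, 16
emb = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, n_hidden, batch_first=True)
fc = nn.Linear(n_hidden, vocab_size)
criterion = nn.CrossEntropyLoss()

inputs  = torch.tensor([[0, 1, 2],    # A B C
                        [1, 2, 3]])   # B C D
targets = torch.tensor([[1, 2, 3],    # B C D
                        [2, 3, 4]])   # C D E

out, _ = lstm(emb(inputs))                     # (2, 3, n_hidden)
logits = fc(out.reshape(-1, n_hidden))         # (2*3, vocab_size)
loss = criterion(logits, targets.reshape(-1))  # 6 training signals instead of 2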