BiLSTM does worse than regular LSTM?

Hi, it’s me again with questions about the language model example here.

When using a regular, unidirectional LSTM without weight tying, I’m getting perplexities of around 100, which is expected. Just by switching the model to a bidirectional LSTM (plus the changes that requires), I’m getting perplexities of around 1 on the test set, which doesn’t make any sense.

Does anyone have an idea why this is happening? If it were overfitting, I’d expect the test perplexities to be much higher, but they are also around 1…


I think I understand the problem. Since the target for language modeling is just the input sequence offset by one, the network “cheats”: the backward direction of the LSTM has already read the word that comes next, so the model is nearly 100% confident of what the next word in the sequence will be.
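Here is a minimal sketch of how one could check this. It is not the code from the example; the helper name `future_input_gradient` and the sizes are made up for illustration. It measures how much the LSTM output at step `t` depends on inputs from later steps, by looking at the gradient of that output with respect to the inputs.

```python
import torch
import torch.nn as nn

def future_input_gradient(bidirectional: bool, t: int = 2) -> float:
    """Sum of |d out[t] / d x[s]| over all later steps s > t (hypothetical helper)."""
    torch.manual_seed(0)
    lstm = nn.LSTM(input_size=4, hidden_size=8, bidirectional=bidirectional)
    # Random stand-in for embedded tokens, shape (seq_len, batch, features).
    x = torch.randn(6, 1, 4, requires_grad=True)
    out, _ = lstm(x)
    # Backpropagate from the output at step t only.
    out[t].sum().backward()
    # Any nonzero gradient here means out[t] can "see" future inputs.
    return x.grad[t + 1:].abs().sum().item()

print("unidirectional:", future_input_gradient(False))
print("bidirectional: ", future_input_gradient(True))
```

The unidirectional value should be exactly zero and the bidirectional one should not be. In a language model the target at step `t` is the input at step `t + 1`, so a nonzero value means the answer is leaking into the prediction.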

I tried gradient clipping and weight tying, as used in the example I gave in the last post, but neither of them stops the network from cheating.

The solution for this is probably to do what the BERT people did and train the language model on another task (they mask some of the input words and predict those) instead of predicting the next word in the sequence.
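A rough sketch of what such a masked-token setup could look like, as an assumption-laden illustration rather than BERT's actual recipe: the function name `mask_tokens`, the `mask_id` value and the vocabulary range are hypothetical, and BERT's real scheme also sometimes swaps in random tokens or keeps the original one instead of always using the mask token.

```python
import torch

def mask_tokens(tokens, mask_id, mask_prob=0.15, ignore_index=-100):
    """Return (masked inputs, labels) for masked-token prediction (simplified sketch)."""
    # Pick roughly mask_prob of the positions at random.
    chosen = torch.rand(tokens.shape) < mask_prob
    # Hide the chosen tokens from the model.
    inputs = tokens.clone()
    inputs[chosen] = mask_id
    # Only the hidden positions get a real label; the rest are ignored by the loss.
    labels = torch.full_like(tokens, ignore_index)
    labels[chosen] = tokens[chosen]
    return inputs, labels

torch.manual_seed(0)
tokens = torch.randint(5, 100, (2, 12))  # fake token ids; id 0 is reserved for the mask
inputs, labels = mask_tokens(tokens, mask_id=0)
```

Since the masked words never appear in the input, a bidirectional model cannot just copy them from the other direction. The value `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, so the labels can be passed to that loss directly.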
