LSTM hidden state logic

Does it mean we are retaining the hidden states for each batch (not timesteps)? Why would one want to do that?

Yes, exactly. Think of the hidden states as the network's running memory of the sequence. If you reset them after each batch, the network cannot carry any information across batch boundaries. The hidden state (together with the current input) drives the gates (input, forget, output) of the LSTM, and it carries information about what the network has seen so far. Therefore, your output depends not only on the most recent input, but also on the data the network has seen in the past. This is the whole idea of the LSTM: it mitigates the long-term dependency problem. Read this excellent blog post for further information:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
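
Here is a minimal sketch of what "retaining the hidden states for each batch" can look like in PyTorch. The class name `StatefulLSTM` and the sizes are made up for illustration; the key point is that the `(h, c)` pair is stored on the module and reused (after detaching) for the next batch:

```python
import torch
import torch.nn as nn

class StatefulLSTM(nn.Module):
    """Keeps the LSTM hidden/cell states across batches ("stateful" usage)."""

    def __init__(self, input_size=8, hidden_size=16, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.hidden = None  # (h, c); created lazily on the first batch

    def forward(self, x):
        if self.hidden is None:
            # First batch: nn.LSTM zero-initializes the states internally.
            out, self.hidden = self.lstm(x)
        else:
            # Detach so backprop does not reach back into previous batches.
            h, c = self.hidden
            out, self.hidden = self.lstm(x, (h.detach(), c.detach()))
        return out

model = StatefulLSTM()
for batch in [torch.randn(4, 10, 8) for _ in range(3)]:
    out = model(batch)  # states from the previous batch are carried over
```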

If I want to initialize the hidden state (e.g. randomly) but not retain it across batches, how should I do it?

Well, if you really want to, don’t save the hidden states as a class variable; instead, initialize them anew for every batch. You could do this, e.g., by supplying them to your forward function as additional inputs and re-initializing them at every iteration of your training loop. Again, I don’t recommend doing this, because it defeats the purpose of an LSTM.
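
A hedged sketch of that alternative, assuming the same hypothetical sizes as above: the states are drawn randomly for each batch and passed into `forward`, so nothing is carried over between batches:

```python
import torch
import torch.nn as nn

class PerBatchLSTM(nn.Module):
    """Hidden/cell states are supplied to forward, not stored on the module."""

    def __init__(self, input_size=8, hidden_size=16, num_layers=1):
        super().__init__()
        self.num_layers, self.hidden_size = num_layers, hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)

    def forward(self, x, state):
        out, state = self.lstm(x, state)
        return out, state

model = PerBatchLSTM()
for batch in [torch.randn(4, 10, 8) for _ in range(3)]:
    # Fresh random states for every batch (shape: num_layers x batch x hidden).
    h0 = torch.randn(model.num_layers, batch.size(0), model.hidden_size)
    c0 = torch.randn(model.num_layers, batch.size(0), model.hidden_size)
    out, _ = model(batch, (h0, c0))
```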
