LSTM hidden state logic

leolcling · June 17, 2019, 3:57am

Hi,

I am a bit confused about hidden state in LSTM. I am reading this tutorial, and in the forward method of the model, self.hidden is used as inputs h_0.

class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Before we've done anything, we dont have any hidden state.
        # Refer to the Pytorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim)),
                autograd.Variable(torch.zeros(1, 1, self.hidden_dim)))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(
            embeds.view(len(sentence), 1, -1), self.hidden)  # <- here self.hidden is used as h_0
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))  
        tag_scores = F.log_softmax(tag_space)
        return tag_scores

Does it mean we are retaining the hidden states for each batch (not timesteps)? Why would one want to do that?
If I want to initialize hidden state (e.g. randomly) but not retaining it for different batch, how should I do it? Thanks.

engelnico · June 17, 2019, 6:49am

Does it mean we are retaining the hidden states for each batch (not timesteps)? Why would one want to do that?

Yes exactly. Think of the hidden states as the learned weights. If you reset the hidden states after each batch, your network is essentially learning nothing. The hidden states control the gates (input, forget, output) of the LSTM and they carry information about what the network has seen so far. Therefore, your output depends not only on the most recent input, but also data it has seen in the past. This is the whole idea of the LSTM, it “removes” the long-term dependency problem. Read this excellent blog post for further information:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/

If I want to initialize hidden state (e.g. randomly) but not retaining it for different batch, how should I do it?

Well, if you really want to, don’t save the hidden states as a class variable and instead initialiaze them after every batch. This could be done, e.g. by supplying them to your forward function as additional input, and in your train for loop you initialize them after every iteration. Again, I don’t recommend doing that because it would destroy the purpose of an LSTM.

leolcling · June 17, 2019, 7:07am

Thank you @engelnico. I think I am still confused on the reason we want to retain hidden state for different batch. Mainly because I read this https://discuss.pytorch.org/t/when-to-call-init-hidden-for-rnn/11518/7:

I think you should call
hidden = net.init_hidden(batch_size)
for every batch because, the hidden state after a batch pass contains information
about the whole previous batch. At test time you’d only have a new hidden state for every sentence so you probably want to train for that.

The argument in this comment is since we will use initial hidden states (say 0) in test time, we would want the model to learn that by implicitly resetting the hidden states for each samples (so the hidden states only pass across time, but not samples). Am I missing anything here? Thanks.

engelnico · June 17, 2019, 8:05am

I think i misunderstood your first question, I’m sorry…
There are usually two different modes for LSTM, stateless and stateful.

Stateless Mode updates the parameters for one batch and when the next batch comes, it will initialize the states again (with zeros). Therefore, the “cell memory” is reset after every batch.

Stateful Mode is the one I described above. The final states of the first batch is set to the initial state in the next batch. So it memorizes what was learned before

It is up to you and most importantly the type of data you use, which mode is the right one for you. The question you have to ask: Is your data time dependent across different batches? If yes -> Stateful, if not -> Stateless. And from your initial post, I guess it is not time dependent, so stateless mode is probably the right one for you.

Again sorry for the confusion

leolcling · June 18, 2019, 3:55am

Thank you @engelnico. Do you also know the behavior if I do not specify the hidden states (it will set to zero by default)? Is it stateful or stateless?

engelnico · June 24, 2019, 6:24am

Yes, if for the input (h_0, c_0) is not provided, both h_0 and c_0 default to zero. (source: documentation)

Rui_Ferreira · June 12, 2023, 3:37pm

Hi everyone, I have a question regarding the hidden and cell states. Do we initialize hidden state of LSTM during validation or testing or use the weights learned during training? (Do we initialize hidden state of LSTM during validation or testing or use the weights learned during training?)