How to properly use hidden states for RNN

Hi everyone,

I’ve started using PyTorch and I really love it. However, I was wondering how to correctly use hidden states in LSTM or GRU networks.

From what I understood from the tutorial, we should reinitialize the hidden state before each sample (as well as the cell state in an LSTM).

Let’s suppose I have:

        if self.mode == 'GRU':
            self.document_rnn = nn.GRU(embedding_size, embedding_size, num_layers=self.nb_layers, bias=True, dropout=self.dropout, bidirectional=False, batch_first=True)
        elif self.mode == 'LSTM':
            self.document_rnn = nn.LSTM(embedding_size, embedding_size, num_layers=self.nb_layers, bias=True, dropout=self.dropout, bidirectional=False, batch_first=True)
        self.document_rnn_hidden = self.init_hidden()

and

    def init_hidden(self):
        # initial hidden state, shape (num_layers, batch, hidden_size)
        document_rnn_init_h = nn.Parameter(
            nn.init.xavier_uniform_(torch.empty(self.nb_layers, self.batch_size, self.embedding_size)),
            requires_grad=True)
        if self.mode == 'GRU':
            return document_rnn_init_h
        elif self.mode == 'LSTM':
            # an LSTM also needs an initial cell state
            document_rnn_init_c = nn.Parameter(
                nn.init.xavier_uniform_(torch.empty(self.nb_layers, self.batch_size, self.embedding_size)),
                requires_grad=True)
            return (document_rnn_init_h, document_rnn_init_c)

Is it correct to do something like this?

for epoch in range(nb_epochs):
    for sample in samples():
        model.train(mode=True)
        optimizer.zero_grad()
        model.document_rnn_hidden = model.init_hidden()
        .... = model(xxx)
        loss = ...
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.gradient_clipping)
        optimizer.step()

I’ve seen this here: http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html
But I’m confused because they don’t reinitialize the hidden states after training. Why?

Thank you very much for your help!

They should.

To answer your first question: when your batches of data are independent short sequences, for example sentences of text, you should reinitialise the hidden state before each batch. But if your data is made up of really long sequences, like stock price data, and you cut it up into batches making sure that each batch follows on from the previous batch, then you wouldn’t reinitialise the hidden state before each batch.
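For concreteness, here is a minimal sketch of the two regimes, assuming a model whose forward pass accepts and returns the hidden state and exposes an init_hidden() method like yours; independent_batches, consecutive_batches, criterion and targets are placeholder names:

    # Case 1: independent short sequences (e.g. sentences) -- fresh state per batch
    for batch, targets in independent_batches:
        optimizer.zero_grad()
        hidden = model.init_hidden()          # reinitialise before every batch
        output, hidden = model(batch, hidden)
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()

    # Case 2: one long sequence cut into consecutive batches (stateful / truncated BPTT)
    hidden = model.init_hidden()              # initialise only once
    for batch, targets in consecutive_batches:
        optimizer.zero_grad()
        # keep the values, but detach so the graph doesn't grow across batches
        if isinstance(hidden, tuple):         # LSTM: (h, c)
            hidden = tuple(h.detach() for h in hidden)
        else:                                 # GRU: h
            hidden = hidden.detach()
        output, hidden = model(batch, hidden)
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()

In the second case the detach() keeps the hidden values flowing from batch to batch while stopping backpropagation from reaching all the way back through the previous batches.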

Great! Thank you very much for your answer, it’s much clearer now!

By the way, I was wondering: does it make sense to treat h & c as Parameters with requires_grad=True? Shouldn’t they just be Variables with requires_grad=False?

h & c at each timestep are calculated from their previous values and the new input. It wouldn’t make sense to treat them as parameters and update them as you would with weights.

The only situation in which that could make sense would be if you wanted to train the initial state of h & c, and even then, only the initial values of h & c could be treated as Parameters with requires_grad=True.
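For example, here is a rough sketch of that situation (my own illustration, not something from the tutorial): only the initial h0/c0 are registered as parameters and expanded to the batch size inside forward, while the per-timestep states that the LSTM returns remain ordinary activations:

    import torch
    import torch.nn as nn

    class DocumentRNN(nn.Module):
        def __init__(self, embedding_size, nb_layers):
            super().__init__()
            self.rnn = nn.LSTM(embedding_size, embedding_size,
                               num_layers=nb_layers, batch_first=True)
            # only the *initial* h and c are learnable parameters
            self.h0 = nn.Parameter(torch.zeros(nb_layers, 1, embedding_size))
            self.c0 = nn.Parameter(torch.zeros(nb_layers, 1, embedding_size))

        def forward(self, x):  # x: (batch, seq_len, embedding_size)
            batch_size = x.size(0)
            # expand the learned initial state to the current batch size
            h0 = self.h0.expand(-1, batch_size, -1).contiguous()
            c0 = self.c0.expand(-1, batch_size, -1).contiguous()
            output, (h_n, c_n) = self.rnn(x, (h0, c0))
            # h_n / c_n are ordinary activations here, not parameters
            return output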

Thank you very much for confirming my thoughts!