Adding a new hidden layer to an LSTM

Hi, everyone!

My question is plain and simple. How can I add a new hidden layer to an LSTM?
I’m working through this tutorial:

http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#sphx-glr-beginner-nlp-sequence-models-tutorial-py

I want to add an extra hidden layer to the LSTM model, but when I try to set the number of layers higher than 1, I run into trouble:

EMBEDDING_DIM = 6
HIDDEN_DIM = 9
hidden_layers = 3

class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, hidden_layers, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim


        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, hidden_layers)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim)),
                autograd.Variable(torch.zeros(1, 1, self.hidden_dim)))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

IndexError: list index out of range

Then I tried adding an extra LSTM layer on top of the existing one, like this:

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
        lstm_out, self.hidden = self.lstm(embeds.view(len(lstm_out), 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

It somehow works without errors, but I still can’t see the extra layer in model.modules().
The output looks like this:

LSTMTagger(
  (word_embeddings): Embedding(9, 6)
  (lstm): LSTM(6, 9)
  (hidden2tag): Linear(in_features=9, out_features=3)
)
Embedding(9, 6)
LSTM(6, 9)
Linear(in_features=9, out_features=3)

I hope someone can help me with that. Thanks.

How about doing this?

self.hidden2tag = nn.Linear(hidden_dim*hidden_layers, tagset_size)

Another thing to help you debug: are you initializing the hidden state for hidden_layers layers or for just 1 layer?

Thanks for replying so fast.
But I still can’t get it to work.
Can you fit it into my code? My attempts lead to new errors.

This is how I do it now:

class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, hidden_layers, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim


        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, hidden_layers)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()
        self.hidden2 = self.init_hidden2()

    def init_hidden(self):
        # Before we've done anything, we don't have any hidden state.
        # Refer to the Pytorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim)),
                autograd.Variable(torch.zeros(1, 1, self.hidden_dim)))
    def init_hidden2(self):
        # Before we've done anything, we don't have any hidden state.
        # Refer to the Pytorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim)),
                autograd.Variable(torch.zeros(1, 1, self.hidden_dim)))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
       
        lstm_out, self.hidden2 = self.lstm(embeds.view(len(lstm_out), 1, -1), self.hidden2)

        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

And I get RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

In your original model, you need to modify the initialisation of the hidden state so that its first dimension matches the number of layers.

def __init__(self, embedding_dim, hidden_dim, hidden_layers, vocab_size, tagset_size):
    ...
    self.hidden_layers = hidden_layers
    ...

def init_hidden(self):
    return (autograd.Variable(torch.zeros(self.hidden_layers, 1, self.hidden_dim)),
        autograd.Variable(torch.zeros(self.hidden_layers, 1, self.hidden_dim)))

That should suffice for adding layers to the LSTM with your original code.
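
For readers on a recent PyTorch release, here is a minimal sketch of the same idea without the autograd.Variable wrapper (plain tensors have been enough since PyTorch 0.4). The constants, the free-standing init_hidden helper, and the random stand-in for the embeddings are assumptions made for illustration; they are not the tutorial's code.

```python
import torch
import torch.nn as nn

EMBEDDING_DIM, HIDDEN_DIM, HIDDEN_LAYERS = 6, 9, 3

# One LSTM module holding three stacked layers.
lstm = nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, HIDDEN_LAYERS)


def init_hidden(num_layers, hidden_dim, batch_size=1):
    # Axes are (num_layers, minibatch_size, hidden_dim) for both h_0 and c_0.
    return (torch.zeros(num_layers, batch_size, hidden_dim),
            torch.zeros(num_layers, batch_size, hidden_dim))


seq_len = 5
embeds = torch.randn(seq_len, 1, EMBEDDING_DIM)  # stand-in for word embeddings
hidden = init_hidden(HIDDEN_LAYERS, HIDDEN_DIM)
lstm_out, (h_n, c_n) = lstm(embeds, hidden)

print(lstm_out.shape)  # output of the top layer for every time step
print(h_n.shape)       # final hidden state of every layer
```

Note that the last dimension of lstm_out stays hidden_dim however many layers are stacked, because only the top layer's output is returned, so in this sketch a following nn.Linear(hidden_dim, tagset_size) would not need to change.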

The RuntimeError: Trying to backward through the graph a second time happens because you keep the hidden state between batches, and you don’t detach or repackage the hidden state in between batches. What happens is that loss.backward() is trying to back-propagate all the way through to the start of time, which works for the first batch but not for the second because the graph for the first batch has been discarded.

There are two possible solutions.

  1. detach/repackage the hidden state in between batches. There are (at least) three ways to do this.

    • self.hidden[0].detach_()
    • self.hidden[0] = self.hidden[0].detach()
    • self.hidden[0] = Variable(self.hidden[0].data, requires_grad=True)

    and similarly for self.hidden[1]

  2. replace loss.backward() with loss.backward(retain_graph=True) but know that each successive batch will take more time than the previous one because it will have to back-propagate all the way through to the start of the first batch.
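
The first option above can be sketched as follows, assuming a recent PyTorch release and a hidden state that is deliberately carried from one batch to the next. The random inputs and targets, the head layer, and nn.CrossEntropyLoss are placeholders chosen for the sketch, not part of the original model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=6, hidden_size=9, num_layers=3)
head = nn.Linear(9, 3)
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_function = nn.CrossEntropyLoss()

# Hidden state that persists across batches: (h_0, c_0).
hidden = (torch.zeros(3, 1, 9), torch.zeros(3, 1, 9))

losses = []
for step in range(3):
    inputs = torch.randn(5, 1, 6)          # placeholder batch
    targets = torch.randint(0, 3, (5,))    # placeholder labels

    # Keep the values but drop the history, so backward() stops here
    # instead of walking into the previous batch's (already freed) graph.
    hidden = tuple(h.detach() for h in hidden)

    optimizer.zero_grad()
    out, hidden = lstm(inputs, hidden)
    loss = loss_function(head(out.view(5, -1)), targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses)
```

Removing the detach line from this loop should reproduce the "backward through the graph a second time" error on the second iteration.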

Thanks, now it works.
I think the topic can be closed.
But I’m just wondering: is it better to create hidden layers with the argument (the first way in my first post) or manually (the second way in my first post)?
What are the pros and cons of each?

Well, if you create them using the argument, then the code for LSTM can efficiently parallelise part of the calculation of the gates that requires the previous hidden state to be multiplied by a weight matrix - this can be done for all layers at once in one parallelised operation. I don’t know how much speedup that produces.

One disadvantage of using the argument is that all hidden layers have to have the same size.

I think that about sums up the pros and cons, so, unless you want hidden layers of varying sizes or stranger model architectures, the argument method seems to be the best and easiest option.

One more question.
When I try to implement the hidden layer manually using “self.hidden[0].detach_()” and the other two methods you suggested, I get “TypeError: ‘tuple’ object does not support item assignment” every time.
I tried it inside the batch loop and inside the forward method. It’s always the same.
How can I handle it?

You only need to use one of the three methods. In other words, if you do

self.hidden[0].detach_()
self.hidden[1].detach_()

then that is enough, you don’t need to use the other two methods as well.

The other two methods will complain about TypeError: ‘tuple’ object does not support item assignment. In that case, the hidden state should be initialised as a list instead of a tuple.

def init_hidden(self):
    return [autograd.Variable(torch.zeros(self.hidden_layers, 1, self.hidden_dim)),
        autograd.Variable(torch.zeros(self.hidden_layers, 1, self.hidden_dim))]
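
As a small illustration of the difference between these methods, here is a sketch assuming a recent PyTorch release. It shows that item assignment on a tuple fails, that rebuilding the whole tuple is an alternative to switching to a list, and that detach() on its own leaves the original tensor attached while detach_() works in place. The variable names are made up for the example.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(6, 9, 3)
_, hidden = lstm(torch.randn(5, 1, 6))  # hidden is a tuple (h_n, c_n)

# Item assignment on a tuple raises TypeError.
try:
    hidden[0] = hidden[0].detach()
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Alternative to a list: build a new tuple of detached tensors.
hidden = tuple(h.detach() for h in hidden)

# detach() returns a new tensor; if the result is thrown away,
# the original tensor is still attached to its graph.
h = torch.ones(2, requires_grad=True) * 2.0  # non-leaf tensor with a grad_fn
h.detach()
still_attached = h.grad_fn is not None

# detach_() modifies the tensor in place.
h.detach_()
detached_in_place = h.grad_fn is None

print(assignment_failed, still_attached, detached_in_place)
```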

Yeah, I mean I tried each of them separately, not simultaneously.
Anyway, when I try to place it in the forward method or in the training loop, I still get “RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.”

Here is how I do it:

class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, hidden_layers, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        self.lstm = nn.LSTM(embedding_dim, hidden_dim, hidden_layers)

        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()
        self.hidden2 = self.init_hidden2()


    def init_hidden(self):
        
        return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim)),
                autograd.Variable(torch.zeros(1, 1, self.hidden_dim)))

    def init_hidden2(self):
        
        return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim)),
                autograd.Variable(torch.zeros(1, 1, self.hidden_dim)))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)

        lstm_out, self.hidden2 = self.lstm(embeds.view(len(lstm_out), 1, -1), self.hidden2)

        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)

        # self.hidden[0].detach()
        # self.hidden[1].detach()

        return tag_scores



model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM,hidden_layers, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
inputs = prepare_sequence(training_data[0][0], word_to_ix)
tag_scores = model(inputs)
print(tag_scores)

for epoch in range(300):
    # model.hidden[0].detach()
    # model.hidden[1].detach()
    for sentence, tags in training_data:
        # model.hidden[0].detach()
        # model.hidden[1].detach()
       
        model.zero_grad()
        model.hidden = model.init_hidden()

        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        tag_scores = model(sentence_in)

        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()
        # model.hidden[0].detach()
        # model.hidden[1].detach()

The different commented-out blocks show the places where I tried it, one at a time, not simultaneously.

You need to do it for hidden2 as well as hidden.

I tried it in the same places.
And I got the same error.

Weird. Can you post your current code?

I found it.
Actually, it was a very silly mistake :)
I typed model.hidden[0].detach() instead of model.hidden[0].detach_().
Now it works, but I’m still confused.
When I print model.modules(), I still can’t see the extra layer:

LSTMTagger(
  (word_embeddings): Embedding(9, 6)
  (lstm): LSTM(6, 9)
  (hidden2tag): Linear(in_features=9, out_features=3)
)
Embedding(9, 6)
LSTM(6, 9)
Linear(in_features=9, out_features=3)

Anyway, you have already helped me a lot.
Thank you very much.
I suppose I’ll start another topic for that issue.

__init__ only adds one LSTM module, and that is what gets printed.

forward() uses that LSTM module twice with different hidden states but the same weights.
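
To make this concrete, here is a short sketch (assuming a recent PyTorch release) comparing a single-layer LSTM with one built using the num_layers argument. The parameters are fixed when the module is constructed, so calling the same module twice in forward() adds nothing to what gets printed or trained. The names single and stacked are illustrative.

```python
import torch.nn as nn

single = nn.LSTM(6, 9)       # one layer
stacked = nn.LSTM(6, 9, 3)   # three layers via the num_layers argument

print(single)
print(stacked)

# Each layer contributes four tensors: weight_ih, weight_hh, bias_ih, bias_hh.
single_names = [name for name, _ in single.named_parameters()]
stacked_names = [name for name, _ in stacked.named_parameters()]
print(single_names)
print(stacked_names)
```

In the printout, extra layers created through the argument show up as a num_layers entry on the single LSTM line, not as additional modules.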

What do you mean by “same weights”? I thought one LSTM net is fed into another LSTM net with the same structure but different weights.
Should I initialize the second one separately in __init__ or leave it as it is?

The weights are initialised when you create the LSTM on the line

self.lstm = nn.LSTM(embedding_dim, hidden_dim, hidden_layers)

However, in forward you use it twice, so it uses the same weights twice.
If you want to use different sets of weights, you need to initialise a second LSTM.

self.lstm2 = nn.LSTM(embedding_dim, hidden_dim, hidden_layers)
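
Here is a minimal sketch of a manually stacked tagger with two separate LSTM modules, assuming a recent PyTorch release. It makes one assumption that differs from the forward() posted above: the second LSTM consumes the output of the first instead of the embeddings again, so its input size is the first LSTM's hidden size. The class name StackedLSTMTagger and the hidden_dim2 parameter are hypothetical, and the hidden states are left at their default of zeros by not passing them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StackedLSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, hidden_dim2, vocab_size, tagset_size):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Two modules, two independent sets of weights.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.lstm2 = nn.LSTM(hidden_dim, hidden_dim2)  # input is lstm's output
        self.hidden2tag = nn.Linear(hidden_dim2, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence).view(len(sentence), 1, -1)
        out1, _ = self.lstm(embeds)   # no hidden state passed: starts from zeros
        out2, _ = self.lstm2(out1)
        tag_space = self.hidden2tag(out2.view(len(sentence), -1))
        return F.log_softmax(tag_space, dim=1)


model = StackedLSTMTagger(6, 9, 4, vocab_size=9, tagset_size=3)
scores = model(torch.tensor([0, 1, 2, 3, 4]))
print(model)         # both lstm and lstm2 appear in the module list
print(scores.shape)
```

Because the two layers are separate modules here, they can have different hidden sizes, which the num_layers argument does not allow.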

Let’s clarify this.
They have the same weights when I initialize it the first time, but the weights become different as training proceeds, right?
Otherwise, I can’t see any use in stacking one network on top of another if their training paths are the same.

self.lstm has only one set of weights, you can check the source code.

Therefore when you reuse self.lstm it has to reuse the same weights for the second layer.

The training paths are not the same, true, but the gradients for the two paths just get added together.
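
A toy illustration of gradients being added together, using a single scalar weight in place of the LSTM weights (a deliberate simplification, not the LSTM itself): the weight w is applied twice in sequence, and its gradient equals the sum of the gradients from each use taken separately.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)  # the shared weight
x = torch.tensor(2.0)

# Apply the same weight twice, like reusing one module in forward().
y = w * (w * x)
y.backward()
total_grad = w.grad.item()

# Contribution of each use, holding the other use constant.
w_const = w.detach()
path_inner = torch.autograd.grad(w_const * (w * x), w)[0].item()
path_outer = torch.autograd.grad(w * (w_const * x), w)[0].item()

print(total_grad, path_inner, path_outer)  # 12.0 6.0 6.0
```

There is still only one value of w after the optimizer step, which is why reusing self.lstm cannot give the second layer weights of its own.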
