About bidirectional GRU with seq2seq example and some modifications

Hi. I’m really new to PyTorch. I was experimenting with code I found here:

http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#sphx-glr-intermediate-seq2seq-translation-tutorial-py

I’m trying to replace the EncoderRNN with a bidirectional version. Here’s my code.

class EncoderBiRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderBiRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.bi_gru = nn.GRU(hidden_size, hidden_size, num_layers=1, batch_first=False,bidirectional=True)
        self.reverse_gru = nn.GRU(hidden_size,hidden_size, num_layers=1,batch_first=False,bidirectional=False)

        self.reverse_gru.weight_ih_l0 = self.bi_gru.weight_ih_l0_reverse
        self.reverse_gru.weight_hh_l0 = self.bi_gru.weight_hh_l0_reverse
        self.reverse_gru.bias_ih_l0 = self.bi_gru.bias_ih_l0_reverse
        self.reverse_gru.bias_hh_l0 = self.bi_gru.bias_hh_l0_reverse

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        #output, hidden = self.gru(output, hidden)

        bi_output, bi_hidden = self.bi_gru(output,hidden)
        reverse_output, reverse_hidden = self.reverse_gru(output,hidden)

        #return output, hidden
        return torch.cat((bi_output,reverse_output)), torch.cat((bi_hidden, reverse_hidden))

    def initHidden(self):
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        if use_cuda:
            return result.cuda()
        else:
            return result

Here’s the error.

Traceback (most recent call last):
  File "pytorch.py", line 744, in <module>
    n.trainIters(None, None, 75000, print_every=n.print_every)
  File "pytorch.py", line 646, in trainIters
    decoder, encoder_optimizer, decoder_optimizer, criterion)
  File "pytorch.py", line 574, in train
    input_variable[ei], encoder_hidden)
  File "/home/dave/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "pytorch.py", line 85, in forward
    bi_output, bi_hidden = self.bi_gru(output,hidden)
  File "/home/dave/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dave/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 190, in forward
    self.check_forward_args(input, hx, batch_sizes)
  File "/home/dave/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 162, in check_forward_args
    check_hidden_size(hidden, expected_hidden_size)
  File "/home/dave/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 154, in check_hidden_size
    raise RuntimeError(msg.format(expected_hidden_size, tuple(hx.size())))
RuntimeError: Expected hidden size (2, 1, 256), got (1, 1, 256)

This is the link to where I read that bidirectional RNNs need to be put together this way.

What I’m looking for is advice on my code, how to write it so that it works.

If you specify bidirectional=True, PyTorch will do the rest. The output will have shape (seq_len, batch, hidden_size * 2), where the hidden_size * 2 features are the forward features concatenated with the backward features.

TL;DR: set bidirectional=True in the first RNN, remove the second RNN, and bi_output is your new output. Also, I’m not sure why you are assigning the GRU weights as model parameters?
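
To make the shapes concrete, here is a minimal sketch of a single bidirectional nn.GRU. The sequence length, batch size, and random stand-in for the embeddings are illustrative assumptions, not values from the tutorial.

```python
import torch
import torch.nn as nn

hidden_size = 256            # same size as in the traceback above
seq_len, batch_size = 5, 1   # illustrative values (assumptions)

# One bidirectional layer replaces the hand-built forward/reverse pair.
gru = nn.GRU(hidden_size, hidden_size, num_layers=1, bidirectional=True)

# Random stand-in for the embedded input: (seq_len, batch, features).
embedded = torch.randn(seq_len, batch_size, hidden_size)
output, hidden = gru(embedded)  # no initial hidden state passed

print(output.shape)  # torch.Size([5, 1, 512]): forward and backward features concatenated
print(hidden.shape)  # torch.Size([2, 1, 256]): one final state per direction
```

Note that only the output doubles on the feature dimension; the hidden state instead gains a second entry on its first dimension.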

Thanks. I was hoping it was something simpler than what I was attempting. Thanks again.

Does the line in initHidden change to result = Variable(torch.zeros(2, 1, self.hidden_size)) or not?

Edit:
Also, how do you pass the hidden state from a bidirectional encoder to a decoder? Let’s say the hidden dimension of the encoder is 256. Then you’d get an output of 512, but wouldn’t the hidden state still be 256? You might pass the output (dim of 512) to a decoder that has a hidden dim of 512, but then what do you do about the hidden state? What is recommended? Is this not an issue? Can you pass half the output to a smaller decoder, or do you pass twice the hidden state to a larger decoder? Am I seeing this wrong?

If you’re going to pass an encoder_hidden to your decoder you don’t even need the initHidden method. Your gru will automatically set the initial hidden state to zero, process the whole sequence and pop out an output and hidden_state.
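
As a quick check of the zero-initialization claim, this sketch compares a call without an initial hidden state against a call with an explicit zero tensor. The sizes are small illustrative assumptions. For a single-layer bidirectional GRU the explicit tensor has to be (2, 1, hidden_size), which is the shape the error message above asks for.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 8  # small illustrative size (assumption)
gru = nn.GRU(hidden_size, hidden_size, bidirectional=True)
inputs = torch.randn(4, 1, hidden_size)  # (seq_len, batch, features)

# Call 1: let the GRU create its own initial hidden state.
out_default, h_default = gru(inputs)

# Call 2: pass explicit zeros shaped (num_layers * num_directions, batch, hidden).
zeros = torch.zeros(2, 1, hidden_size)
out_zeros, h_zeros = gru(inputs, zeros)

same = torch.allclose(out_default, out_zeros) and torch.allclose(h_default, h_zeros)
print(same)  # True
```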

There are a few ways you can pass these to a decoder. The easiest is to merge the forward and backward features with addition rather than concatenation; that way the dimensions stay the same.

 encoder_out = (encoder_out[:, :, :self.hidden_dim] +
                encoder_out[:, :, self.hidden_dim:])
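
A self-contained version of that merge might look like the following. The snippet above lives inside a class (hence self.hidden_dim); here hidden_dim is a plain variable, and the sizes and random input are assumptions for illustration.

```python
import torch
import torch.nn as nn

hidden_dim = 16  # illustrative size (assumption)
encoder = nn.GRU(hidden_dim, hidden_dim, bidirectional=True)
inputs = torch.randn(7, 3, hidden_dim)  # (seq_len, batch, features)

encoder_out, encoder_hidden = encoder(inputs)  # encoder_out: (7, 3, 32)

# The first half of the last dim holds the forward features,
# the second half holds the backward features; add them elementwise.
merged = encoder_out[:, :, :hidden_dim] + encoder_out[:, :, hidden_dim:]

print(merged.shape)  # torch.Size([7, 3, 16])
```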

You should also be able to pass the encoder hidden state to the decoder by passing the encoder’s last-layer state to the first layer of the decoder. Optionally, if your encoder has the same number of layers as the decoder or more, you could take the last n layers, with n being the number of layers in the decoder.

decoder_hidden = encoder_hidden[-decoder.n_layers:] # take what we need from encoder
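
Here is a rough sketch of that slice with a unidirectional 3-layer encoder and a 2-layer decoder RNN. The layer counts and sizes are assumptions, and it uses nn.GRU’s own num_layers attribute in place of the n_layers attribute of the Decoder class above.

```python
import torch
import torch.nn as nn

hidden_dim = 16  # illustrative size (assumption)
encoder = nn.GRU(hidden_dim, hidden_dim, num_layers=3)
decoder_rnn = nn.GRU(hidden_dim, hidden_dim, num_layers=2)

src = torch.randn(7, 1, hidden_dim)   # (seq_len, batch, features)
_, encoder_hidden = encoder(src)      # (3, 1, 16): one state per encoder layer

# Keep only as many trailing layers as the decoder has.
decoder_hidden = encoder_hidden[-decoder_rnn.num_layers:]  # (2, 1, 16)

# The sliced state is accepted as the decoder's initial hidden state.
step_input = torch.randn(1, 1, hidden_dim)
step_out, next_hidden = decoder_rnn(step_input, decoder_hidden)
print(decoder_hidden.shape, step_out.shape)
```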

I’ll shamelessly link you to my own code for details:

OK. Thanks. I’ll try adding the two halves of encoder_out together. Thanks for your reply.

@D_Liebman I was also having trouble understanding the dimensions of the hidden state when I moved my encoder from one direction to bidirectional. It’s the exact problem you had with initHidden. I was uncertain whether I should have result = Variable(torch.zeros(1, 1, self.hidden_size)) or result = Variable(torch.zeros(2, 1, self.hidden_size)). When I tried torch.zeros(2, 1, self.hidden_size), which is what I thought was correct, I got an error saying it can’t convert more than a single value to a Python scalar, so I went back. Not quite sure, sorry, but I’m here with you in being confused lol. Also, I was using it for self.hidden = self.initHidden() and storing the hidden state in the encoder class, so I think I do need that function.

I have another question about bidirectional seq2seq. Can you replace the GRU with an LSTM? LSTM works fine with attention, right? It seems LSTMs get very nice results in papers, and I don’t see a reason not to use an LSTM (but I’m having trouble implementing it).

I’ll keep up with this thread and help out if I figure stuff out.

Once more, I’m not sure I’m in the right place.
Hi again. I’m trying to use your Decoder class with attention. Below is some code and my most recent error message. Can you look at it and give me some feedback? If you want, I’ll file it as an issue on GitHub. Thanks.

def train(self,input_variable, target_variable, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = Variable(torch.zeros(2, 1, self.hidden_size)) 

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_variable.size()[0]
    target_length = target_variable.size()[0]

    encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size  ))
    encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs

    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_variable[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0][0]

    decoder_input = Variable(torch.LongTensor([[SOS_token]]))
    decoder_input = decoder_input.cuda() if use_cuda else decoder_input
    decoder_output = decoder_input

    decoder_hidden = encoder_hidden
    
    encoder_outputs = encoder_outputs.view(1,max_length,self.hidden_size)

    if True:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(decoder_output, encoder_output, decoder_hidden)
            loss += criterion(decoder_output, target_variable[di])
            decoder_input = target_variable[di]  # Teacher forcing

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.data[0] / target_length
    
    
if __name__ == '__main__':
    encoder = EncoderBiRNN(input_lang.n_words, hidden_size)

    decoder = Decoder(output_lang.n_words, hidden_size, hidden_size, 1, dropout=0.1)

    train(input_variable, target_variable, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)

Traceback (most recent call last):
  File "pytorch.py", line 837, in <module>
    n.trainIters(None, None, 75000, print_every=n.print_every, learning_rate=lr)
  File "pytorch.py", line 728, in trainIters
    decoder, encoder_optimizer, decoder_optimizer, criterion)
  File "pytorch.py", line 663, in train
    decoder_output, decoder_hidden, decoder_attention = decoder(decoder_output, encoder_output, decoder_hidden)
  File "/home/dave/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "pytorch.py", line 222, in forward
    decoder_hidden)
  File "/home/dave/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dave/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 190, in forward
    self.check_forward_args(input, hx, batch_sizes)
  File "/home/dave/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 162, in check_forward_args
    check_hidden_size(hidden, expected_hidden_size)
  File "/home/dave/.local/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 154, in check_hidden_size
    raise RuntimeError(msg.format(expected_hidden_size, tuple(hx.size())))
RuntimeError: Expected hidden size (1, 1, 512), got (2, 1, 512)

I’m not sure what I’m doing wrong but the error is with the hidden size and I believe it has to do with the initialized state again.

It’s telling you the problem right here:

RuntimeError: Expected hidden size (1, 1, 512), got (2, 1, 512)

These hidden states have shape (num_layers * num_directions, batch_size, hidden_size), so when you turn on the bidirectional flag it doubles the first dim. You can either just take the last layer (or the last num-decoder-layers entries) on the first dim, like I did in my helper classes

decoder_hidden = encoder_hidden[-decoder.n_layers:]
source: https://github.com/A-Jacobson/minimal-nmt/blob/master/decoding_helpers.py
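
To see what that slice keeps in the single-layer bidirectional case from the traceback, here is a small sketch. The sequence length and the decoder_layers variable are assumptions for illustration. With one layer, index 0 of the hidden state is the forward direction and index 1 is the backward direction, so taking the last entry keeps only the backward state.

```python
import torch
import torch.nn as nn

hidden_size = 512  # matches the shapes in the error message
encoder = nn.GRU(hidden_size, hidden_size, num_layers=1, bidirectional=True)
_, encoder_hidden = encoder(torch.randn(6, 1, hidden_size))
print(encoder_hidden.shape)  # torch.Size([2, 1, 512])

decoder_layers = 1  # hypothetical stand-in for decoder.n_layers
decoder_hidden = encoder_hidden[-decoder_layers:]
print(decoder_hidden.shape)  # torch.Size([1, 1, 512]), the shape the decoder expects

# The kept entry is the backward direction's final state.
keeps_backward_only = torch.equal(decoder_hidden[0], encoder_hidden[1])
print(keeps_backward_only)  # True
```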

Or you could reshape it to (1, 1, 1024) and sum the two halves of the last dimension, like we did with the encoder output.

The second option is probably strictly more correct as you’d get the hidden state for both directions but the first option works fine.
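
A sketch of the second option, assuming a single-layer bidirectional encoder and batch size 1 as in the traceback. The reshape as written only lines the two directions up correctly for batch size 1; summing over the first dimension gives the same result and also works for larger batches.

```python
import torch
import torch.nn as nn

hidden_size = 512  # matches the shapes in the error message
encoder = nn.GRU(hidden_size, hidden_size, bidirectional=True)
_, encoder_hidden = encoder(torch.randn(6, 1, hidden_size))  # (2, 1, 512)

# Put both directions side by side on the last dim (valid for batch size 1).
flat = encoder_hidden.reshape(1, 1, 2 * hidden_size)  # (1, 1, 1024)

# Add the forward half and the backward half.
decoder_hidden = flat[:, :, :hidden_size] + flat[:, :, hidden_size:]  # (1, 1, 512)

# Same result without the reshape, for any batch size.
alt = encoder_hidden.sum(dim=0, keepdim=True)
print(decoder_hidden.shape, torch.allclose(decoder_hidden, alt))
```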

Your model may work without this part, but you shouldn’t have to initialize the encoder hidden state, and you should be able to feed batches of full sequences to the encoder (this will be much, much faster than feeding one item at a time as your code does). Also, the encoder and decoder can share the same optimizer.
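
Both suggestions can be sketched together. Everything here (the vocabulary size, batch of random token ids, bare nn.GRU standing in for the decoder, and the choice of Adam) is an assumption for illustration rather than the tutorial’s setup.

```python
import itertools

import torch
import torch.nn as nn

hidden_size, vocab_size = 32, 100  # illustrative sizes (assumptions)
embedding = nn.Embedding(vocab_size, hidden_size)
encoder = nn.GRU(hidden_size, hidden_size, bidirectional=True)
decoder_rnn = nn.GRU(hidden_size, hidden_size)  # stand-in for a full decoder

# A batch of 4 sequences, each 10 tokens long, shaped (seq_len, batch).
tokens = torch.randint(0, vocab_size, (10, 4))

# One call processes every time step of every sequence; no Python loop,
# and no hand-built initial hidden state.
encoder_out, encoder_hidden = encoder(embedding(tokens))
print(encoder_out.shape)     # torch.Size([10, 4, 64])
print(encoder_hidden.shape)  # torch.Size([2, 4, 32])

# A single optimizer over the parameters of all modules.
params = itertools.chain(
    embedding.parameters(), encoder.parameters(), decoder_rnn.parameters()
)
optimizer = torch.optim.Adam(params, lr=1e-3)
```

With one optimizer, a training step needs only one zero_grad() and one step() call.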

Hi, I’m trying to follow your repository very closely. Still, I get the following error all the time. By the way, your nmt repository is great. If I were to use it in a paper, how would you like me to give you attribution?

Traceback (most recent call last):
  File "pytorch.py", line 930, in <module>
    n.trainIters(None, None, 75000, print_every=n.print_every, learning_rate=lr)
  File "pytorch.py", line 781, in trainIters
    decoder, encoder_optimizer, decoder_optimizer, criterion)
  File "pytorch.py", line 713, in train
    output, decoder_hidden, mask = decoder(output, encoder_output, decoder_hidden)
  File "/home/dave/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "pytorch.py", line 240, in forward
    context, mask = self.attention(decoder_hidden[:-1], encoder_out) # 1, 1, 50 (seq, batch, hidden_dim)
  File "/home/dave/.local/lib/python3.6/site-packages/torch/autograd/variable.py", line 78, in __getitem__
    return Index.apply(self, key)
  File "/home/dave/.local/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 89, in forward
    result = i.index(ctx.index)
ValueError: result of slicing is an empty tensor

I got the code to run by changing decoder_hidden[:-1] to just decoder_hidden but something seems wrong.

It’s hard for me to tell what’s wrong without knowing what decoder_hidden is at that point or the exact architecture you are using but I can guess.

Is it possible that you’re only using one layer in your decoder? If that’s the case, you don’t need to grab the last state for attention; you can use the hidden state as is. In fact, if you try to index like that you will get an empty list, like this:

>>> x = [1] # list of length 1
>>> x[:-1]
[]  # <-- empty list

I’m using n_layers = 2 in that repo, so I do have to grab the last state like that if I want to use the n_layers argument instead of explicitly splitting my decoder RNNs.
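
The same thing can be seen on tensors shaped like a decoder hidden state. This is a sketch with made-up sizes. In recent PyTorch releases the slice does not raise as it did in the traceback above; it returns a tensor with zero entries on the first dim. The [-1:] variant is shown only as an assumption about what “the last layer’s state” would look like for any layer count, not as what the repository does.

```python
import torch

one_layer = torch.zeros(1, 1, 8)  # (n_layers=1, batch, hidden)
two_layer = torch.zeros(2, 1, 8)  # (n_layers=2, batch, hidden)

# Dropping the last entry leaves nothing when there is only one layer.
print(one_layer[:-1].shape)  # torch.Size([0, 1, 8])
print(two_layer[:-1].shape)  # torch.Size([1, 1, 8])

# Taking the last entry is non-empty for both layer counts.
print(one_layer[-1:].shape)  # torch.Size([1, 1, 8])
print(two_layer[-1:].shape)  # torch.Size([1, 1, 8])
```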

As for attribution I’m not entirely sure as I don’t have an academic background. Would it help if I added a DOI? https://guides.github.com/activities/citable-code/#intro

Using two layers in the decoder fixes it. Thanks. Yep, a DOI would be good, but some open source license is what I was thinking about. Thanks for all your time. Finding someone doing this is wonderful.

Hi again. I’m trying out leaving the encoder output concatenated. I commented out the lines that added the two halves of the encoder output together. I find I have a problem with passing the hidden state from the encoder to the decoder. What is a good thing to do here? I am currently taking the hidden state and concatenating it with itself, making an object that is twice the size. The question is, does that pass any meaningful info to the decoder? You’ve been very helpful and everything. I thought I’d ask you.

Hey! @D_Liebman, it’s been a while and I hope you discovered the answer elsewhere, but if you’re going to change the size of the encoder hidden state, you have to change the number of channels the attention and decoder RNN layers expect as well. Additionally, you will have to reshape the encoder hidden state so that the size is doubled only on the channel dimension. I haven’t seen much literature on addition vs. concat, but intuitively, since the gradients can flow through both operations, the info is getting passed either way. Also, empirically I haven’t noticed much difference in my small-scale experiments, so I tend to stick with addition, as it makes the dimensions much cleaner. As a counterpoint, of course, Harvard NLP, https://arxiv.org/pdf/1703.03906.pdf, and most of the Google papers seem to use concat.
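
A minimal sketch of the concat route, assuming a single-layer bidirectional encoder and a bare nn.GRU standing in for the decoder (attention is left out, and all sizes are illustrative). The two directions are concatenated on the channel dimension, and the decoder RNN is built with twice the hidden size to match.

```python
import torch
import torch.nn as nn

hidden_size, batch_size = 256, 3  # illustrative sizes (assumptions)
encoder = nn.GRU(hidden_size, hidden_size, bidirectional=True)
# The decoder's hidden size is doubled to accept the concatenated state.
decoder_rnn = nn.GRU(hidden_size, 2 * hidden_size)

_, encoder_hidden = encoder(torch.randn(6, batch_size, hidden_size))  # (2, 3, 256)

# (directions, batch, hidden) -> (batch, directions, hidden) -> (1, batch, 2 * hidden)
init_hidden = (
    encoder_hidden.transpose(0, 1)
    .reshape(batch_size, 2 * hidden_size)
    .unsqueeze(0)
)

# The same tensor written as an explicit concat of forward and backward states.
alt = torch.cat((encoder_hidden[0], encoder_hidden[1]), dim=-1).unsqueeze(0)

step_input = torch.randn(1, batch_size, hidden_size)
step_out, next_hidden = decoder_rnn(step_input, init_hidden)
print(init_hidden.shape, step_out.shape)
```

The transpose before the reshape matters for batch sizes above 1; reshaping the (2, batch, hidden) tensor directly would mix states from different batch items.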

Thanks. I will go back.

Question. Let’s say I had a 4-layer bidirectional LSTM. What if I wish to add an FC layer in between the RNN layers to perform skip connections (“identity mapping”)? How would we code this?