Seq2Seq RNN Not Working (or Learning?) -- Code Provided


I’m following the seq2seq translation tutorial, but adapting it to train on batches (batch size greater than 1).

Question 1: Is there a reason that my simple Encoder-Decoder (without attention) would keep predicting <GO> as the next word in the sequence? In training, using teacher forcing, my training loss decreases (see below):

Finished epoch 0 -- total_loss 1.41085, epoch time 6.15s
Finished epoch 1 -- total_loss 0.68840, epoch time 6.01s
Finished epoch 2 -- total_loss 0.47079, epoch time 6.00s
Finished epoch 3 -- total_loss 0.33526, epoch time 6.19s
Finished epoch 4 -- total_loss 0.23934, epoch time 6.07s

But even though the training loss is decreasing, when I generate the translations word-by-word, I get this:

--> Generated:  <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO>
--> Actual:  Je suis très content que l'école soit finie.
--> English Input:  I am very glad school is over.

Question 2: If the simple Encoder-Decoder without attention should work, does anyone see any glaring errors in the code below? Note that I have adapted the tutorial code to handle batch_sz greater than 1, as well as sequence_length greater than 1 (in the encoder).

Code for Encoder-Decoder:

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=1):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, input, hidden):
        batch_sz, seq_len = input.size()
        embedded = self.embedding(input)#.view(batch_sz, seq_len, -1)
        output = embedded
        for i in range(self.n_layers):
            output, hidden = self.gru(output, hidden)
        return output, hidden

    def init_hidden(self, batch_size):
        # variable of size [num_layers*num_directions, b_sz, hidden_sz]
        return Variable(torch.zeros(1, batch_size, self.hidden_size)).cuda()


class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, n_layers=1):
        super(DecoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)
        # -- normalize over the vocabulary axis of the [b_sz, output_size] logits
        self.softmax = nn.LogSoftmax(dim=1)

    def init_hidden(self, batch_size):
        # variable of size [num_layers*num_directions, b_sz, hidden_sz]
        return Variable(torch.zeros(1, batch_size, self.hidden_size)).cuda()

    def forward(self, input, hidden):
        batch_sz, seq_len = input.size()
        if seq_len != 1:
            raise Exception('IN DECODER: Sequence length is not 1...')
        output = self.embedding(input).view(batch_sz, seq_len, -1)
        for i in range(self.n_layers):
            output = F.relu(output)
            output, hidden = self.gru(output, hidden)
        # -- output of GRU is [b_sz, seq_len, hidden_sz]
        # -- can resize to [b_sz, hidden_sz] if seq_len == 1
        if (seq_len == 1):
            output = output.view(batch_sz, -1)
        output = self.softmax(self.out(output))
        return output, hidden

encoder = EncoderRNN(len(input_vocab), 128).cuda()
decoder = DecoderRNN(128, len(output_vocab)).cuda()

encoder_optim = torch.optim.Adam(encoder.parameters(), lr=0.001)
decoder_optim = torch.optim.Adam(decoder.parameters(), lr=0.001)

loss_function = nn.NLLLoss().cuda()

Training loop code:

losses = []
iters = 0
print_every = 20
old_val_loss = None

for epoch in range(20):
    start_epoch = timer()
    total_loss = torch.Tensor([0]).cuda()
    for idx, batch in enumerate(train_batches):
        if len(batch) == 0: continue
        # -- get the x_data, y_data from the english-french sequences
        # -- returns x_data.size() = [batch_sz, english_seq_len]
        # -- returns y_data.size() = [batch_sz, french_seq_len]
        x_data, y_data = get_sequence_data(batch, train=True)
        # -- initialize hidden layers
        hidden = encoder.init_hidden(x_data.size()[0])

        # -- zero gradients
        loss = 0
        # -- forward propagation for the whole sequence
        # -- return the output as [batch_sz, seq_len, hidden_sz]
        # -- note the encoder output is not used without attn
        # output, hidden = encoder(x_data, hidden)
        # -- try to forward propagate the sequence word by word...
        for word in range(x_data.size()[1]):
            output, hidden = encoder(x_data[:,word].unsqueeze(1), hidden)
        # -- loop for the decoder
        for word in range(y_data.size()[1]):
            # -- forward prop
            output, hidden = decoder(y_data[:,word].unsqueeze(1), hidden)
            # -- calculate the loss from each word in the sequence
            loss += loss_function(output, y_data[:,word])
        # -- calculate loss, backpropagation, update gradients/weights
        total_loss +=

I originally tried to train both the encoder and the decoder on the entire sequence at once, but when that didn’t work I switched over to word-by-word. That still doesn’t work, though. Note that get_sequence_data() returns a tuple (x_data, y_data), where x_data is the batch of source text (i.e., English phrases), manually padded by appending <PAD> to the end of each sequence, and y_data is the batch of manually padded translations, with <GO> and <END> inserted at the beginning and end of each sequence.

Side Note: I realize that manually padding the sequences rather than using pack_padded_sequence() hurts my loss and learning a bit, because I’m accumulating loss on <PAD> elements, but I think the network should still learn at least something despite this. It shouldn’t predict <GO> for every word. In fact, it should never predict <GO>.
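For what it’s worth, one way to keep <PAD> positions out of the loss without packing is the ignore_index argument of nn.NLLLoss. The snippet below is only a sketch on random data: the PAD index of 0 and the tensor shapes are assumptions for illustration, not values from my vocabulary.

```python
import torch
import torch.nn as nn

PAD = 0  # assumption: index of <PAD> in the output vocabulary

# Targets equal to PAD contribute nothing to the loss or the gradient.
loss_function = nn.NLLLoss(ignore_index=PAD)

torch.manual_seed(0)
# Fake decoder output for one time step: [batch_sz=3, vocab=5] log-probabilities.
log_probs = torch.log_softmax(torch.randn(3, 5), dim=1)
# The second element of the batch has already run out of real words.
targets = torch.tensor([2, PAD, 4])

masked_loss = loss_function(log_probs, targets)

# With the default mean reduction, the average is taken over the
# non-ignored targets only (two of them here).
expected = -(log_probs[0, 2] + log_probs[2, 4]) / 2
print(masked_loss.item(), expected.item())
```

One caveat, as far as I know: if every target in a step is <PAD>, the mean reduction has nothing to average over and returns nan, so with a per-step loop it may be safer to use reduction='sum' and divide by the number of real tokens yourself.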

Any ideas?

I’m not seeing how to delete this post, but I was able to figure out the mistake. It was a small bug: when calculating the loss, I was comparing the decoder output to the current word instead of the next word. So when feeding in <GO>, I was accumulating loss whenever the network didn’t predict <GO>, essentially teaching it to echo the word I feed it, because that’s what minimizes the loss. That also explains the generated output: at generation time the first input is <GO>, the network echoes <GO>, and that prediction is fed back in as the next input.
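In case it helps anyone who lands here with the same symptom, this is roughly what the corrected decoder loop looks like: feed the word at position t and score the prediction against the word at position t + 1. It is a CPU-only sketch, not my exact training code. TinyDecoder, sequence_loss, and the token ids are hypothetical stand-ins so the snippet runs on its own.

```python
import torch
import torch.nn as nn

# Assumed toy vocabulary ids, for illustration only.
PAD, GO, END = 0, 1, 2
vocab_size, hidden_size = 6, 8


class TinyDecoder(nn.Module):
    """Hypothetical minimal decoder with the same call shape as DecoderRNN."""

    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        # input: [batch_sz, 1] -> GRU output: [batch_sz, 1, hidden_sz]
        output, hidden = self.gru(self.embedding(input), hidden)
        # Drop the length-1 time axis -> [batch_sz, vocab_size] log-probs.
        return self.log_softmax(self.out(output.squeeze(1))), hidden


def sequence_loss(decoder, hidden, y_data, loss_function):
    """Teacher forcing: input is word t, target is word t + 1."""
    loss = 0
    # Stop one step early: the last word has no successor to predict.
    for t in range(y_data.size(1) - 1):
        output, hidden = decoder(y_data[:, t].unsqueeze(1), hidden)
        loss = loss + loss_function(output, y_data[:, t + 1])
    return loss


torch.manual_seed(0)
decoder = TinyDecoder(hidden_size, vocab_size)
# Two padded target sequences: <GO> w w <END> and <GO> w <END> <PAD>.
y_data = torch.tensor([[GO, 3, 4, END], [GO, 5, END, PAD]])
# Stand-in for the encoder's final hidden state.
hidden = torch.zeros(1, y_data.size(0), hidden_size)

loss = sequence_loss(decoder, hidden, y_data, nn.NLLLoss())
loss.backward()
print(loss.item())
```

With this shift, <GO> only ever appears as an input and never as a target, so the network has no incentive to predict it.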