Overview
I’m following the seq2seq translation tutorial, but trying to make it work with batches (batch sizes greater than 1).
Question 1: Is there a reason why my simple Encoder-Decoder (without attention) would keep predicting <GO>
as the next word in the sequence? During training with teacher forcing, the training loss decreases (see below):
Finished epoch 0 -- total_loss 1.41085, epoch time 6.15s
Finished epoch 1 -- total_loss 0.68840, epoch time 6.01s
Finished epoch 2 -- total_loss 0.47079, epoch time 6.00s
Finished epoch 3 -- total_loss 0.33526, epoch time 6.19s
Finished epoch 4 -- total_loss 0.23934, epoch time 6.07s
But even though the training loss is decreasing, when I generate the translations word-by-word, I get this:
--> Generated: <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO> <GO>
--> Actual: Je suis très content que l'école soit finie.
--> English Input: I am very glad school is over.
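For context, the word-by-word generation is basically greedy decoding, roughly along these lines (a simplified sketch, not my exact code; GO_IDX, MAX_LEN, and output_itos are placeholders for my <GO> index, length cap, and index-to-word lookup):

# run the encoder over the source sentence (batch of 1)
hidden = encoder.init_hidden(1)
for word in range(x_data.size()[1]):
    _, hidden = encoder(x_data[:, word].unsqueeze(1), hidden)

# greedy decoding: start from <GO>, feed the argmax back in at each step
decoder_input = Variable(torch.LongTensor([[GO_IDX]])).cuda()
generated = []
for step in range(MAX_LEN):
    output, hidden = decoder(decoder_input, hidden)
    top_idx = output.data.topk(1)[1][0][0]   # index of the most likely next word
    generated.append(output_itos[top_idx])
    if output_itos[top_idx] == '<END>':
        break
    decoder_input = Variable(torch.LongTensor([[top_idx]])).cuda()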
Question 2: If the simple Encoder-Decoder without attention should work, does anyone see any glaring errors in the code below? Note that I have adapted the tutorial code to handle batch_sz greater than 1, as well as sequence_length greater than 1 (in the encoder).
Code for Encoder-Decoder
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=1):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, input, hidden):
        batch_sz, seq_len = input.size()
        embedded = self.embedding(input)  #.view(batch_sz, seq_len, -1)
        output = embedded
        for i in range(self.n_layers):
            output, hidden = self.gru(output, hidden)
        return output, hidden

    def init_hidden(self, batch_size):
        # variable of size [num_layers*num_directions, b_sz, hidden_sz]
        return Variable(torch.zeros(1, batch_size, self.hidden_size)).cuda()
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, n_layers=1):
        super(DecoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax()

    def init_hidden(self, batch_size):
        # variable of size [num_layers*num_directions, b_sz, hidden_sz]
        return Variable(torch.zeros(1, batch_size, self.hidden_size)).cuda()

    def forward(self, input, hidden):
        batch_sz, seq_len = input.size()
        if seq_len != 1:
            raise Exception('IN DECODER: Sequence length is not 1...')
        output = self.embedding(input).view(batch_sz, seq_len, -1)
        for i in range(self.n_layers):
            output = F.relu(output)
            output, hidden = self.gru(output, hidden)
        # -- output of GRU is [b_sz, seq_len, hidden_sz]
        # -- can resize to [b_sz, hidden_sz] if seq_len == 1
        if (seq_len == 1):
            output = output.view(batch_sz, -1)
        output = self.softmax(self.out(output))
        return output, hidden
encoder = EncoderRNN(len(input_vocab), 128).cuda()
decoder = DecoderRNN(128, len(output_vocab)).cuda()
encoder_optim = torch.optim.Adam(encoder.parameters(), lr=0.001)
decoder_optim = torch.optim.Adam(decoder.parameters(), lr=0.001)
loss_function = nn.NLLLoss().cuda()
Training loop code:
losses = []
iters = 0
print_every = 20
old_val_loss = None

for epoch in range(20):
    start_epoch = timer()
    total_loss = torch.Tensor([0]).cuda()
    for idx, batch in enumerate(train_batches):
        if len(batch) == 0: continue

        # -- get the x_data, y_data from the english-french sequences
        # -- returns x_data.size() = [batch_sz, english_seq_len]
        # -- returns y_data.size() = [batch_sz, french_seq_len]
        x_data, y_data = get_sequence_data(batch, train=True)

        # -- initialize hidden layers
        hidden = encoder.init_hidden(x_data.size()[0])

        # -- zero gradients
        encoder.zero_grad()
        decoder.zero_grad()

        loss = 0

        # -- forward propagation for the whole sequence
        # -- return the output as [batch_sz, seq_len, hidden_sz]
        # -- note the encoder output is not used without attn
        # output, hidden = encoder(x_data, hidden)

        # -- try to forward propagate the sequence word by word...
        for word in range(x_data.size()[1]):
            output, hidden = encoder(x_data[:,word].unsqueeze(1), hidden)

        # -- loop for the decoder
        for word in range(y_data.size()[1]):
            # -- forward prop
            output, hidden = decoder(y_data[:,word].unsqueeze(1), hidden)
            # -- calculate the loss from each word in the sequence
            loss += loss_function(output, y_data[:,word])

        # -- calculate loss, backpropagation, update gradients/weights
        loss.backward()
        encoder_optim.step()
        decoder_optim.step()
        total_loss += loss.data
    losses.append(total_loss[0])
I originally tried to train both the encoder and the decoder with the entire sequence at once, but when that didn’t work I switched over to word-by-word. That still doesn’t work, though. Note that get_sequence_data() returns a tuple (x_data, y_data), where x_data is the batch of source text (i.e., English phrases), manually padded by adding <PAD> at the end of each sequence, and y_data is the batch of manually padded translations with <GO> and <END> inserted at the beginning and end of each element in the sequence.
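For concreteness, a batch looks roughly like this (a simplified sketch with made-up token indices, not my actual preprocessing code):

# hypothetical mini-batch of 2 sentence pairs, shown as token indices
# English side, padded on the right with <PAD> (index 0 in this sketch)
x_data = Variable(torch.LongTensor([
    [ 5, 12,  7,  0,  0],   # "i am glad <PAD> <PAD>"
    [ 5,  9, 22, 31,  4],   # "i like the school ."
])).cuda()                   # size [batch_sz=2, english_seq_len=5]

# French side, wrapped in <GO> ... <END> and padded with <PAD>
y_data = Variable(torch.LongTensor([
    [ 1, 14,  8, 19,  2,  0],   # "<GO> je suis content <END> <PAD>"
    [ 1, 14, 25, 30, 11,  2],   # "<GO> j'aime l'école . <END>"
])).cuda()                       # size [batch_sz=2, french_seq_len=6]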
Side Note: I realize that manually padding the elements in the sequence, rather than using the pack_padded_sequence() routine, hurts my loss and learning a bit, because I’m accumulating loss on <PAD> elements. But I think the network should still learn at least something despite this. It shouldn’t predict <GO> for every word; in fact, it should never predict <GO>.
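(I know I could also suppress the <PAD> loss without packing, e.g. by passing ignore_index to the loss; something like the line below, where PAD_IDX stands in for the index of <PAD> in my output vocabulary.)

# positions whose target is <PAD> would then contribute nothing to the loss
loss_function = nn.NLLLoss(ignore_index=PAD_IDX).cuda()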
Any ideas?