Your learning rate might be too high, causing the model to catapult itself out of a good region of parameter space.
Could you give some information on your training procedure?
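To illustrate the learning-rate concern, here is a minimal sketch on a toy problem: plain gradient descent on f(x) = x². The function `gradient_descent` and the values 0.1 and 1.1 are hypothetical and chosen only for illustration; they are not taken from the model discussed in this thread.

```python
def gradient_descent(lr, x0=1.0, steps=20):
    """Run plain gradient descent on f(x) = x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x      # derivative of x**2
        x = x - lr * grad   # each step multiplies x by (1 - 2 * lr)
    return x


small = gradient_descent(lr=0.1)  # |1 - 2*lr| < 1, so x shrinks toward the minimum at 0
large = gradient_descent(lr=1.1)  # |1 - 2*lr| > 1, so x grows in magnitude every step
print(abs(small), abs(large))
```

With the larger rate every update overshoots the minimum by more than the previous distance to it, which is the "catapult" behaviour described above. A real network is not a quadratic, so this only shows the mechanism, not a threshold you could apply directly.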
I used SGD with momentum as the optimizer for both the encoder and the decoder:
encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate, momentum=.9)
decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate, momentum=.9)
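One thing worth keeping in mind with `momentum=.9`: under a steady gradient, the momentum buffer accumulates, so the step actually applied grows to roughly `lr / (1 - momentum)` times the gradient, about ten times what `lr` alone suggests. The sketch below reproduces the update rule of `optim.SGD` with default dampening (`buf = momentum * buf + grad`, then `param -= lr * buf`) in plain Python. The helper `momentum_step_sizes` and the value `lr=0.01` are assumptions for illustration, since the thread does not state `learning_rate`.

```python
def momentum_step_sizes(lr, momentum, grad=1.0, steps=200):
    """Return the size of each parameter update under a constant gradient."""
    buf = 0.0
    sizes = []
    for _ in range(steps):
        buf = momentum * buf + grad  # momentum buffer accumulates past gradients
        sizes.append(lr * buf)       # the update actually subtracted from the parameter
    return sizes


sizes = momentum_step_sizes(lr=0.01, momentum=0.9)  # lr=0.01 is a hypothetical value
print(sizes[0], sizes[-1])  # first step equals lr * grad; later steps approach lr * grad / (1 - momentum)
```

So a learning rate that is fine for plain SGD can be too aggressive once momentum is added, which may be relevant here.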
I used NLLLoss as the loss function:
criterion = nn.NLLLoss()
Training then proceeds as follows:
for epoch …:
    for eachsample …:
        loss = train(source, target, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
for ei in range(input_length):
    encoder_output, encoder_hidden = encoder(input_variable[ei], encoder_hidden)
    encoder_outputs[ei] = encoder_output[0][0]
decoder_hidden = encoder_hidden  # the last encoder hidden state is used as the initial decoder hidden state
use_teacher_forcing = random.random() < teacher_forcing_ratio
if use_teacher_forcing:
    # Teacher forcing: feed the ground-truth target as the next input
    for di in range(target_length):
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
        loss += criterion(decoder_output, target_variable[di])
        decoder_input = target_variable[di]  # Teacher forcing
else:
    # No teacher forcing: feed the decoder's own top prediction as the next input
    for di in range(target_length):
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
        topv, topi = decoder_output.data.topk(1)
        ni = topi[0][0]
        decoder_input = Variable(torch.LongTensor([[ni]]))
        decoder_input = decoder_input.cuda() if use_cuda else decoder_input
        loss += criterion(decoder_output, target_variable[di])
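To make the two branches above easier to compare, here is a small pure-Python sketch of which token the decoder receives at each step. Everything in it is hypothetical: `decoder_inputs_fed`, the toy `step_fn` standing in for the decoder, and the integer tokens. It only mirrors the control flow of the snippet, where the teacher-forcing decision is drawn once per sequence rather than once per step.

```python
import random


def decoder_inputs_fed(targets, step_fn, sos_token, teacher_forcing_ratio, rng=random):
    """Return (use_teacher_forcing, list of inputs given to the decoder)."""
    # One random draw per sequence, as in the training code above
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    decoder_input = sos_token
    fed = []
    for target in targets:
        fed.append(decoder_input)
        prediction = step_fn(decoder_input)  # stand-in for decoder(...) followed by topk(1)
        # Next input is the ground truth (teacher forcing) or the model's own prediction
        decoder_input = target if use_teacher_forcing else prediction
    return use_teacher_forcing, fed


toy_step = lambda token: (token + 1) % 10  # hypothetical "decoder": predicts token + 1
print(decoder_inputs_fed([5, 6, 7], toy_step, sos_token=0, teacher_forcing_ratio=1.0))
print(decoder_inputs_fed([5, 6, 7], toy_step, sos_token=0, teacher_forcing_ratio=0.0))
```

With a ratio of 1.0 the decoder always sees the targets shifted by one step; with 0.0 it only ever sees its own predictions, so early mistakes carry forward through the rest of the sequence.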