Hello. I was going through the official PyTorch tutorial on implementing a seq2seq chatbot with attention. To gain a better understanding, I tried different methods of achieving similar results. One change that I thought would be an improvement was using a plain nn.NLLLoss with ignore_index=PAD_token. The original solution, which builds a binary mask tensor describing the padding of the target tensor, seemed too complicated, and I wanted something simpler — but as it turned out, adding a few lines of code made everything worse.

`criterion = nn.NLLLoss(ignore_index=PAD_token, reduction='mean')`
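For reference, this is how I understand ignore_index to behave, checked on a toy batch (PAD_token = 0 and the random values here are just for illustration; per the docs, NLLLoss takes log-probabilities of shape (batch_size, vocab_size) and targets of shape (batch_size)):

```python
import torch
import torch.nn as nn

PAD_token = 0  # assumed padding index, as in the tutorial's vocabulary

criterion = nn.NLLLoss(ignore_index=PAD_token, reduction='mean')

# Toy batch: 3 target positions over a 5-word vocabulary.
# NLLLoss expects LOG-probabilities, hence log_softmax here.
log_probs = torch.log_softmax(torch.randn(3, 5), dim=1)
targets = torch.tensor([2, 4, PAD_token])  # last position is padding

loss_with_pad = criterion(log_probs, targets)

# With ignore_index, the mean is taken over the two non-pad positions only:
manual = -(log_probs[0, 2] + log_probs[1, 4]) / 2
```

If I understand correctly, loss_with_pad and manual should be identical, i.e. the padded position contributes nothing to the loss or the gradient.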

In the definition of the train function, I only made a small change in the loop where we forward propagate through the decoder one step at a time:

```
if use_teacher_forcing:
    for t in range(max_target_len):
        decoder_output, decoder_hidden = decoder(
            decoder_input, decoder_hidden, encoder_outputs
        )
        # Teacher forcing: next input is current target
        decoder_input = target_variable[t].view(1, -1)
        # Calculate and accumulate loss
        loss = criterion(decoder_output, target_variable[t])
        total_loss += loss
        print_losses.append(loss.item())

# Perform backpropagation
total_loss.backward()

# Clip gradients: gradients are modified in place
_ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
_ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

# Adjust model weights
encoder_optimizer.step()
decoder_optimizer.step()

return sum(print_losses) / max_target_len
```

Output:

Iteration: 1; Percent complete: 0.0%; Average loss: 8.5246

Iteration: 2; Percent complete: 0.1%; Average loss: 8.5403

Iteration: 3; Percent complete: 0.1%; Average loss: 8.5663

Iteration: 4; Percent complete: 0.1%; Average loss: 8.5691

Iteration: 5; Percent complete: 0.1%; Average loss: 8.4637

Iteration: 6; Percent complete: 0.1%; Average loss: 8.5378

Iteration: 7; Percent complete: 0.2%; Average loss: 8.5575

Iteration: 8; Percent complete: 0.2%; Average loss: 8.5145

Iteration: 9; Percent complete: 0.2%; Average loss: 8.5717

Iteration: 10; Percent complete: 0.2%; Average loss: 8.5122

Iteration: 11; Percent complete: 0.3%; Average loss: 8.5587

Iteration: 12; Percent complete: 0.3%; Average loss: 8.5491

Iteration: 13; Percent complete: 0.3%; Average loss: 8.5933

Iteration: 14; Percent complete: 0.4%; Average loss: 8.5231

Iteration: 15; Percent complete: 0.4%; Average loss: 8.5239

… and so on

The total loss increases or fluctuates around the same value, but it doesn't want to decrease. Has anybody tried using a loss function with ignore_index in a seq2seq model with variable-length sequences and had positive results? nn.CrossEntropyLoss behaves similarly.

I would like to emphasize that the tensors passed to the loss function have the following dimensions:

```
loss = criterion(decoder_output, target_variable[t])
# decoder_output:     (batch_size, vocab_size)
# target_variable[t]: (batch_size)
```

Maybe the problem is there? Are these the proper dimensions, or should I reshape these tensors? In the original code, the negative log is calculated manually from "gathered" tensors with final shape (batch_size):

```
crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
```
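To illustrate what that line computes, here is a small standalone snippet I put together (the random probabilities and a vocab size of 6 are just assumptions for the example): the gather picks out the probability of each target token, and the result matches per-element NLL computed on the log of those probabilities — which matches the documented fact that nn.NLLLoss wants log-probabilities as input.

```python
import torch
import torch.nn.functional as F

# inp: (batch_size, vocab_size) of probabilities (the tutorial's decoder
# ends in a plain softmax, not log_softmax)
inp = torch.softmax(torch.randn(4, 6), dim=1)
target = torch.tensor([1, 3, 0, 5])

# The tutorial's manual negative log-likelihood:
crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))

# Per-element NLL on log-probabilities gives the same numbers:
same = F.nll_loss(inp.log(), target, reduction='none')
```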

I tried changing the learning rate, but it didn't help at all.

One additional question. In my version I used:

`packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths, batch_first=True, enforce_sorted=False)`, so I didn't sort the batch in descending order by sequence length. I noticed that my code runs slower than the original. Could that be caused by the unsorted batch, or should it work fine anyway?
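For completeness, a minimal sketch of what I mean (the dimensions are made up for the example: batch of 2, max length 5, feature size 8, hidden size 16):

```python
import torch
import torch.nn as nn

embedded = torch.randn(2, 5, 8)       # (batch, max_len, features), batch_first
input_lengths = torch.tensor([3, 5])  # deliberately NOT sorted descending

# enforce_sorted=False lets pack_padded_sequence sort the batch internally
packed = nn.utils.rnn.pack_padded_sequence(
    embedded, input_lengths, batch_first=True, enforce_sorted=False
)
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
packed_out, hidden = gru(packed)

# Unpacking restores the original (unsorted) batch order
out, lens = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)
```

This runs without errors for me, so the question is only about performance, not correctness.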

Any suggestions would be greatly appreciated.

Best regards.