Hello. I was going through the official PyTorch tutorial on implementing a seq2seq chatbot with attention. To gain a better understanding, I tried different methods of achieving similar results. One change that I thought would be an improvement was using a plain nn.NLLLoss with ignore_index=PAD_token. The original solution, which builds a binary mask tensor describing the padding of the target tensor, seemed too complicated, and I wanted something simpler — but as it turned out, adding a few lines of code made everything worse.

`criterion = nn.NLLLoss(ignore_index=PAD_token, reduction='mean')`
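For reference, this is how I understand ignore_index to behave, checked on a toy batch (PAD_token = 0 and the random values here are just for illustration; per the docs, NLLLoss takes log-probabilities of shape (batch_size, vocab_size) and targets of shape (batch_size)):

```python
import torch
import torch.nn as nn

PAD_token = 0  # assumed padding index, as in the tutorial's vocabulary

criterion = nn.NLLLoss(ignore_index=PAD_token, reduction='mean')

# Toy batch: 3 target positions over a 5-word vocabulary.
# NLLLoss expects LOG-probabilities, hence log_softmax here.
log_probs = torch.log_softmax(torch.randn(3, 5), dim=1)
targets = torch.tensor([2, 4, PAD_token])  # last position is padding

loss_with_pad = criterion(log_probs, targets)

# With ignore_index, the mean is taken over the two non-pad positions only:
manual = -(log_probs[0, 2] + log_probs[1, 4]) / 2
```

If I understand correctly, loss_with_pad and manual should be identical, i.e. the padded position contributes nothing to the loss or the gradient.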

In the definition of the train function, I only made a small change in the loop where we forward propagate through the decoder one step at a time:

```
if use_teacher_forcing:
    for t in range(max_target_len):
        decoder_output, decoder_hidden = decoder(
            decoder_input, decoder_hidden, encoder_outputs
        )
        # Teacher forcing: next input is current target
        decoder_input = target_variable[t].view(1, -1)
        # Calculate and accumulate loss
        loss = criterion(decoder_output, target_variable[t])
        total_loss += loss
        print_losses.append(loss.item())

# Perform backpropagation
total_loss.backward()

# Clip gradients: gradients are modified in place
_ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
_ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

# Adjust model weights
encoder_optimizer.step()
decoder_optimizer.step()

return sum(print_losses) / max_target_len
```

Output:

Iteration: 1; Percent complete: 0.0%; Average loss: 8.5246

Iteration: 2; Percent complete: 0.1%; Average loss: 8.5403

Iteration: 3; Percent complete: 0.1%; Average loss: 8.5663

Iteration: 4; Percent complete: 0.1%; Average loss: 8.5691

Iteration: 5; Percent complete: 0.1%; Average loss: 8.4637

Iteration: 6; Percent complete: 0.1%; Average loss: 8.5378

Iteration: 7; Percent complete: 0.2%; Average loss: 8.5575

Iteration: 8; Percent complete: 0.2%; Average loss: 8.5145

Iteration: 9; Percent complete: 0.2%; Average loss: 8.5717

Iteration: 10; Percent complete: 0.2%; Average loss: 8.5122

Iteration: 11; Percent complete: 0.3%; Average loss: 8.5587

Iteration: 12; Percent complete: 0.3%; Average loss: 8.5491

Iteration: 13; Percent complete: 0.3%; Average loss: 8.5933

Iteration: 14; Percent complete: 0.4%; Average loss: 8.5231

Iteration: 15; Percent complete: 0.4%; Average loss: 8.5239

… and so on

The total loss increases or fluctuates around the same value, but it doesn't want to decrease. Has anybody tried using a loss function with ignore_index in a seq2seq model with variable-length sequences and had positive results? nn.CrossEntropyLoss behaves similarly.

I would like to emphasize that the tensors passed to the loss function have the following dimensions:

```
loss = criterion(decoder_output, target_variable[t])
# decoder_output:     (batch_size, vocab_size)
# target_variable[t]: (batch_size)
```

Maybe the problem is there? Are these the proper dimensions, or should I reshape these tensors? In the original code, the negative log is calculated manually from "gathered" tensors with final shape (batch_size):

```
crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))
```
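To illustrate what that line computes, here is a small standalone snippet I put together (the random probabilities and a vocab size of 6 are just assumptions for the example): the gather picks out the probability of each target token, and the result matches per-element NLL computed on the log of those probabilities — which matches the documented fact that nn.NLLLoss wants log-probabilities as input.

```python
import torch
import torch.nn.functional as F

# inp: (batch_size, vocab_size) of probabilities (the tutorial's decoder
# ends in a plain softmax, not log_softmax)
inp = torch.softmax(torch.randn(4, 6), dim=1)
target = torch.tensor([1, 3, 0, 5])

# The tutorial's manual negative log-likelihood:
crossEntropy = -torch.log(torch.gather(inp, 1, target.view(-1, 1)).squeeze(1))

# Per-element NLL on log-probabilities gives the same numbers:
same = F.nll_loss(inp.log(), target, reduction='none')
```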

I tried changing the learning rate, but it didn't help at all.

One additional question. In my version I used:

`packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths, batch_first=True, enforce_sorted=False)`, so I didn't sort the batch in descending order by sequence length. I noticed that my code runs slower than the original. Could that be caused by the unsorted batch, or should it work fine anyway?
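For completeness, a minimal sketch of what I mean (the dimensions are made up for the example: batch of 2, max length 5, feature size 8, hidden size 16):

```python
import torch
import torch.nn as nn

embedded = torch.randn(2, 5, 8)       # (batch, max_len, features), batch_first
input_lengths = torch.tensor([3, 5])  # deliberately NOT sorted descending

# enforce_sorted=False lets pack_padded_sequence sort the batch internally
packed = nn.utils.rnn.pack_padded_sequence(
    embedded, input_lengths, batch_first=True, enforce_sorted=False
)
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
packed_out, hidden = gru(packed)

# Unpacking restores the original (unsorted) batch order
out, lens = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)
```

This runs without errors for me, so the question is only about performance, not correctness.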

Any suggestions would be greatly appreciated.

Best regards.